Corpora: Re morphological/PoS ambiguity

Eckhard Bick (lineb@hum.aau.dk)
Wed, 11 Feb 1998 19:32:29 +0100

Yet another remark on Portuguese morphology. Since it is not the
language most widely discussed here, I find the courage to add one more
short remark. Apologies for the picture; I tried to make it small.

I fully agree with Diana Santos that any tag set is ultimately
dependent on the notational/theoretical view taken. Portmanteau tags do
make sense where no or very few lexical items make a certain distinction
morphologically visible. A good (though only Brazilian) Portuguese
example is the lack of tense distinction between present tense and past
tense (perfeito simples) in the first person plural of -ar verbs. Here,
even syntactic context will often be of no great help, since the
distinction has to be based on contextualized discourse analysis,
difficult to achieve with, say, a sentence window, as used in most
automatic parsers. The same holds for the use of the subjunctive with
"imperative function" (i.e. 'venda', base form *'vender'*), but I would
not say the same about 'venda', base form *'vendar'* (the form found by
my parser), since the second person a) is the natural candidate for
imperative function, and b) has a distinct plural form ('vendai'). Below
is the relevant Constraint Grammar input cohort (without valency or
semantic distinctions):

venda
[venda] N F S
[vender] V PR 1/3S SUBJ
[vendar] V IMP 2S
[vendar] V PR 3S IND
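
To make the cohort notation concrete, here is a minimal Python sketch of how such a cohort could be read into a data structure and then reduced by one context rule. The names `Reading`, `parse_cohort` and `select_noun_after_determiner` are hypothetical, and the rule is only a toy in the spirit of Constraint Grammar; it is neither real CG rule syntax nor one of the parser's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str    # base form, e.g. "vender"
    tags: tuple   # morphological tags, e.g. ("V", "PR", "1/3S", "SUBJ")


def parse_cohort(text):
    """Split a cohort into its word form and its list of readings."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    word_form, reading_lines = lines[0], lines[1:]
    readings = []
    for ln in reading_lines:
        # "[vender] V PR 1/3S SUBJ" -> lemma "vender", tags after the bracket
        lemma_part, _, tag_part = ln.partition("]")
        readings.append(Reading(lemma_part.lstrip("["), tuple(tag_part.split())))
    return word_form, readings


def select_noun_after_determiner(readings, previous_pos):
    """Toy rule (an assumption): after a determiner keep only noun readings."""
    if previous_pos != "DET":
        return readings
    nouns = [r for r in readings if r.tags[0] == "N"]
    return nouns or readings  # never discard the last remaining readings


COHORT = """
venda
[venda] N F S
[vender] V PR 1/3S SUBJ
[vendar] V IMP 2S
[vendar] V PR 3S IND
"""

word_form, readings = parse_cohort(COHORT)
remaining = select_noun_after_determiner(readings, previous_pos="DET")
print(word_form, len(readings), remaining)
```

In a context like 'a venda', where the preceding word is a determiner, the toy rule leaves only the noun reading; without such context all four readings survive.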

As to whether tagging implies shallow syntax and can go beyond mere
lexical information: yes, definitely. The Constraint Grammar approach
of flat dependency syntax shows that even full syntactic function can be
added as tags to word forms, using the same rule-based technique as for
the disambiguation of lexico-morphological information. The borderline
between tagging (word-form-based PoS and morphology) and parsing
(constituent-based syntactic analysis) is very soft indeed. Thus, CG's
flat "tagged" syntactic analysis can automatically be transformed into
traditional constituent trees, even on running text (parser-generated
tree-picture attached).

O [o] <art> DET M S @>N 'den'
governo [governo] <HH> N M S @SUBJ> 'regering'
facilitou [facilitar] <vt> <vH> <fmc> V PS 3S IND VFIN @FMV 'lette-1'
a [a] <art> DET F S @>N 'den'
venda [venda] <CP> <inst> <hus> N F S @<ACC 'salg'
de [de] PRP @N< 'af'
empresas [empresa] N F P @P< 'industriforetagende'

(The government facilitated the sale of companies.)
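
As a rough illustration of how flat function tags can be turned into constituents, the following Python sketch attaches each word of the sentence above to a head and prints a bracketing. The attachment heuristic is an assumption made for this example only: the arrow in a tag gives the direction of the head, @>N and @N< look for the nearest noun, @P< for the nearest preposition, and clause-level tags such as @SUBJ> and @<ACC for the nearest verb. The post does not describe the actual conversion procedure, and `find_head` and `bracket` are hypothetical names.

```python
# (word form, PoS, syntactic tag), copied from the analysis above
TOKENS = [
    ("O", "DET", "@>N"),
    ("governo", "N", "@SUBJ>"),
    ("facilitou", "V", "@FMV"),
    ("a", "DET", "@>N"),
    ("venda", "N", "@<ACC"),
    ("de", "PRP", "@N<"),
    ("empresas", "N", "@P<"),
]

# Head class named in a tag -> PoS to search for (assumed mapping)
HEAD_POS = {"N": "N", "P": "PRP"}


def find_head(i, tokens):
    """Return the index of the head of token i, or None for the root."""
    label = tokens[i][2].lstrip("@")
    if label.startswith(">"):        # @>N: modifier of a noun to the right
        step, wanted = 1, HEAD_POS[label[1:]]
    elif label.endswith("<"):        # @N<, @P<: dependent of a head to the left
        step, wanted = -1, HEAD_POS[label[:-1]]
    elif label.endswith(">"):        # @SUBJ>: argument of a verb to the right
        step, wanted = 1, "V"
    elif label.startswith("<"):      # @<ACC: argument of a verb to the left
        step, wanted = -1, "V"
    else:                            # @FMV: no arrow, treated as the root
        return None
    j = i + step
    while 0 <= j < len(tokens):
        if tokens[j][1] == wanted:
            return j
        j += step
    raise ValueError(f"no head found for {tokens[i][0]}")


def bracket(h, tokens, heads):
    """Bracket the head h together with its dependents in surface order."""
    children = [i for i, head in enumerate(heads) if head == h]
    if not children:
        return tokens[h][0]
    parts = [tokens[i][0] if i == h else bracket(i, tokens, heads)
             for i in sorted(children + [h])]
    return "[" + " ".join(parts) + "]"


heads = [find_head(i, TOKENS) for i in range(len(TOKENS))]
root = heads.index(None)
tree = bracket(root, TOKENS, heads)
print(tree)  # [[O governo] facilitou [a venda [de empresas]]]
```

The sketch only handles the handful of tags in this one sentence and assumes that dependents never cross their head's boundaries; a real conversion would need to cover the full tag set and subclauses.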

A last theoretical remark: If the morphological and the syntactic levels
are really to be kept apart, then a purely morphological definition of
word classes would help to make things "purer" from a linguistic point
of view. For Portuguese, for example, nouns might be defined as words
with gender as a lexeme category (well, mostly) and number as a word
form category, whereas adjectives have both as word form categories, and
proper nouns would constitute their own class with both gender and
number being lexeme categories; (cardinal) numerals, finally, have (in
Portuguese) gender as a word form category and number as a lexeme
category. That is, the four combinatorial possibilities for two
categories yield four distinct morphological word classes. Many
traditional word classes
are really syntactic, as when one word form gets readings as
adverb, conjunction or preposition, according to its function in the
sentence. Not to speak of "adjectival"/"substantival" determiner
pronouns or "adjectival"/"verbal" participles.
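
The four-way definition above can be written out as a small lookup. The Python sketch below treats a category as a word form category when its value varies across the inflected forms of a lexeme, and as a lexeme category when it is constant. The example paradigms ('empresa', 'bonito', 'Lisboa', 'dois/duas') are illustrations chosen here, not taken from the post, and the names `Level`, `level_of` and `classify` are hypothetical. The "well, mostly" caveat applies: the sketch ignores exceptions and defective paradigms.

```python
from enum import Enum


class Level(Enum):
    LEXEME = "lexeme category"        # constant across all forms of the lexeme
    WORD_FORM = "word form category"  # varies across the forms of the lexeme


# (gender level, number level) -> morphological word class
WORD_CLASSES = {
    (Level.LEXEME, Level.WORD_FORM): "noun",
    (Level.WORD_FORM, Level.WORD_FORM): "adjective",
    (Level.LEXEME, Level.LEXEME): "proper noun",
    (Level.WORD_FORM, Level.LEXEME): "cardinal numeral",
}


def level_of(values):
    """A category that takes more than one value is a word form category."""
    return Level.WORD_FORM if len(set(values)) > 1 else Level.LEXEME


def classify(paradigm):
    """paradigm: list of (gender, number) pairs, one per inflected form."""
    genders, numbers = zip(*paradigm)
    return WORD_CLASSES[(level_of(genders), level_of(numbers))]


# Illustrative paradigms (assumed examples)
EXAMPLES = {
    "empresa": [("F", "S"), ("F", "P")],
    "bonito": [("M", "S"), ("F", "S"), ("M", "P"), ("F", "P")],
    "Lisboa": [("F", "S")],
    "dois": [("M", "P"), ("F", "P")],
}

for lemma, paradigm in EXAMPLES.items():
    print(lemma, "->", classify(paradigm))
```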

Eckhard

--
Eckhard Bick, cand.med., cand.mag.
web (Portuguese grammar): http://visl.hum.ou.dk/Linguistics.html
work: Dpt. of Linguistics, Århus University, Tel. +45 89422131
home: Rugbjergvej 98, DK-8260 Viby J, Tel. +45 86283524, Fax. 1397