Corpora: Case/number distribution

Sean Boisen (sboisen@bbn.com)
Tue, 1 Dec 1998 15:29:47 -0500

I'm looking for references to work on the distribution of forms across
inflectional categories in languages with case systems. For example, Modern
Greek (according to Joseph, in _The World's Major Languages_, ed. by Comrie
1990) has four cases and two numbers, meaning a given noun could occur in as
many as eight different forms. The actual number of forms possible varies
according to the declension: masculine o-stem nouns have seven distinct forms
(the nominative and vocative plurals are the same), while the other
declensions apparently have only four. Of course, if there are Greek corpora
tagged with a part-of-speech inventory that distinguishes case and number,
all eight possibilities could be distinguished.

I presume (without any real evidence) that words in normal usage are not
evenly distributed across these cases: for example, I'd assume the vocative
singular is much less frequent, at least in news text, and the vocative
plural very rare indeed. I presume the nominative case would be the most
frequent, but if so, how much more frequent than the accusative or genitive?
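
Given a corpus tagged for case and number, the tally itself would be simple.
Here is a minimal Python sketch of what I have in mind. Everything about the
input is an assumption for illustration: the tag format "N-CASE-NUMBER", the
function name, and the toy sample are hypothetical, and the numbers it prints
say nothing about any real language.

```python
from collections import Counter


def case_number_distribution(tagged_tokens):
    """Return the relative frequency of each (case, number) pair among nouns.

    tagged_tokens is an iterable of (word, tag) pairs. Noun tags are ASSUMED
    to look like "N-NOM-SG"; any other tag is ignored.
    """
    counts = Counter()
    for _word, tag in tagged_tokens:
        parts = tag.split("-")
        if len(parts) == 3 and parts[0] == "N":
            counts[(parts[1], parts[2])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Toy input only: placeholder tokens, not real corpus data.
sample = [
    ("w1", "N-NOM-SG"),
    ("w2", "N-ACC-SG"),
    ("w3", "V"),
    ("w4", "N-NOM-SG"),
    ("w5", "N-GEN-PL"),
]

dist = case_number_distribution(sample)
# List the case/number pairs from most to least frequent.
for (case, number), share in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{case} {number}: {share:.2f}")
```

Run over a real tagged corpus, the resulting shares would answer the "how
much more frequent" question directly, per text type.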

If you have references, unpublished findings, or even informed speculations
about the distributional facts for Greek/Russian/whatever case language
you've got, I'd appreciate hearing them.

Sean Boisen
Senior Scientist, BBN Technologies
sboisen@bbn.com