Re: Corpora: Size of a representative corpus

Markus Schulze (max@linguistik.uni-erlangen.de)
Thu, 20 Aug 1998 16:52:51 +0200 (METDST)

Jon Mills writes:
>Tony Berber Sardinha writes:
>> 'A representative corpus should include the majority of the types
>> in the language as recorded in a comprehensive dictionary.
>> Thus:
>> (a) assuming that a dictionary entry is analogous to a type;
>> (b) dictionary x is comprehensive
>> (c) dictionary x has 100,000 entries
>> (d) a majority is 1/2 + 1
>> A representative corpus would need to have as many tokens
>> as necessary to include 50,001 types.'
>
>A dictionary entry more usually relates to a lexeme and
>a lexeme may be realised by a number of types. One also
>has to consider how the dictionary that you are using
>treats derivatives (as run-ons or as separate entries).
>There is also a sort of circularity in the notion of
>"comprehensive dictionary". Isn't a "comprehensive
>dictionary" one that includes entries for the majority
>of lexical items found in the corpus?
>

Furthermore, the notion of "representativeness" of a corpus should
also cover the frequency of lexemes (or even of free morphemes, in
order to handle derivatives and compounds properly). If frequency is
disregarded, you might as well just take the comprehensive dictionary
itself as your corpus.
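
To make the point concrete, here is a rough Python sketch on toy data.
It is only an illustration under simplifying assumptions: every
whitespace-separated token counts as a type, every dictionary entry is
a single word form (which glosses over the lexeme/type distinction Jon
raises), and all names (tokens_needed_for_types, type_coverage,
relative_frequencies) are made up for this example. It compares a
small running text with a bare word list that contains each dictionary
entry exactly once.

```python
from collections import Counter


def tokens_needed_for_types(tokens, dictionary, target):
    """Number of tokens read until `target` distinct dictionary
    entries have been seen; None if the text never gets there."""
    seen = set()
    for position, token in enumerate(tokens, start=1):
        if token in dictionary:
            seen.add(token)
            if len(seen) >= target:
                return position
    return None


def type_coverage(tokens, dictionary):
    """Share of dictionary entries that occur at least once."""
    return len(set(tokens) & dictionary) / len(dictionary)


def relative_frequencies(tokens):
    """Relative frequency of every type in the token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy "comprehensive dictionary" and toy corpus (both invented).
dictionary = {"the", "cat", "sat", "on", "mat", "dog"}
corpus = "the cat sat on the mat the dog sat".split()

# The dictionary itself, read as a text: every entry exactly once.
word_list = sorted(dictionary)

# "Majority" in the sense of the quoted proposal: 1/2 + 1.
majority = len(dictionary) // 2 + 1

print(tokens_needed_for_types(corpus, dictionary, majority))
print(type_coverage(corpus, dictionary), type_coverage(word_list, dictionary))
print(relative_frequencies(corpus)["the"], relative_frequencies(word_list)["the"])
```

On this toy data both the running text and the bare word list cover
all six entries, so a criterion based on type coverage alone cannot
tell them apart. Only the frequency profile differs: "the" accounts
for 3 of the 9 tokens in the text, but for 1 of 6 in the word list.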

-------------- Abteilung für Computerlinguistik --------------
Markus Schulze
Bismarckstr. 6 fon: +49-9131-85-9252
91054 Erlangen fax: +49-9131-85-9251
----- www: http://uranus.linguistik.uni-erlangen.de/~max/ ----