Re: Corpora: Size of a representative corpus

Pascual Cantos Gomez (pcantos@fcu.um.es)
Fri, 21 Aug 1998 22:08:00 +0200

At 19:39 19/08/98 -0300, you wrote:
>Hi,
>
>The question of how large (in tokens) a representative corpus
>must be came up in our classes and one of the possibilities
>we came up with would be to think about this issue as follows:
>
>'A representative corpus should include the majority of the types
>in the language as recorded in a comprehensive dictionary.
>Thus:
>(a) assuming that a dictionary entry is analogous to a type;
>(b) dictionary x is comprehensive
>(c) dictionary x has 100,000 entries
>(d) a majority is 1/2 + 1
>A representative corpus would need to have as many tokens
>as necessary to include 50,001 types.'

It would be useful to start by distinguishing between lemma/lexeme, type
and token. Consider the following word sequence: "plays, playing, played,
play, plays, play, playing, played and played". Leaving aside the
connective "and", we have nine words (tokens), four word forms (types:
"plays", "playing", "played", "play") and one lemma, namely "play". See
Sinclair's nice explanation in:
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford
University Press (sorry, I can't give you the page number right now).
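
The counts for the example sequence above can be checked with a few lines
of Python. This is only a minimal sketch: the lemma table LEMMA_OF and the
function count_units are made up for this illustration, and a real study
would use a proper lemmatiser rather than a hand-written lookup.

```python
# The example sequence from the paragraph above (the connective "and" is
# left out, as in the counts given there).
words = ["plays", "playing", "played", "play", "plays",
         "play", "playing", "played", "played"]

# Hypothetical hand-made lemma table, valid for this example only.
LEMMA_OF = {"plays": "play", "playing": "play",
            "played": "play", "play": "play"}


def count_units(tokens, lemma_of):
    """Return (number of tokens, number of types, number of lemmas)."""
    types = set(tokens)                      # distinct word forms
    lemmas = {lemma_of[t] for t in types}    # distinct lemmas behind them
    return len(tokens), len(types), len(lemmas)


n_tokens, n_types, n_lemmas = count_units(words, LEMMA_OF)
print(n_tokens, n_types, n_lemmas)  # 9 4 1
```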

>Since there are no references to this hypothesis in the literature
>(or is there?) we would like to know people's reactions to it:
>Would this be a proper criterion? What are the possible
>flaws in the argument?

Read Biber's 1993 article on this issue:
Biber, D. 1993. "Representativeness in Corpus Design." Literary and
Linguistic Computing 8 (4): 243-257.

>Also, how could we estimate the number of tokens needed
>to make up for 50,001 types?

There is a transitive relationship between lemmas, types and tokens that
can be modelled mathematically; at least, this held in the research we
carried out for Spanish and English. Our analytic technique for predicting
types and lemmas is simple and straightforward, and the resulting formulae
are easy to use, flexible and quick to apply to any corpus or language
sample (at least for Spanish and English). You can find the formulae and
the discussion in our recently published article:

Sánchez, A. and P. Cantos (1997) "Predictability of Word Forms (Types) and
Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the
CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish".
International Journal of Corpus Linguistics 2(2): 259-280.
(See abstract http://solaris3.ids-mannheim.de/~ijcl/ijcl-2-2.html).
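
As a rough empirical starting point, one can simply read through a sample
and record how many tokens go by before a given number of types has been
seen. The sketch below does only that; it is NOT the predictive formulae
from the article, and the function name tokens_needed and the toy sample
sentence are invented for this illustration.

```python
def tokens_needed(token_stream, target_types):
    """Return how many tokens must be read before `target_types` distinct
    types have been seen, or None if the stream ends first."""
    seen = set()
    for position, token in enumerate(token_stream, start=1):
        seen.add(token)
        if len(seen) >= target_types:
            return position
    return None


# Toy sample, invented for illustration; a real test would use a corpus.
sample = "the cat sat on the mat and the dog sat on the cat".split()

print(tokens_needed(sample, 5))  # 6: the fifth type, "mat", is token no. 6
print(tokens_needed(sample, 8))  # None: the sample has only 7 types
```

Direct counting of this kind only works up to the number of types the
sample actually contains; going beyond that requires a growth model fitted
to the observed curve.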

In addition, there is a forthcoming article (in Spanish) in which we
compare English and Spanish with regard to the growth and predictability
of types and lemmas (if you are interested, I can send you a copy).

Sánchez, A. and P. Cantos (forthcoming) "El ritmo incremental de palabras
nuevas en los repertorios de textos. Estudio experimental y comparativo
basado en dos corpus lingüísticos equivalentes de cuatro millones de
palabras, de las lenguas inglesa y española y en cinco autores de ambas
lenguas". ATLANTIS (Artículo Monográfico), vol. XIX (2).

Best regards

Pascual

___________________________________________________

Dr. Pascual Cantos Gómez

Departamento de Filología Inglesa
Universidad de Murcia
C./ Santo Cristo, 1
30071 Murcia - SPAIN

Tel: 968 364365 - +34 968 364365
Fax: 968 363185 - +34 968 363185
E-mail: pcantos@fcu.um.es