Corpora: A little late: Size of representative corpus

Iain Downs (idowns@dircon.co.uk)
Thu, 27 Aug 1998 09:19:18 +0100

A little late, I'm afraid, but another perspective on a 'representative
corpus'. This time from the perspective of learning English.

I should say that I have NO formal knowledge of this subject so I look
forward to corrections!

I assume that a child has learnt English to tolerable competence by the age
of 5.

I assume that that child has been exposed to 100 words a minute, 10 hours a
day and 350 days a year.

That is (ONLY!!) 21 million words a year, or about 105 million words by the
age of 5.

I also assume that a 15 year old has acquired a high level of competency
with some specialisation, and that the child's exposure rate has gone up to
200 words a minute for 14 hours a day (this allows for more chatting, TV
and books, and is perhaps an underestimate).

Over the ten years from 5 to 15, this adds around 590 million words to our
corpus!
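
For anyone who wants to check or vary the arithmetic, here is a rough sketch
in Python. The function name is made up for illustration, and I am assuming
the second stage runs for the ten years from age 5 to age 15 at a constant
rate, which is certainly a simplification.

```python
def words_heard(words_per_minute, hours_per_day, days_per_year, years):
    """Total words of exposure, assuming a constant rate (a simplification)."""
    return words_per_minute * 60 * hours_per_day * days_per_year * years

# Stage 1: birth to age 5, at 100 words a minute for 10 hours a day.
first_year = words_heard(100, 10, 350, 1)
by_age_5 = words_heard(100, 10, 350, 5)

# Stage 2: age 5 to 15 (assumed to be 10 years),
# at 200 words a minute for 14 hours a day.
age_5_to_15 = words_heard(200, 14, 350, 10)

print(f"one year of stage 1: {first_year:,}")    # 21,000,000
print(f"by age 5:            {by_age_5:,}")      # 105,000,000
print(f"added from 5 to 15:  {age_5_to_15:,}")   # 588,000,000
print(f"total by age 15:     {by_age_5 + age_5_to_15:,}")  # 693,000,000
```

Changing any one of the guesses (say 300 days a year, or 8 hours a day)
shifts the totals in proportion, so the order of magnitude is fairly robust.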

These figures, however, do not allow for the 'external' stimuli in learning
(Oh THAT'S a 'mummy'), nor for the role of prosody and the like. They also
assume no external expertise (lexicographers and corpus linguists!).

However, perhaps this sort of analysis can at least put some bounds on the
necessary size of a corpus to be ABLE to learn a language.

Any thoughts?

Iain