Corpora: Elementary vocabulary sizes

Alice Carlberger (alice@speech.kth.se)
Mon, 17 Nov 1997 16:14:44 +0100

Dear Corpora Listers,

A while back I posted the following query:

>Could anyone tell me what is considered to be the number of words a
>speaker of a new language needs to know to "get by" in that language?
>Or how many words a person needs to be able to read a newspaper? I'm
>mostly interested in Swedish.

I also posted a summary of responses on 21 October 1997. Now, here is additional
useful information from Richard Piepenbrock, which I would like to share.

Best regards,

Alice Carlberger

At 14:53 1997-11-13 +0100, you wrote:
>Dear Ms. Carlberger,
>
>This reply to your query about elementary vocabulary size is rather
>late in coming, but I thought it would be interesting to post
>nonetheless.
>
>In the Netherlands and Flanders (Dutch-speaking part of Belgium), an
>elementary word list of 890 lemmata (i.e. mainly dictionary entries,
>including complex and some compound words, but no inflected forms) is
>considered the norm for passing the exam for the official Certificate
>of Dutch as a Foreign Language at the lowest level. At the
>intermediate level of this certificate, which is called 'basic', about
>2000 lemmata are considered sufficient. This encompasses the elementary
>list, and has been published for reference as the 'Basic Dutch
>Dictionary' (Basiswoordenboek Nederlands). The authors of this volume
>state that this size is largely equivalent to international standards.
>And indeed, although I don't know the standards for the Certificate of
>English exams, the Longman Dictionary of Contemporary English, a
>learner's dictionary, makes use of a defining vocabulary of the same
>size, to make its definitions as transparent to learners as possible.
>
>The norm for including a lemma in the elementary list is a frequency
>threshold of >= 30 per 100,000 tokens of running corpus text. The
>weight attached to the relatively small spoken part of the corpus was
>increased in calculating these numbers in order to increase the
>representation of everyday speech. For the basic (= intermediate)
>level, the threshold is lowered to >= 5 per 100,000 tokens, with the
>added criterion of a dispersion of more than 1, i.e. the lemma
>should occur in more than one subpart of the complete corpus.
>Understandably, some items were added manually at a later stage, e.g.
>to complete the set of numbers, days of the week and months.
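
[Illustrative note, not part of Richard Piepenbrock's message: the two
selection rules described above could be sketched roughly as follows.
This is a minimal sketch under stated assumptions, not the procedure
actually used for the Dutch lists. The function names, the toy corpus
and the way subparts are represented are all hypothetical, and the extra
weighting of the spoken subcorpus is not modelled, because the message
does not say how it was computed.]

```python
from collections import Counter

def rate_per_100k(count, total_tokens):
    """Frequency of a lemma per 100,000 tokens of running text."""
    return count * 100_000 / total_tokens

def lemma_stats(subcorpora):
    """Return (total count per lemma, dispersion per lemma, token total).

    `subcorpora` maps a subpart name to a list of lemmatised tokens
    (a hypothetical representation chosen for this sketch). Dispersion
    is the number of subparts in which a lemma occurs at least once.
    """
    totals, dispersion, n_tokens = Counter(), Counter(), 0
    for tokens in subcorpora.values():
        counts = Counter(tokens)
        totals.update(counts)             # add up occurrences
        dispersion.update(counts.keys())  # +1 per subpart containing the lemma
        n_tokens += len(tokens)
    return totals, dispersion, n_tokens

def elementary_list(subcorpora, min_rate=30):
    """Lemmata with a frequency of at least 30 per 100,000 tokens."""
    totals, _, n = lemma_stats(subcorpora)
    return {l for l, c in totals.items() if rate_per_100k(c, n) >= min_rate}

def basic_list(subcorpora, min_rate=5, min_subparts=2):
    """Elementary list plus lemmata with a frequency of at least 5 per
    100,000 tokens that occur in more than one subpart."""
    totals, dispersion, n = lemma_stats(subcorpora)
    extra = {l for l, c in totals.items()
             if rate_per_100k(c, n) >= min_rate
             and dispersion[l] >= min_subparts}
    return elementary_list(subcorpora) | extra

# Toy corpus of exactly 100,000 tokens (invented data, for illustration).
toy = {
    "written": ["huis"] * 40 + ["fiets"] * 4 + ["zeldzaam"] * 6
               + ["de"] * 59_950,
    "spoken": ["huis"] * 5 + ["fiets"] * 3 + ["de"] * 39_992,
}
elementary = elementary_list(toy)
basic = basic_list(toy)
```

[In this toy corpus "zeldzaam" reaches 6 per 100,000 tokens but occurs
in only one subpart, so it passes the frequency threshold for the basic
level and still fails the dispersion criterion. Manually added items
such as numbers, days of the week and months would be merged in
afterwards.]
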
>
>From what I have read, it is generally acknowledged that frequency
>thresholds should not be taken as the sole criterion for inclusion.
>Word familiarity and imageability ratings should also come into play
>here. As the literature I know of is either rather dated or in Dutch,
>I'm afraid I cannot give you any specific literature references.
>
>Regards,
>
>Richard Piepenbrock
>CELEX - The Centre for Lexical Information
>Max Planck Institute for Psycholinguistics
>Nijmegen, the Netherlands

----------------------------------------------------------------------------
Alice Carlberger E-mail: alice@speech.kth.se
KTH (Royal Institute of Technology) Phone: +46 8 790 75 62
TMH (Dept. of Speech, Music and Hearing) Fax: +46 8 790 78 54
Drottning Kristinas väg 31
S-100 44 Stockholm
Sweden
---------------------------------------------------------------------------