Corpora: Using corpus data to establish learner vocabularies

Nick Youd (nick@logcam.co.uk)
Tue, 18 Nov 1997 12:01:15 +0000 (GMT)

I quite like the idea of using corpus frequency as a criterion in
establishing word lists for language learners. Does anyone know which
other languages this has been done for?
Best wishes
Nick

> ----------
> From: Alice Carlberger <alice@speech.kth.se>
> Sent: 17 November 1997 15:29
> To: corpora@hd.uib.no
> Subject: Corpora: Elementary vocabulary sizes
>
> Dear Corpora Listers,
>
> A while back I posted the following query:
>
> >Could anyone tell me what is considered to be the number of words a
> >speaker of a new language needs to know to "get by" in that language?
> >Or how many words a person needs to be able to read a newspaper? I'm
> >mostly interested in Swedish.
>
> I also posted a summary of responses 10/21/97. Now, here is additional
> useful information from Richard Piepenbrock, which I would like to
> share.
>
> Best regards,
>
> Alice Carlberger
>
> At 14:53 1997-11-13 +0100, you wrote:
> >Dear Ms. Carlberger,
> >
> >This reply to your query about elementary vocabulary size is rather
> >late in coming, but I thought it would be interesting to post
> >nonetheless.
> >
> >In the Netherlands and Flanders (the Dutch-speaking part of Belgium), an
> >elementary word list of 890 lemmata (i.e. mainly dictionary entries,
> >including complex and some compound words, but no inflected forms) is
> >considered the norm for passing the exam for the official Certificate
> >of Dutch as a Foreign Language at the lowest level. At the
> >intermediate level of this certificate, which is called 'basic', about
> >2000 lemmata are considered sufficient. This list encompasses the
> >elementary list, and has been published for reference as the 'Basic
> >Dutch Dictionary' (Basiswoordenboek Nederlands). The authors of this
> >volume state that this size is largely equivalent to international
> >standards. And indeed, although I don't know the standards for the
> >Certificate of English exams, the Longman Dictionary of Contemporary
> >English, a learner's dictionary, makes use of a defining vocabulary of
> >the same size, to make its definitions as transparent to learners as
> >possible.
> >
> >The norm for including a lemma in the elementary list is a frequency
> >threshold of >= 30 per 100,000 tokens of running corpus text. The
> >weight attached to the relatively small spoken part of the corpus was
> >increased in calculating these numbers in order to increase the
> >representation of everyday speech. For the basic (= intermediate)
> >level, the threshold is lowered to >= 5 per 100,000 tokens, but then
> >with the added criterion of a dispersion of more than 1, i.e. the lemma
> >should occur in more than one subpart of the complete corpus.
> >Understandably, some items were added manually at a later stage, e.g.
> >to complete the set of numbers, days of the week and months.
> >
> >From what I have read, it is generally acknowledged that frequency
> >thresholds should not be taken as the sole criterion for inclusion;
> >word familiarity and imageability ratings should also come into play
> >here. As the literature I know of is either rather dated or in Dutch,
> >I'm afraid I cannot give you any specific literature references.
> >
> >Regards,
> >
> >Richard Piepenbrock
> >CELEX - The Centre for Lexical Information
> >Max Planck Institute for Psycholinguistics
> >Nijmegen, the Netherlands
>
> ----------------------------------------------------------------------------
> Alice Carlberger                           E-mail: alice@speech.kth.se
> KTH (Royal Institute of Technology)        Phone: +46 8 790 75 62
> TMH (Dept. of Speech, Music and Hearing)   Fax: +46 8 790 78 54
> Drottning Kristinas väg 31
> S-100 44 Stockholm
> Sweden
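
As a rough illustration of the selection rule described in Richard
Piepenbrock's reply above (a frequency threshold per 100,000 tokens,
an optional minimum number of corpus subparts, and extra weight for the
spoken part), here is a minimal Python sketch. It is not the procedure
actually used for the Dutch lists: the function name, the parameter
names, the data layout (one lemma-count table per subcorpus) and the
way the weighting is applied are all assumptions made for illustration,
since the reply does not say how the spoken part was weighted.

```python
from collections import Counter


def select_lemmata(subcorpora, min_per_100k, min_subcorpora=1, weights=None):
    """Return the set of lemmata that pass a frequency and dispersion test.

    subcorpora     -- dict mapping a subcorpus name to a Counter of lemma counts
    min_per_100k   -- minimum (weighted) frequency per 100,000 tokens
    min_subcorpora -- minimum number of subcorpora the lemma must occur in
    weights        -- optional dict of subcorpus name -> weight (default 1.0);
                      this weighting scheme is a guess, not the documented one
    """
    weights = weights or {}

    # Weighted size of the whole corpus, in tokens.
    weighted_total = sum(
        weights.get(name, 1.0) * sum(counts.values())
        for name, counts in subcorpora.items()
    )

    # Every lemma seen in at least one subcorpus.
    lemmata = set().union(*subcorpora.values())

    selected = set()
    for lemma in lemmata:
        weighted_count = sum(
            weights.get(name, 1.0) * counts.get(lemma, 0)
            for name, counts in subcorpora.items()
        )
        # Dispersion here is simply the number of subcorpora containing the lemma.
        dispersion = sum(
            1 for counts in subcorpora.values() if counts.get(lemma, 0) > 0
        )
        per_100k = 100_000 * weighted_count / weighted_total
        if per_100k >= min_per_100k and dispersion >= min_subcorpora:
            selected.add(lemma)
    return selected
```

Under these assumptions, the elementary list would correspond to
select_lemmata(corpus, 30) and the basic list to
select_lemmata(corpus, 5, min_subcorpora=2), with manual additions
(numbers, days of the week, months) made afterwards.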