[Corpora-List] Corpus size for lexicography

From: Ramesh Krishnamurthy (ramesh@easynet.co.uk)
Date: Tue Oct 01 2002 - 01:32:07 MET DST

    Dear Robert Amsler

    I am concerned that your statements regarding
    corpus sizes for lexicographic purposes might be
    *highly* misleading, at least for English:
    > 1 million words we now know to be quite small
    > (adequate only for a Pocket Dictionary worth of entries).
    > Collegiate dictionaries require at least a 10 million word corpus, and
    > Unabridged dictionaries at least 100 million words (the target of the ANC).

    1. From my experience while working for Cobuild at Birmingham University:

    a) approx. half of the types/wordforms in most corpora have only one token (i.e. occur only once):
    e.g. 213,684 out of 475,633 in the 121m corpus (1993); 438,647 out of 938,914 in the 418m corpus (2000).

    b) dictionary entries cannot be based on one example, so let us say you need at least 10 examples
    (a very modest figure; in fact, as our corpus has grown and our software and understanding have
    become more sophisticated, the minimum threshold has risen for some linguistic phenomena, as
    we find that we often require many more examples before particular features/patterns even
    become apparent, or before certain statistics become reliable)

    c) many types with 10+ tokens will not be included in most dictionaries (e.g. numerical entities,
    proper names, etc; some may be included in the dictionary, e.g. 24-7, the White House, etc,
    depending on editorial policy; the placement problem for numerical entities is a separate issue)

    d) there are roughly 2.2 types per lemma (roughly equal to a dictionary headword) in English
    (the lemma "be" has c. 18 types, including some archaic ones and contractions; most verbs have
    4 or 5 types; at the other end of the scale, many uncount nouns and adjectives, most adverbs and
    grammatical words, have only one type); of course, some types, although they belong to a lemma,
    will need to be treated as headwords in their own right, for sound lexicographic reasons.
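
    (To make (a)-(d) concrete, here is a rough sketch, in Python, of how such counts could be
    obtained from a plain-text corpus. It is purely illustrative, not Cobuild's actual software;
    it assumes crude whitespace tokenization, lowercasing, a corpus file given on the command
    line, and the figures of a 10-token minimum and c. 2.2 types per lemma discussed above.)

    from collections import Counter
    import sys

    counts = Counter()
    with open(sys.argv[1], encoding="utf-8") as corpus:    # path to a plain-text corpus file
        for line in corpus:
            counts.update(line.lower().split())            # crude tokenization, for illustration only

    types = len(counts)                                    # distinct wordforms
    hapaxes = sum(1 for c in counts.values() if c == 1)    # types occurring only once
    frequent = sum(1 for c in counts.values() if c >= 10)  # types with 10+ tokens

    print(f"types (wordforms):      {types:,}")
    print(f"hapaxes:                {hapaxes:,} ({hapaxes / types:.0%} of types)")
    print(f"types with 10+ tokens:  {frequent:,}")
    # dividing by c. 2.2 types per lemma gives a rough headword estimate
    print(f"potential headwords:    {frequent / 2.2:,.0f}")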

    2. Calculating potential dictionary headwords from corpus facts and figures:

    a) In the 18m Cobuild corpus (1986), there were 43,579 types with 10+ tokens.
    Dividing by 2.2, we get c. 19,800 lemmas with 10+ tokens, i.e. potential dictionary headwords

    b) In the 120m Cobuild Bank of English corpus (1993), there were
    99,326 types with 10+ tokens = c. 45,150 headwords

    c) In the 450m Bank of English corpus (2001), there were
    204,626 types with 10+ tokens = c. 93,000 headwords
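
    (The arithmetic above can be checked in a couple of lines; the type counts are those given
    in (a)-(c), the 2.2 ratio is from 1(d), and the rounding to whole headwords is mine:)

    types_10_plus = {"18m (1986)": 43_579, "120m (1993)": 99_326, "450m (2001)": 204_626}
    for corpus, types in types_10_plus.items():
        # types with 10+ tokens divided by c. 2.2 types per lemma
        print(f"{corpus}: {types:,} types -> c. {types / 2.2:,.0f} potential headwords")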

    I don't think the Cobuild corpora are untypical for such rough calculations.

    3. Some dictionary figures:

    It is difficult to gauge from dictionary publishers' marketing blurbs exactly how many headwords
    are in their dictionaries, but here are a few figures taken from the Web today (unless otherwise stated).

    a) Pocket:
    Webster's New World Pocket: 37,000 entries

    b) Collegiate:
    New Shorter OED: 97,600 entries
    Oxford Concise: 220,000 words, phrases and meanings
    Webster's New World College: 160,000 entries
    (cf Collins English Dictionary 1992: 180,000 references)

    c) Unabridged:
    OED: 500,000 entries
    Random House Webster's Unabridged: 315,000 entries
    (cf American Heritage 1992: 350,000 entries/meanings)

    d) EFL Dictionaries
    (cf Longman 1995: 80,000 words/phrases)
    (cf Oxford 1995: 63,000 references)
    (cf Cambridge 1995: 100,000 words/phrases)
    (cf Cobuild 1995: 75,000 references)

    4. So, by my reckoning, the 100m-word ANC corpus (which, judging by the 120m figure
    in 2b above, will yield fewer than 45,000 potential headwords) will be adequate for a
    Pocket Dictionary, but will struggle to meet Collegiate requirements, and will be totally
    inadequate as the sole basis for an Unabridged Dictionary (if that really is the ANC's aim).

    Surely we will need corpora in the billions-of-words range before we can start to compile
    truly corpus-based Unabridged dictionaries. Until then, corpora can assist us in most
    lexicographic and linguistic enterprises, but we cannot say that they are adequate in
    size. It is no coincidence that corpora were first used for EFL lexicography, where
    the requirement in number of headwords is more modest. But even here, it took much
    larger corpora to give us reliable evidence of the range of meanings, grammatical
    patterning and collocational behaviour of all but the most common words.

    I have no wish to disillusion lexicographers working with smaller corpora. Cobuild's initial
    attempts in corpus lexicography entailed working with evidence from corpora of 1m and
    7m words. Many of those analyses remain valid in essence, even when checked in our
    450m word corpus. But we now have a better overview, and many more accurate details.
    Smaller corpora can be adequate for more restricted investigations, such as domain-specific
    dictionaries, local grammars, etc. But for robust generalizations about the entire lexicon, the
    bigger the corpus the better.

    Best
    Ramesh

    Ramesh Krishnamurthy
    Honorary Research Fellow, University of Birmingham;
    Honorary Research Fellow, University of Wolverhampton;
    Consultant, Cobuild and Bank of English Corpus, Collins Dictionaries.

    ----- Original Message -----
    From: "Amsler, Robert" <Robert.Amsler@hq.doe.gov>
    To: corpora@hd.uib.no
    Subject: RE: [Corpora-List] ACL proceedings paper in the American National Corpus

    There is clearly an issue here regarding what the American National Corpus
    is trying to represent. The Brown Corpus tried to be "representative" by
    extracting equal-sized samples selected from all the publications of a given
    year. As has since been found, it failed to adequately verify that all the
    texts were created by American authors. And, alas, 1 million words we now know
    to be quite small (adequate only for a Pocket Dictionary worth of entries).
    Collegiate dictionaries require at least a 10 million word corpus, and
    Unabridged dictionaries at least 100 million words (the target of the ANC).

    However, what I detect at this point from the ANC literature is that they are
    first trying to fill the quota of 100 million words, and are only secondarily
    concerned with "balancing" the corpus for genre and sample sizes.

    Also, if I'm not mistaken, the Brown Corpus didn't JUST balance for genres;
    it also tried to balance for timespan. That is, it tried to form a closed
    universe of possible publications and then to sample representatively from
    that universe.
    This involves attempting to determine all the possible publications in that
    universe and then selecting a subset which represents them in both quantity
    and genre. While it may seem ambitious to first decide what is in the list
    of all available publications (especially if your criterion for inclusion
    is merely "published after 1990"), it may be the only way to have a universe
    from which a truly random sample can be extracted.
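
    (Schematically, and only as a rough sketch of my own rather than the Brown Corpus's or
    the ANC's actual procedure: given such a closed universe of publications tagged by genre,
    a proportionally stratified random sample might be drawn along these lines, in Python.
    The function name and the example inputs are hypothetical.)

    import random
    from collections import defaultdict

    def representative_sample(universe, sample_size, seed=0):
        """universe: list of (publication_id, genre) pairs forming the closed universe."""
        rng = random.Random(seed)
        by_genre = defaultdict(list)
        for pub, genre in universe:
            by_genre[genre].append(pub)
        sample = []
        for genre, pubs in by_genre.items():
            # each genre's quota is proportional to its share of the universe
            quota = round(sample_size * len(pubs) / len(universe))
            sample.extend(rng.sample(pubs, min(quota, len(pubs))))
        return sample

    # e.g. representative_sample(all_publications_after_1990, sample_size=500)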

    Note: Brown Corpus Manual http://www.hit.uib.no/icame/brown/bcm.html

    Robert A. Amsler
    robert.amsler@hq.doe.gov
    (301) 903-8823


