Re: Corpora: Frequency Meaning

From: ramesh@clg.bham.ac.uk
Date: Thu Feb 17 2000 - 23:49:24 MET


    Dear Dr Gomez
    Cobuild used corpus lemma frequencies in their Dictionary (2nd edition,
    1995). We devised a 5-band distinction, with 700 lemmas in the most
    frequent band, 1200 in the 2nd band, 1500 in the 3rd band, 3200 in the
    4th and 8100 in the 5th. I can't remember the exact frequency cut-offs
    used, but I'm confident that most users of the dictionary have found it
    a very useful addition.
    The exact cut-off points might be affected by the size of the corpus, and
    may also be language dependent (in a highly inflected language like
    Spanish, there might be different relationships between some types and
    lemmas when compared to a relatively uninflected language like English).
    Also, the purpose of your classification may affect your decisions. For a
    dictionary, lemma is presumably more important than type, although type
    distribution within a lemma may influence whether a form is treated under
    the main lemma form, or is given separate headword status (e.g. "situated"
    in an English dictionary may be a separate headword, as well as being an
    inflected form under the headword "situate"; similarly "painting" and
    "paint"). Word-class shifts would also have to be taken into account.
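
    As a rough illustration of the banding idea, here is a minimal Python
    sketch. It is not the Cobuild procedure: the original cut-offs were
    frequency values that are not given here, so the sketch assumes lemmas
    are simply ranked by corpus frequency and assigned to bands using the
    band sizes quoted above. The names BAND_SIZES, BAND_LIMITS and
    assign_bands are hypothetical, chosen only for this example.

```python
from itertools import accumulate

# Band sizes quoted in the message (lemmas per band, most frequent first).
BAND_SIZES = [700, 1200, 1500, 3200, 8100]

# Cumulative rank limits derived from the sizes: a lemma whose frequency
# rank is <= BAND_LIMITS[i] (and above the previous limit) is in band i + 1.
BAND_LIMITS = list(accumulate(BAND_SIZES))


def assign_bands(lemma_counts, band_limits=BAND_LIMITS):
    """Map each lemma to a band number (1 = most frequent) by frequency rank.

    ASSUMPTION: banding is done by rank, not by the (unknown) frequency
    cut-offs. Ties are broken alphabetically, which is an arbitrary choice.
    Lemmas ranked beyond the last limit are mapped to None (unbanded).
    """
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(lemma_counts.items(), key=lambda item: (-item[1], item[0]))
    bands = {}
    for rank, (lemma, _count) in enumerate(ranked, start=1):
        bands[lemma] = next(
            (band for band, limit in enumerate(band_limits, start=1)
             if rank <= limit),
            None,
        )
    return bands
```

    With a real corpus, lemma_counts would be a mapping from each lemma to
    its total count summed over all its inflected types. For a more highly
    inflected language or a different corpus size, the band sizes would
    presumably need to be revisited rather than reused as they stand.
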
    Hope this helps.
    Ramesh

    Ramesh Krishnamurthy
    Honorary Research Fellow
    Corpus Research Group
    University of Birmingham
