[Corpora-List] Re: rare words

From: FIDELHOLTZ DOOCHIN JAMES LAWRENCE (jfidel@siu.buap.mx)
Date: Wed Jun 18 2003 - 18:12:54 MET DST


    N M Chipere wrote:

    > Is anyone familiar with the issues surrounding the definition and
    > measurement of word rarity? My colleagues and I are currently treating
    > the first two thousand most frequent words in English as common words and
    > the rest as rare (excluding proper nouns, numerals, etc). Apart from the
    > issue of where one puts the cut-off point, there is an obvious problem to
    > do with homographs, for which we don't have a simple solution.
    > Ngoni
    >
    > *********************************************************************
    > Dr Ngoni Chipere
    > Institute of Education
    > The University of Reading
    > Reading
    > Berkshire RG6 1HY
    >
    > tel: 0118 987 5123 x 4943
    > **********************************************************************

    Hi Ngoni,

    Well, what's 'rare' depends on what you are doing with it, or on your
    perspective, or both, and/or other things. For example, in a 1975 article
    ('Word frequency and vowel reduction in English', Chicago linguistic
    society. Annual meeting. Papers 11.200-213), I found that in a certain
    environment (first syllable, before consonant clusters not beginning with a
    nasal or whose second member is a liquid), reduction of unstressed lax
    vowels occurred in 'frequent' words, where 'frequent' is defined as
    occurring more than about 5 times per million words (I think that covers
    rather more than the first 2000 most frequent words; I used Thorndike &
    Lorge for frequency counts).
    In the same environment, but before clusters with an initial nasal
    consonant, the same thing happens, but 'frequent' for this environment is
    much higher, probably well over 50/M, which probably corresponds to fewer
    than the first 1000 most frequent words (I haven't checked out the
    correspondences exactly between 'most frequent' and 'N per million'). In
    other cases (eg unstressed vowels before clusters between stressed
    syllables), reduction is much easier, and even general except for some
    homonymy issues in relatively rare words, eg 'ex_or_cize' (usually no
    reduction before the movie The Exorcist came out) vs. 'ex_er_cise' (always
    reduced).
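
    To make the difference between a rank cut-off ('the first 2000 words')
    and a rate cut-off ('over 5 per million') concrete, here is a minimal
    sketch. It assumes you already have a plain word-to-count table from some
    corpus; the function names and the toy counts are invented for
    illustration and do not come from Thorndike & Lorge.

```python
from collections import Counter


def per_million(counts):
    """Convert raw corpus counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def frequent_by_rate(counts, min_per_million):
    """Words occurring more than min_per_million times per million tokens."""
    rates = per_million(counts)
    return {word for word, rate in rates.items() if rate > min_per_million}


def frequent_by_rank(counts, top_n):
    """The top_n most frequent words (a rank cut-off)."""
    return {word for word, _ in Counter(counts).most_common(top_n)}


def rank_equivalent(counts, min_per_million):
    """Number of words above the rate cut-off, i.e. the matching rank cut-off."""
    return len(frequent_by_rate(counts, min_per_million))


# Toy counts (invented); they sum to exactly one million tokens.
toy_counts = {
    "the": 600_000,
    "of": 300_000,
    "exercise": 99_940,
    "nasal": 54,
    "exorcize": 6,
}
```

    Calling rank_equivalent(toy_counts, 5) and rank_equivalent(toy_counts, 50)
    on the same table shows how raising the rate threshold shrinks the set of
    'frequent' words, which is the correspondence I said I had not checked
    exactly.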

    Homographs for the first few thousand most frequent words can be roughly
    checked for the frequency of their 'parts' by checking a dittoed work by
    Lorge & Thorndike (Lorge, Irving and Edward L. Thorndike. 1938. A
    semantic count of English words. NY: The Institute of Educational
    Research, Teachers College, Columbia University). This had a run of about
    100 copies and can be found in major libraries (I believe the British
    Library has one). A derivative work from
    L&T is: West, Michael P. (compiler & ed.) 1953. A general service list of
    English words with semantic frequencies. NY: Longmans, Green & Co.
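
    As a rough sketch of how a semantic count could be applied to the
    homograph problem: split a written form's overall rate among its senses
    in proportion to the counted shares, then apply the cut-off to each sense
    separately. The function names, the word 'bank', and all the numbers
    below are invented for illustration; they are not taken from Lorge &
    Thorndike or West.

```python
def sense_rates(word_rate, sense_shares):
    """Split a word's per-million rate among its senses by their shares."""
    total = sum(sense_shares.values())
    return {
        sense: word_rate * share / total
        for sense, share in sense_shares.items()
    }


def classify_senses(word_rate, sense_shares, threshold):
    """Label each sense 'common' or 'rare' against a per-million threshold."""
    return {
        sense: "common" if rate > threshold else "rare"
        for sense, rate in sense_rates(word_rate, sense_shares).items()
    }


# Hypothetical example: a written form at 40 per million overall,
# with 90% of occurrences in one sense and 10% in another.
example = classify_senses(40, {"financial": 90, "river": 10}, threshold=5)
```

    With these made-up figures the written form as a whole counts as common,
    but only one of its two senses clears the threshold, which is exactly the
    situation a cut-off applied to spellings alone will miss.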

    By the way, these frequencies in T&L and L&T are obviously close to 70 years
    old. I don't think that matters much, since such relative frequencies A)
    change very slowly, as far as I can tell; and B) are pretty heavily
    corpus-dependent, anyway. Still, there are much more recent things around
    if you're worried about that stuff.

    Well, I hope this is some help. All in all, not an easy problem, and very
    dependent on your aims.

    Jim

    James L. Fidelholtz
    Posgrado en Ciencias del Lenguaje
    Benemérita Universidad Autónoma de Puebla MÉXICO


