Corpora: Summary: Frequency Bands

From: Pascual Cantos Gomez (pcantos@fcu.um.es)
Date: Fri Feb 25 2000 - 17:14:02 MET

    Dear Corpus Linguists,

    I enclose a summary of the replies to my query on criteria for
    establishing frequency bands.

    Many thanks to:

    Tony Berber Sardinha
    James L. Fidelholtz
    Eric Atwell
    Ramesh Krishnamurthy

    -----------------------------

    James L. Fidelholtz:

    Dear Pascual:
            I don't have any very recent info for you, but I did publish an
    article in 1975 on English vowel reduction, which contains some
    suggestive data for part of your question (at least for English,
    although I would have to be convinced that frequency phenomena are
    significantly different in this regard for different languages). Now,
    there is a pretty clear dividing line at about 4/M (plus or minus about
    3/M) between words with reduced vowels in certain environments, and
    vowels unreduced in those environments (of course, the more frequent
    words show a greater tendency toward reduction). It seems to me that
    this would probably correspond to the difference between 'medium' and
    'low', but a lot depends on how you define these categories. Here, the
    evidence is overwhelmingly strong, in my opinion. There is some fairly
    weak evidence (from other environments with relatively few examples) for
    another dividing line somewhere around 35-50/M, which might correspond
    to the 'moderate'/'high' division, although my feelings are less strong
    on various aspects of this decision.
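
    The two dividing lines above could be applied roughly as follows. This is
    only a sketch: the 4/M line is taken from the message, while 40/M is an
    assumed midpoint of the 35-50/M range. The names per_million and band, and
    the treatment of values lying exactly on a boundary, are illustrative
    choices and are not taken from the article.

```python
# Sketch: band a word by its frequency per million tokens.
# LOW_MEDIUM follows the ~4/M dividing line quoted in the message.
# MEDIUM_HIGH = 40.0 is an ASSUMED midpoint of the quoted 35-50/M range.
LOW_MEDIUM = 4.0
MEDIUM_HIGH = 40.0


def per_million(count: int, corpus_size: int) -> float:
    """Convert a raw count into occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def band(freq_per_million: float) -> str:
    """Return 'low', 'medium' or 'high' for a per-million frequency."""
    if freq_per_million < LOW_MEDIUM:
        return "low"
    if freq_per_million < MEDIUM_HIGH:
        return "medium"
    return "high"
```

    For instance, 18 occurrences in an 18-Mword corpus come to 1/M, which
    falls in the 'low' band under these assumptions.
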
            No doubt others will have different ideas on what these
    differences correspond to, based on totally different analyzed data, but
    maybe we can get at some consensus about what these categories (or a
    smaller number of categories, perhaps) might correspond to
    psychologically. This last word is important, as there seem to exist
    various factors which may make a relatively infrequent word
    psychologically more salient, or vice versa (e.g., 'berserk' is actually
    almost never encountered in the earlier, pre-computer word counts
    [corpora of a few hundred Kwords to about 18 Mwords], and nevertheless
    acts phonologically in some ways like a 'medium'-frequency word; there is
    apparently something about its phonological shape which makes it
    extremely salient for English speakers).
            By the way, there is also some evidence in the article which
    calls into question whether, in at least some cases, nonautomatic
    morphophonemic alternation may produce distinct lexical entries, for at
    least some effects (specifically, the first vowel in the verb 'mistake'
    reduces, but the past tense 'mistook' usually has the first vowel
    unreduced, since the two forms fall on opposite sides of the
    'familiar/unfamiliar' frequency dividing line). It is data like these
    that make me interested in frequency counts of forms rather than
    lexemes.

            The article reference is as follows:
    Fidelholtz, James L. 1975. Word frequency and vowel reduction in
    English. _Chicago linguistic society. Regional meeting. Papers_
    11.200-213.
            At some point in the future, there will be an electronic version
    of this article available on the Web, but I can't promise when. I will
    let you know when it is available.
            Jim

    James L. Fidelholtz e-mail: jfidel@siu.buap.mx
    Maestría en Ciencias del Lenguaje
    Instituto de Ciencias Sociales y Humanidades
    Benemérita Universidad Autónoma de Puebla, MÉXICO

    -----------------------------
    Tony Berber Sardinha:

    Hi Pascual

    Would there be something wrong with simply placing 20% of the tokens in each
    frequency band and then 'adjusting' the individual percent freqs to fit within
    the 20% intervals? I simulated this in the spreadsheet that's attached to this
    message.
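
    One possible reading of this proposal, sketched in Python: types are
    sorted by descending frequency and each is assigned to the 20% interval
    in which its run of tokens starts. The function name equal_token_bands
    and this placement rule are assumptions; the attached spreadsheet is not
    reproduced here and may well do the 'adjusting' differently.

```python
from collections import Counter


def equal_token_bands(counts, n_bands=5):
    """Map each type to a band (1 = most frequent) so that each band
    covers roughly 1/n_bands of all tokens.

    Hypothetical helper: a type is placed in the interval where its
    cumulative token range begins.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    bands = {}
    seen = 0  # tokens covered by the types already placed
    # Most frequent first; ties broken alphabetically for reproducibility.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        # Integer arithmetic avoids floating-point trouble at band borders.
        index = min(n_bands - 1, seen * n_bands // total)
        bands[word] = index + 1
        seen += count
    return bands


toy_counts = Counter({"the": 40, "of": 20, "cat": 20, "sat": 10, "mat": 10})
toy_bands = equal_token_bands(toy_counts)
```

    With strongly skewed counts a single type can cover more than one 20%
    interval (in the toy data, "the" holds 40% of the tokens and band 2 stays
    empty under this rule), which is one way the simple scheme may need
    adjusting.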

    I ask because I have an interest in this issue as well, and I'm positive
    my approach is far too naive.

    abraço
    tony.

    -----------------------------
    Eric Atwell:

    Pascual,
    one point to remember is Zipf's law of frequency distribution
    of countable things in language. You may need to use a logarithmic scale
    in classifying into low/medium/high frequency. For example, many years ago
    I worked on the wordlist and suffixlist used in the LOB Corpus tagging
    program, classifying word-tags with words and suffixes on a logarithmic
    scale: POS-tags were classified common/rare/very-rare, where "rare" meant
    less than 10% and "very rare" meant 1% or less,
    e.g. water NN VB@ means "water" is usually Noun, about 10% Verb.
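
    A minimal sketch of both ideas, assuming proportions are given as
    fractions between 0 and 1 and frequencies as occurrences per million.
    The function names tag_rarity and log_band are hypothetical, the handling
    of a tag at exactly 10% is an assumption, and this is not the code of the
    LOB tagging program.

```python
import math


def tag_rarity(proportion: float) -> str:
    """Label a tag by the share of a word's occurrences it accounts for.

    Thresholds follow the message: 'very rare' is 1% or less,
    'rare' is less than 10%, anything else is 'common'.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    if proportion <= 0.01:
        return "very rare"
    if proportion < 0.10:
        return "rare"
    return "common"


def log_band(freq_per_million: float) -> int:
    """Band index on a base-10 log scale:
    0 for [1, 10) per million, 1 for [10, 100), -1 for [0.1, 1), and so on.
    """
    if freq_per_million <= 0:
        raise ValueError("frequency must be positive")
    return math.floor(math.log10(freq_per_million))
```

    On such a scale each band spans a tenfold range of frequencies, so a few
    bands can cover the long Zipfian tail without the top band collapsing to
    a handful of words.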

    You need huge data samples to yield frequencies accurate enough to give
    more fine-grained distinctions - I would advise against as many as 5 levels
    Very Low/Low/Moderate/High/Very High unless you are confident you can get
    enough examples to classify with confidence.

    Eric

    Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Coordinator
     Centre for Computer Analysis of Language And Speech (CCALAS)
     School of Computer Studies, Faculty of Engineering,
     University of Leeds, LEEDS LS2 9JT, England
     EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335430 FAX: (44)113-2335468
     WWW: http://www.scs.leeds.ac.uk/eric

    ---------------------------

    Ramesh Krishnamurthy:

    Dear Dr Gomez
    Cobuild used corpus lemma frequencies in their Dictionary (2nd edition,
    1995). We devised a 5-band distinction, with 700 lemmas in the most
    frequent band, 1200 in the 2nd band, 1500 in the 3rd band, 3200 in the 4th
    and 8100 in the 5th. I can't remember the exact frequency cut-offs used,
    but I'm confident that most users of the dictionary have found it a very
    useful addition.
    The exact cut-off points might be affected by the size of the corpus, and
    may also be language dependent (in a highly inflected language like
    Spanish, there might be different relationships between some types and
    lemmas when compared to a relatively uninflected language like English).
    Also the purpose of your classification may affect your decisions. For a
    dictionary, lemma is presumably more important than type, although type
    distribution within a lemma may influence whether a form is treated under
    the main lemma form, or is given separate headword status (e.g. "situated"
    in an English dictionary may be a separate headword, as well as being an
    inflected form under the headword "situate"; similarly "painting" and
    "paint"). Word-class shifts would also have to be taken into account.
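
    Since the frequency cut-offs are not given, the band sizes quoted above
    can only be turned into rank cut-offs. The sketch below assumes the bands
    were filled in descending order of lemma frequency; band_for_rank is a
    hypothetical name and not a Cobuild procedure.

```python
from bisect import bisect_left
from itertools import accumulate

# Lemmas per band, most frequent band first, as quoted in the message.
BAND_SIZES = [700, 1200, 1500, 3200, 8100]
# Cumulative rank limits: [700, 1900, 3400, 6600, 14700]
BAND_LIMITS = list(accumulate(BAND_SIZES))


def band_for_rank(rank: int):
    """Return the band (1-5) for a 1-based lemma frequency rank,
    or None when the rank lies beyond the 14,700 banded lemmas.

    ASSUMPTION: bands are filled strictly by descending frequency rank.
    """
    if rank < 1:
        raise ValueError("rank is 1-based")
    index = bisect_left(BAND_LIMITS, rank)
    return index + 1 if index < len(BAND_LIMITS) else None
```

    Under this reading the five bands together cover the 14,700 most frequent
    lemmas, and the lemma ranked 701st is the first one in band 2.
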
    Hope this helps.
    Ramesh

    Ramesh Krishnamurthy
    Honorary Research Fellow
    Corpus Research Group
    University of Birmingham

    ___________________________________________________

    Dr. Pascual Cantos Gomez

    Departamento de Filologia Inglesa
    Universidad de Murcia
    C./ Santo Cristo, 1
    30071 Murcia - SPAIN

    Tel: 968 364365; +34 968 364365
    Fax: 968 363185; +34 968 363185
    E-mail: pcantos@fcu.um.es


