Re: Corpora: Frequency Meaning

From: eric@scs.leeds.ac.uk
Date: Thu Feb 17 2000 - 11:14:45 MET


    Pascual,
    one point to remember is Zipf's law of frequency distribution
    of countable things in language. You may need to use a logarithmic scale
    in classifying into low/medium/high frequency. For example, many years ago
    I worked on the wordlist and suffixlist used in the LOB Corpus tagging program,
    classifying the POS-tags listed for each word and suffix on a logarithmic
    scale: tags were classified common/rare/very-rare, where "rare" meant less
    than 10% and "very rare" meant 1% or less.
    E.g. the entry "water NN VB@" means "water" is usually a Noun, and a Verb
    only about 10% of the time (the "@" marks VB as rare).
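
The three-way banding described above could be sketched roughly as follows. This is only an illustration, not the LOB tagging program's actual code: the function name `frequency_band` and the counts for "water" are invented for the example, and only the two cut-offs (less than 10%, 1% or less) come from the message. Because the cut-offs are powers of ten, the bands are evenly spaced on a logarithmic scale.

```python
def frequency_band(tag_count, word_count, rare=0.10, very_rare=0.01):
    """Place one tag of a word into common / rare / very-rare.

    The thresholds follow the message: "rare" is less than 10% of the
    word's occurrences, "very rare" is 1% or less.
    """
    if word_count <= 0 or not 0 <= tag_count <= word_count:
        raise ValueError("need 0 <= tag_count <= word_count and word_count > 0")
    share = tag_count / word_count
    if share <= very_rare:
        return "very-rare"
    if share < rare:
        return "rare"
    return "common"


# Hypothetical counts, made up for illustration: 100 occurrences of
# "water", 92 tagged NN and 8 tagged VB.
water_counts = {"NN": 92, "VB": 8}
total = sum(water_counts.values())
bands = {tag: frequency_band(n, total) for tag, n in water_counts.items()}
print(bands)
```

With these invented counts NN comes out as common and VB as rare, which would correspond to an entry like "water NN VB@".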

    You need huge data samples to yield frequencies accurate enough to support
    finer-grained distinctions. I would advise against as many as 5 levels
    (Very Low/Low/Moderate/High/Very High) unless you are confident you can get
    enough examples to classify reliably.
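
One way to see why sample size matters is to put a confidence interval around an observed proportion and check whether it stays inside a single band. The sketch below uses the Wilson score interval, which is a choice made for this example and not something the message prescribes; the function name `wilson_interval` and the counts are likewise assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# The same observed share (2%) from a small and a large sample.
small_lo, small_hi = wilson_interval(1, 50)
large_lo, large_hi = wilson_interval(100, 5000)
print(f"n=50:   {small_lo:.4f} .. {small_hi:.4f}")
print(f"n=5000: {large_lo:.4f} .. {large_hi:.4f}")
```

With 1 hit in 50, the interval runs from below 1% to just above 10%, so the data cannot separate very-rare, rare and common. With 100 hits in 5000, the interval lies entirely between 1% and 10%. Adding more levels narrows each band and raises the number of examples needed accordingly.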

    Eric

    Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Coordinator
     Centre for Computer Analysis of Language And Speech (CCALAS)
     School of Computer Studies, Faculty of Engineering,
     University of Leeds, LEEDS LS2 9JT, England
     EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335430 FAX: (44)113-2335468
     WWW: http://www.scs.leeds.ac.uk/eric



    This archive was generated by hypermail 2b29 : Thu Feb 17 2000 - 11:19:24 MET