Re: Corpora: Subsets and "partially-tagged" corpora

From: James L. Fidelholtz (jfidel@siu.buap.mx)
Date: Thu May 11 2000 - 18:01:36 MET DST


    On Wed, 10 May 2000, Mark Davies wrote:
    [snip]
    >I am considering an alternative scheme [to full tagging] in which I tag
    >just the most common words/forms for a given syntactic or verbal category,
    >such as the 100 most common nouns and infinitives, the 25 most common
    >adjectives, the 35 most common preterites, etc. The "tagged" elements
    >would be identified by a prefix, such as:
    >
    > VI-estar (= verb/infinitive-"to be")
    > N-hombre (= noun-"man")
    > VPT-supo (= verb/preterite-"knew")
    [snip]
    >So my question deals with what percentage of all of the occurrences of a
    >particular category would be included in this subset of most frequent
    >forms. For example, if there are 100,000 occurrences of infinitives in a
    >particular block of text (representing 2000 different forms) and I tag just
    >the 100 most common forms, what percentage of all of the occurrences will
    >get marked -- 25%, 50%, etc.? I'm going to be carrying out some tests
    >myself, but would like to be able to compare the results to other studies
    >that might have already been done.

    Mark:
            I can't give you studies offhand, although I'm sure they
    exist. You could dig the data for English out of Thorndike & Lorge's
    book: take all the nouns, verbs, or whatever category you want that
    are marked AA or A (or, to limit yourself further, just those marked
    AA), and add up their occurrences per million. (That figure is
    available for each common word somewhere in the book, or perhaps
    elsewhere, at least for the very most common words.) Dividing that
    sum by the total occurrences for the category gives you the
    percentage covered. If you use the 100 most common nouns, the 100
    most common verbs, and say the 50 most common adjectives and the 50
    most common adverbs, that ought to give you, at a guess, well over
    80%, and probably well over 90%, coverage of each category.
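
    The arithmetic described above can be sketched in a few lines of
    Python. This is only an illustration under stated assumptions: the
    function name coverage_of_top_n and the toy list of infinitives are
    hypothetical, and the input is assumed to be a list of tokens already
    known to belong to one category. It uses no figures from Thorndike &
    Lorge.

```python
from collections import Counter


def coverage_of_top_n(tokens, n):
    """Return the fraction of all tokens covered by the n most frequent forms.

    `tokens` is assumed to hold every occurrence of a single category
    (e.g. all infinitives found in a block of text).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty input: nothing to cover
    # most_common(n) yields (form, count) pairs for the n most frequent forms
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Hypothetical toy sample: 12 occurrences of 5 distinct infinitives
infinitives = ["ser"] * 5 + ["estar"] * 3 + ["tener"] * 2 + ["saber", "poder"]

# Share of occurrences that would be tagged if only the 2 most common
# forms were marked
print(coverage_of_top_n(infinitives, 2))
```

    Running the same function with n = 100 over the infinitives of a real
    corpus would give the percentage asked about in the original
    question.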
    [snip]
            I hope this helps some. Someone else can probably give you
    the better skinny, and maybe more recent stuff than T&L.
                    Jim

    James L. Fidelholtz e-mail: jfidel@siu.buap.mx
    Posgrado en Ciencias del Lenguaje tel.: +(52-2)229-5500 x5705
    Instituto de Ciencias Sociales y Humanidades fax: +(01-2) 229-5681
    Benemérita Universidad Autónoma de Puebla, MÉXICO


