Corpora: Re: Subsets and "partially-tagged" corpora (some actual statistics)

From: Mark Davies (mdavies@ilstu.edu)
Date: Thu May 11 2000 - 21:24:45 MET DST


    Thanks for all of the replies, both public and private, re. the use of
    "partially-tagged" corpora.

    In particular, following up on the question:

    > >So my question deals with what percentage of all of the occurrences of a
    > >particular category would be included in this subset of most frequent
    > >forms. For example, if there are 100,000 occurrences of infinitives in a
    > >particular block of text (representing 2000 different forms) and I tag just
    > >the 100 most common forms, what percentage of all of the occurrences will
    > >get marked -- 25%, 50%, etc.?

    I ran some tests yesterday that suggest what percentage of all occurrences
    of a particular lexical category can be found by using just the 25 or 50 most
    common forms. For example, in an 800,000-word corpus of Spanish short
    stories there are 19,484 occurrences of infinitives, involving 1739
    different verbs. The 25 most common forms (ser, ver, hacer, decir, etc)
    provide a total of 7192 occurrences, or 37% of the total. The 50 most
    common forms give 49% and the 100 most common forms give 62%.
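
    To make the arithmetic explicit, here is a minimal sketch of the
    percentage calculation, using only the infinitive figures quoted above.
    The function name coverage_percent is my own, for illustration only.

```python
def coverage_percent(subset_occurrences: int, total_occurrences: int) -> float:
    """Share of all occurrences (in percent) covered by a subset of forms."""
    if total_occurrences <= 0:
        raise ValueError("total_occurrences must be positive")
    return 100.0 * subset_occurrences / total_occurrences


# Figures from the message: the 25 most common infinitives account for
# 7192 of the 19,484 infinitive occurrences in the corpus.
top25_infinitives = coverage_percent(7192, 19484)
print(f"{top25_infinitives:.1f}%")  # about 37% when rounded
```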

    Not surprisingly, the fewer unique forms a particular category has, the
    higher the percentage of all occurrences that one captures by using only
    the subset of the most common forms. For example, there
    are 459 unique forms for the 3SG -ra imperfect subjunctive, giving a total
    of 2346 occurrences. The 25 most common forms (pudiera, quisiera,
    estuviera, etc) account for 50% of all occurrences, and the 50 most common
    forms give 60%.

    So for me, at least, the question remains whether the syntax of the
    subset involving the most common forms (which can be easily identified and
    tagged) will be representative of the entire list of unique forms. As a
    concrete example, would the syntax of the 50 most common imperfect
    subjunctives differ markedly from that of the least common forms (e.g.
    #200-459 on the frequency list)? If not, then there might be some value in
    using partially tagged corpora, at least as an intermediate tool where a
    corpus has not been completely tagged yet (or where it may never be).

    Mark D.

    P.S. For those who are interested in how the data given above was
    extracted, here is the procedure. First, create a word frequency list with
    a concordance program (I used WordSmith). Save it as a CSV file and then
    import this into a database program (I used Access). Then run a query that
    matches that list against a list of all of the unique forms for a
    particular category (I've created a table with all of the conjugations for
    7000+ verbs in Spanish). Then (for ease in calculations) export the results
    to a spreadsheet program (I used Excel), sort by frequency, and then sum
    the totals for the 25/50/100 most common forms, as a percentage of the
    total for all forms. Using these three programs one can calculate the
    percentage for any given verb form in a moderately sized corpus
    (1,000,000-3,000,000 words) in just a couple of minutes.
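
    For anyone who would rather script these steps than chain three
    programs, here is a rough sketch of the same procedure in Python. It is
    not what I actually ran. It assumes the frequency list has been saved as
    a two-column CSV (word, frequency), which may not match the real
    WordSmith export layout, and the sample words and counts below are
    invented purely to show the mechanics. Like the database query, it
    matches on surface form only.

```python
import csv
import io


def load_frequency_list(csv_text: str) -> dict[str, int]:
    """Parse 'word,frequency' rows (assumed layout) into a dict."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].lower(): int(row[1]) for row in reader if len(row) >= 2}


def top_n_coverage(freqs, category_forms, cutoffs=(25, 50, 100)):
    """Percent of a category's occurrences covered by its N most common forms."""
    # Keep only corpus words that appear in the category's list of forms,
    # then sort their counts from most to least frequent.
    matched = sorted(
        (freqs[form] for form in category_forms if form in freqs),
        reverse=True,
    )
    total = sum(matched)
    if total == 0:
        return {}
    return {n: 100.0 * sum(matched[:n]) / total for n in cutoffs}


# Invented toy data: four infinitives plus one non-verb that must be ignored.
sample_csv = "ser,40\nver,30\nhacer,20\ndecir,10\ncasa,500\n"
infinitives = {"ser", "ver", "hacer", "decir", "amar"}  # hypothetical form list

freqs = load_frequency_list(sample_csv)
print(top_n_coverage(freqs, infinitives, cutoffs=(1, 2)))
```

    With a real frequency list and a full table of forms for the category,
    the default cutoffs of 25, 50 and 100 give the same kind of figures
    reported above.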

    =======================================
    Mark Davies, Associate Professor, Spanish Linguistics
    Dept. of Foreign Languages, Illinois State University
    Normal, IL 61790-4300

    Voice:309/438-7975 email:mdavies@ilstu.edu
    Fax:309/438-8038 http://mdavies.for.ilstu.edu/personal/
    =======================================


