RE: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Amsler, Robert (Robert.Amsler@hq.doe.gov)
Date: Mon Sep 30 2002 - 18:47:27 MET DST


    There is clearly an issue here regarding what the American National Corpus
    is trying to represent. The Brown Corpus tried to be "representative" by
    extracting equal-sized samples selected from all the publications of a given
    year. As has since been found, it did not adequately verify that all the
    texts were written by American authors and, alas, we now know 1 million words
    to be quite small (adequate only for a Pocket Dictionary's worth of entries).
    Collegiate dictionaries require at least a 10 million word corpus, and
    Unabridged dictionaries at least 100 million words (the target of the ANC).

    However, what I detect so far from the ANC literature is that they are
    trying first to fill the quota of 100 million words and are only secondarily
    concerned with "balancing" the corpus for genre and sample size.

    Also, if I'm not mistaken, the Brown Corpus didn't JUST balance for genres;
    it also tried to balance for timespan, i.e., it tried to form a closed
    universe of possible publications and then sample representatively from that
    universe. This involves attempting to determine all the possible publications
    in that universe and then selecting a subset which represents them in both
    quantity
    and genre. While it may seem ambitious to first decide what is in the list
    of all available publications (especially, if your criterion for inclusion
    is merely "published after 1990"), it may be the only way to have a universe
    from which a truly random sample can be extracted.
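
    To make the procedure concrete, here is a minimal sketch in Python of
    drawing a genre-proportional random sample from an enumerated universe.
    The function name, the genre labels, and the toy universe of 100 titles
    are all hypothetical, invented for illustration; this is not the
    procedure actually used for the Brown Corpus or the ANC.

```python
import random
from collections import defaultdict


def stratified_sample(universe, total, seed=0):
    """Draw a genre-proportional random sample.

    universe: list of (title, genre) pairs, i.e. the closed universe.
    total:    approximate number of titles wanted overall.
    Returns a dict mapping each genre to its randomly chosen titles.
    """
    rng = random.Random(seed)

    # Group the enumerated universe by genre (the strata).
    by_genre = defaultdict(list)
    for title, genre in universe:
        by_genre[genre].append(title)

    sample = {}
    for genre, titles in sorted(by_genre.items()):
        # Quota proportional to the genre's share of the universe,
        # with at least one title per genre. Because of rounding,
        # the quotas may not sum exactly to `total`.
        quota = max(1, round(total * len(titles) / len(universe)))
        quota = min(quota, len(titles))
        # Simple random sample without replacement within the stratum.
        sample[genre] = rng.sample(titles, quota)
    return sample


# Hypothetical universe: 60 press, 30 fiction and 10 science titles.
universe = (
    [(f"press-{i}", "press") for i in range(60)]
    + [(f"fiction-{i}", "fiction") for i in range(30)]
    + [(f"science-{i}", "science") for i in range(10)]
)
chosen = stratified_sample(universe, total=10)
```

    With this toy universe the quotas come out as 6 press, 3 fiction and
    1 science title. The point is that the quotas can only be computed once
    the whole universe has been listed, which is why the enumeration step
    cannot be skipped.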

    Note: Brown Corpus Manual http://www.hit.uib.no/icame/brown/bcm.html

    Robert A. Amsler
    robert.amsler@hq.doe.gov
    (301) 903-8823
     


