RE: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Amsler, Robert (Robert.Amsler@hq.doe.gov)
Date: Mon Sep 30 2002 - 18:47:27 MET DST

Next message: LDC Office: "[Corpora-List] New Corpora from the LDC"

Previous message: Copperman, Max: "RE: [Corpora-List] ACL proceedings paper in the American National Corpus"
Maybe in reply to: Nancy Ide: "[Corpora-List] ACL proceedings paper in the American National Corpus"
Next in thread: Michal Sulc: "RE: [Corpora-List] ACL proceedings paper in the American National Corpus"

There is clearly an issue here regarding what the American National Corpus
is trying to represent. The Brown Corpus tried to be "representative" by
extracting equal-sized samples selected from all the publications of a given
year. As has since been found, it did not adequately verify that all the
texts were written by American authors and, alas, we now know 1 million words
to be quite small (adequate only for a pocket dictionary's worth of entries).
Collegiate dictionaries require a corpus of at least 10 million words, and
unabridged dictionaries at least 100 million words (the target of the ANC).

However, what I gather so far from the ANC literature is that they are
first trying to fill the quota of 100 million words and are only secondarily
concerned with "balancing" the corpus for genre and sample sizes.

Also, if I'm not mistaken, the Brown Corpus didn't JUST balance for genres;
it also tried to balance for timespan. That is, it tried to form a closed
universe of possible publications and then sample representatively from that
universe. This involves attempting to enumerate all the publications in that
universe and then selecting a subset that represents them in both quantity
and genre. While it may seem ambitious to first decide what is in the list
of all available publications (especially if your criterion for inclusion
is merely "published after 1990"), it may be the only way to have a universe
from which a truly random sample can be extracted.
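
To make the idea concrete, here is a minimal sketch of drawing a sample from a
closed universe with proportional allocation per genre. It is only an
illustration under stated assumptions: the function name, the genre labels, and
the counts are all hypothetical, and this is not a description of how the Brown
Corpus or the ANC was actually compiled.

```python
import random
from collections import defaultdict

def stratified_sample(universe, total, seed=0):
    """Draw about `total` titles from `universe`, proportionally per genre.

    universe: list of (title, genre) pairs, i.e. the enumerated closed universe.
    Returns a dict mapping genre -> list of randomly chosen titles.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_genre = defaultdict(list)
    for title, genre in universe:
        by_genre[genre].append(title)

    sample = {}
    for genre, titles in sorted(by_genre.items()):
        # Proportional allocation, with at least one title per genre.
        # Because of rounding, the quotas may not sum exactly to `total`.
        quota = max(1, round(total * len(titles) / len(universe)))
        quota = min(quota, len(titles))
        sample[genre] = rng.sample(titles, quota)  # simple random draw within the genre
    return sample

# Hypothetical universe: 100 publications in three made-up genre strata.
universe = (
    [(f"press-{i}", "press") for i in range(60)]
    + [(f"fiction-{i}", "fiction") for i in range(30)]
    + [(f"learned-{i}", "learned") for i in range(10)]
)
sample = stratified_sample(universe, total=10)
for genre, titles in sample.items():
    print(genre, len(titles))
```

The point of the sketch is that the random draw only means something once the
list passed in as `universe` is complete; with an open-ended criterion there is
no such list to draw from.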

Note: Brown Corpus Manual http://www.hit.uib.no/icame/brown/bcm.html

Robert A. Amsler
robert.amsler@hq.doe.gov
(301) 903-8823

