Re: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Nancy Ide (ide@cs.vassar.edu)
Date: Mon Sep 30 2002 - 19:14:51 MET DST

    On Monday, September 30, 2002, at 08:14 AM, Adam Kilgarriff wrote:
    > This sort of thing can only be avoided by not having too much
    > data of any single specialised data type. I would recommend a limit of
    > 0.5% for linguistics papers in general, with no subspecialism (e.g.
    > computational linguistics, or, worse still, parsing) taking more than
    > a quarter of that, and a limit of 10,000 words from any single
    > document.

    Until we actually extract the papers from the ACL data, I have no idea
    what the size of the portion included in the ANC will be. However, it
    will certainly be a tiny percentage of the core corpus, let alone of
    the entire ANC.

    Our goal is eventually to produce a core corpus of 100 million words
    whose distribution of text types is comparable to the BNC's, so that
    the two can be compared. However, note that unlike the BNC, the ANC
    will include, beyond the 100 million word core, a "varied" component
    consisting of whatever we can get our hands on. These texts will be
    identified by source/genre, and can be used or discarded as the user
    sees fit. If the ACL materials would make up a larger percentage of
    the core corpus than is reasonable, the excess will be put into the
    varied component.
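
    To make the arithmetic concrete, here is a rough sketch (in Python) of
    how caps of the kind Adam suggests could be applied while routing the
    overflow into the varied component. The 10,000-word and 0.5% figures
    come from his message; the document records and field layout are
    purely illustrative, not our actual pipeline.

    # Illustrative only: enforce a per-document word cap and a per-genre
    # share cap while filling the core; overflow goes to the varied
    # component rather than being discarded.
    CORE_TARGET = 100_000_000   # core corpus size, in words
    DOC_CAP = 10_000            # max words sampled from any one document
    GENRE_CAP = 0.005           # max share of the core for one specialism

    def assign(documents):
        """documents: iterable of (doc_id, genre, word_count) tuples."""
        core, varied, genre_totals = [], [], {}
        for doc_id, genre, words in documents:
            sample = min(words, DOC_CAP)
            used = genre_totals.get(genre, 0)
            if used + sample <= CORE_TARGET * GENRE_CAP:
                core.append((doc_id, genre, sample))
                genre_totals[genre] = used + sample
            else:
                varied.append((doc_id, genre, words))
        return core, varied

    core, varied = assign([
        ("acl-P02-1001", "computational linguistics", 6200),
        ("acl-P02-1002", "computational linguistics", 14500),
    ])

    With a 0.5% cap this allows at most 500,000 words of any one
    specialism in the core; anything beyond that simply moves to the
    varied component.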

    >
    > BNC used a sample size from a single document of 40,000 words as its
    > default. However, most of these documents weren't too specialised, so
    > it didn't cause too many problems. It's the combination of
    > substantial samples with narrow text-types that is invidious.

    We are certainly aware of this and working to ensure a broad sample. We
    too are sampling texts, taking only a certain number of words from each.

    >
    > I've only referred to distorted frequency lists in the above. They
    > are the easiest effect of distortion to describe. There will also be
    > distortions of all sorts of other language-model components (bigrams,
    > trigrams, grammars, induced lexicons, etc.) - the problem is,
    > it's hard to describe what or how, and the distortions will usually
    > go unnoticed, or even feature as "interesting discoveries about the
    > language". That's why it's important to be aware of these balance
    > issues when building a corpus in the first place.

    We are. And we appreciate input such as yours, above, on how to best
    achieve something reasonable.
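
    To give a sense of the frequency-list effect Adam describes, here is a
    small, purely illustrative comparison (in Python) of relative
    frequencies with and without an over-represented specialist sample;
    the toy "corpora" below are invented.

    # Illustrative only: over-representing one specialised text type
    # inflates the relative frequencies of its characteristic vocabulary.
    from collections import Counter

    general = "the cat sat on the mat and the dog slept".split() * 1000
    specialist = "the parser attaches the noun phrase to the head verb".split() * 1000

    def rel_freq(tokens, word):
        return Counter(tokens)[word] / len(tokens)

    balanced = general + specialist[:len(general) // 200]  # ~0.5% specialist
    skewed = general + specialist                           # ~50% specialist

    for word in ("parser", "phrase", "cat"):
        print(word, rel_freq(balanced, word), rel_freq(skewed, word))

    The same kind of shift carries over, less visibly, into bigram and
    trigram counts and anything induced from them.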

    > Technical language is an important part of
    > language, and we are undermining an open-minded view of language if we
    > exclude technical language wholesale.

    Agreed!

    > Maybe the corpus just needs to
    > be much (MUCH) bigger so it can include substantial quantities of lots
    > of different specialist text types, with none making up more than 0.1%
    > of the whole (hey, I know a corpus like that, it's called the web ;-) )

    This is really what the ANC hopes to be in the end. The rationale
    behind the varied component is exactly that: put in whatever you can
    get, and since every text is identified by type and source, users can
    construct a sub-corpus from that data on the basis of their own
    criteria.
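
    As a rough sketch of what such a selection might look like (the
    metadata fields and values here are invented for the example, not the
    actual ANC header format):

    # Illustrative only: building a sub-corpus from the varied component
    # by filtering on text type/source metadata.
    docs = [
        {"id": "anc-0001", "source": "Slate", "text_type": "essay"},
        {"id": "anc-0002", "source": "ACL proceedings", "text_type": "technical paper"},
        {"id": "anc-0003", "source": "telephone transcript", "text_type": "conversation"},
    ]

    def subcorpus(documents, wanted_types):
        return [d for d in documents if d["text_type"] in wanted_types]

    general_only = subcorpus(docs, {"essay", "conversation"})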

    As for the web, yes you have lots of specialized text types there--and
    that is just the problem if one wants data that covers generalized
    language usage. A small experiment we did and reported at LREC last May
    suggested that web language on the whole is dramatically skewed toward
    dense, academic-like prose (see Ide, N., Reppen, R., Suderman, K.
    (2002). The American National Corpus: More Than the Web Can Provide.
    Proceedings of the Third Language Resources and Evaluation Conference
    (LREC), Las Palmas, Canary Islands, Spain, pp. 839-844. Available at
    http://www.cs.vassar.edu/~ide/papers/anc-lrec02.ps). We argue,
    therefore, that no matter how much data you cull from the web, it will
    be significantly skewed toward one end of the spectrum of "style" or
    type.

    A final point: The first release of 10 million words of the ANC, due
    out in a month or so, will not be at all balanced--it will consist of
    whatever data we have so far, as we are constrained by which texts have
    been provided at what point and how much processing is required to put
    them in a usable format. The intent of the first release is to provide
    something quickly for our consortium members and to enable them to
    test the (very minimal) search and access interface and give us input
    for the design of the final one; we assume, though, that many other
    researchers will also use it, whatever the content.

    Nancy
    ======================================================

    Nancy Ide

    Professor and Chair
    Department of Computer Science, Vassar College
    Poughkeepsie, NY 12604-0520 USA
    Tel: +1 845 437-5988 Fax: +1 845 437-7498
    ide@cs.vassar.edu

    Chercheur Associe
    Equipe Langue et Dialogue, LORIA/CNRS
    Campus Scientifique - BP 239
    54506 Vandoeuvre-les-Nancy FRANCE
    Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
    ide@loria.fr

    =======================================================


