Re: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Nancy Ide (ide@cs.vassar.edu)
Date: Sat Sep 28 2002 - 17:26:56 MET DST


    On Friday, September 27, 2002, at 04:18 PM, Simon G. J. Smith wrote:

    >>> Note that this applies to papers whose authors are native speakers of
    >>> American English only.
    >>
    >> Two questions. What is your definition of native speaker? and how are
    >> you going to determine who meets your definition? This is not as
    >> trivial as it may sound.
    >
    >
    > No, not trivial at all. I presume, though, that since (surely) no
    > records are kept on researchers' linguistic origins, they will simply
    > have to ask everyone if they think they qualify: just as job
    > applicants and others are asked to supply details of what they
    > consider to be their ethnic origin, for statistical purposes.
    >
    > But I'm still curious as to what happens in the not uncommon case
    > where a paper is jointly authored by native and non-native speakers.
    > It can't depend purely on the linguistic origin of the person doing
    > the presentation, because it's the written paper that's being
    > archived, not the talk. The first-named author, perhaps? Or is it safe
    > to assume that if *any* native speakers contributed, someone will have
    > rendered the text into a style sufficiently native-like to qualify
    > anyway. Tricky.
    >

    Very tricky, and unlikely to be entirely solvable. Perhaps we should
    have asked instead for ACL authors who are native speakers of American
    English to identify themselves ;-)

    We can only do our best to identify papers written by people who have
    spent the greater part of their lives (most notably, their younger
    years) in the US. As for non-native speaker co-authors, this becomes
    trickier, as you point out, but in principle the first author on a
    paper should be the most influential in terms of the language contained
    in it. In principle.

    The goal of the ANC project is to compile a massive corpus that will
    reflect American English usage. It has to be massive precisely so that
    we can get thousands of examples in order to have a statistically
    reliable sense of how the language is being used *for the most part*. I
    think even if we were to be able to verify that every author in the ANC
    is a bona fide native speaker of American English (assuming we could
    define it), we'd get plenty of variation anyway. I assume that the BNC
    did not check the pedigree of every author in the corpus either, yet we
    can get a good sample of British English from that data. We're hoping
    the same is true of the ANC.

    That said, as you point out, this opens a huge can of worms, if only to
    bring up the question of what American English is. Given the diversity
    and mobility in this country, the question is even more difficult to
    address than it might be for other languages/locations. Maybe the ANC
    will provide a source for considering it.

    =======================================================

    Nancy Ide

    Professor and Chair
    Department of Computer Science, Vassar College
    Poughkeepsie, NY 12604-0520 USA
    Tel: +1 845 437-5988 Fax: +1 845 437-7498
    ide@cs.vassar.edu

    Chercheur Associe
    Equipe Langue et Dialogue, LORIA/CNRS
    Campus Scientifique - BP 239
    54506 Vandoeuvre-les-Nancy FRANCE
    Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
    ide@loria.fr

    =======================================================


