Re: [Corpora-List] ACL proceedings paper in the American National Corpus

From: William Mann (bill_mann@sil.org)
Date: Mon Sep 30 2002 - 15:09:31 MET DST

    The sort of distortion that Adam Kilgarriff cites has been with us from the
    beginning. Look at the Brown Corpus, 1 million words (so large in its day)
    and look at the high frequency of the English word 'jabberwocky.'

    This is really raising questions about the conceptual foundations of the
    whole enterprise. Have we assumed that 'English' is not simply a collective
    term, representing a range of specializations and dialects that no one could
    possibly learn entirely? Have we assumed that "I speak English" has some
    denotational sense? If so, what?

    Have we assumed that 'English' has a boundary, and it is our job to find it?
    Probably not, but then we should avoid boundary-finding activities.

    The part of our conceptual foundations that might be the most troublesome is
    the latter one. We tend to try to define 'English' as a category. That
    leads to set theory, and to seeking the boundaries of 'English' and
    'speakers of English.'

    It does not need to be so. Perhaps we can see these categories in terms of
    prototypes, and seek the central, most representative cases rather than the
    boundaries. It is an alternative.

    I have no idea whether such an orientation would conflict with basic
    assumptions that are current in Corpus Based Linguistics. It seems
    worthwhile to ask.

    Bill Mann

    ----- Original Message -----
    From: "Adam Kilgarriff" <adam.kilgarriff@itri.brighton.ac.uk>
    To: "Martin Wynne" <martin.wynne@ota.ahds.ac.uk>
    Cc: <corpora@hd.uib.no>
    Sent: Monday, September 30, 2002 8:14 AM
    Subject: RE: [Corpora-List] ACL proceedings paper in the American National Corpus

    >
    > All,
    >
    > I'll second Martin's point about the hazards of specialised text.
    > Once you start getting a largish quantity of specialised material in
    > what aspires to be a general-purpose corpus, it rapidly gets
    > distorted. My bugbear, in relation to the BNC, is GUT (Journal of
    > Gastroenterology and Hepatology) of which there are 600,000 words (ie
    > just 0.6%). 0.6% might not sound too large, but it is a very
    > specialised text type, and means that words like
    >
    > gastric a
    > mucosa n
    > colitis n
    >
    > leap up into the top 8000 most frequent words of English (a list that
    > doesn't include
    >
    > pad v
    > regulator n
    > wavelength n
    > prejudice v
    > iron v
    > voting a
    > escort n
    > dynasty n
    >
    > )
    >
    > This sort of thing can only be avoided by not having too much
    > data of any single specialised data type. I would recommend a limit of
    > 0.5% for linguistics papers in general, with no subspecialism (eg
    > computational linguistics, or, worse still, parsing) taking more than
    > a quarter of that, and a limit of 10,000 words from any single
    > document.
    >
    > BNC used a sample size from a single document of 40,000 words as its
    > default. However, most of these documents weren't too specialised so
    > it didn't cause too many problems. It's the combination of
    > substantial samples with narrow text-types that is invidious.
    >
    > I've only referred to distorted frequency lists in the above. They
    > are the easiest effect of distortion to describe. There will also be
    > distortions of all sorts of other language-model components (bigrams,
    > trigrams, grammars, induced lexicons etc) - the problem is,
    > it's hard to describe what or how and the distortions will usually
    > go unnoticed, or even feature as "interesting discoveries about the
    > lg". That's why it's important to beware these balance issues when
    > building a corpus in the first place.
    >
    > (And of course, "distortion" is a problem term here as it implies
    > there is the possibility of a non-distorted resource. But I won't get
    > into that one here...)
    >
    > > One way of avoiding this, and many other potential problems which can be
    > > found in specialised language, would be to apply a criterion for inclusion
    > > of texts in the corpus that they should not be too technical in nature.
    > >
    >
    > I'm not sure I agree here. Technical language is an important part of
    > language, and we are undermining an open-minded view of language if we
    > exclude technical language wholesale. Maybe the corpus just needs to
    > be much (MUCH) bigger so it can include substantial quantities of lots
    > of different specialist text types, with none making up more than 0.1%
    > of the whole (hey, I know a corpus like that, it's called the web ;-) )
    >
    > adam
    >
    >
    > NEW!! MSc and Short Courses in Lexical Computing and Lexicography
    > Info at
    >
    > http://www.itri.brighton.ac.uk/lexicom
    >
    > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    > Adam Kilgarriff
    > Senior Research Fellow tel: (44) 1273 642919
    > Information Technology Research Institute (44) 1273 642900
    > University of Brighton fax: (44) 1273 642908
    > Lewes Road
    > Brighton BN2 4GJ email: Adam.Kilgarriff@itri.bton.ac.uk
    > UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    >
    >
    >
    >
    > Martin Wynne writes:
    > > Nancy's posting set off some very different alarm bells for me. I would like
    > > to draw attention to what I think would be another problem with the
    > > inclusion of texts from ACL proceedings in the American National Corpus.
    > >
    > > Let me start with an interesting case which I came across some years ago.
    > > After hearing someone repeat the well-known fact that people don't say
    > > 'powerful tea' in English, I thought it would be worth checking for
    > > empirical evidence for this. I searched for the phrase in the BNC, and got 3
    > > hits. All are from a text source listed as follows:
    > >
    > > "Large vocabulary semantic analysis for text recognition.
    > > Rose, Tony Gerard, u.p.. Sample containing about 42217 words of unpublished
    > > miscellanea (domain: applied science)"
    > >
    > > and they are discussions of exactly the same point, i.e. the fact that you
    > > don't say 'powerful tea'.
    > >
    > > (Incidentally, I also searched in the whole Bank of English and found no
    > > hits for "powerful tea", and 39 hits for "weak tea", so the original point
    > > is not disproven.)
    > >
    > > In ACL articles you will also get citations of made-up examples like this,
    > > plus listings of 'ungrammatical' sentences. Basically, this problem seems to
    > > boil down to the fact that you get a lot of 'mention' rather than 'use' of
    > > words and phrases in academic linguistic literature, and this could have a
    > > fairly significant effect on the results of linguistic analysis of the
    > > corpus. If one of the main reasons for building the corpus is to enable
    > > researchers to analyse naturally occurring American English, in order to see
    > > what does occur and what doesn't, then letting in lots of made-up example
    > > sentences and phrases would make it less fit for the proposed purpose.
    > >
    > > One way of avoiding this, and many other potential problems which can be
    > > found in specialised language, would be to apply a criterion for inclusion
    > > of texts in the corpus that they should not be too technical in nature.
    > >
    > > __
    > > Martin Wynne
    > > martin.wynne@ota.ahds.ac.uk
    > > Linguistics Officer
    > > Oxford Text Archive
    > >
    > > Oxford University Computing Services
    > > 13 Banbury Road
    > > Oxford
    > > UK - OX2 6NN
    > > Tel: +44 1865 283299
    > > Fax: +44 1865 273275
    > >
    >
    >
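
The composition limits suggested in the quoted message (0.5% of the corpus for linguistics papers, no subspecialism above a quarter of that, and at most 10,000 words from any single document) could be checked mechanically while a corpus is being assembled. The following is only a rough sketch under stated assumptions: the function name, the tuple layout for documents, and the decision to apply the same caps to every specialised field are illustrative choices, not anything proposed on the list.

```python
from collections import defaultdict

# Caps taken from the figures in the quoted message. Applying them to
# every field, not only linguistics, is an assumption of this sketch.
FIELD_CAP = 0.005             # at most 0.5% of the corpus per specialised field
SUBFIELD_CAP = FIELD_CAP / 4  # no subspecialism above a quarter of that
DOC_CAP = 10_000              # at most 10,000 words from a single document


def check_balance(docs, corpus_size):
    """Return a list of messages describing every cap that is exceeded.

    docs: iterable of (doc_id, field, subfield, n_words) tuples
          (a hypothetical layout chosen for this example).
    corpus_size: total number of words in the whole corpus.
    """
    field_words = defaultdict(int)
    subfield_words = defaultdict(int)
    problems = []

    for doc_id, field, subfield, n_words in docs:
        # Per-document limit on sample size.
        if n_words > DOC_CAP:
            problems.append(
                f"document {doc_id}: {n_words} words exceeds {DOC_CAP}"
            )
        field_words[field] += n_words
        subfield_words[(field, subfield)] += n_words

    # Share of the whole corpus taken by each field.
    for field, n in sorted(field_words.items()):
        if n / corpus_size > FIELD_CAP:
            problems.append(
                f"field {field}: {n / corpus_size:.3%} exceeds {FIELD_CAP:.3%}"
            )

    # Share of the whole corpus taken by each subspecialism.
    for (field, subfield), n in sorted(subfield_words.items()):
        if n / corpus_size > SUBFIELD_CAP:
            problems.append(
                f"subfield {field}/{subfield}: {n / corpus_size:.3%} "
                f"exceeds {SUBFIELD_CAP:.3%}"
            )

    return problems
```

Called with the planned document list and the target corpus size, the function returns an empty list when every cap is respected. A check of this kind only covers the raw proportions; it says nothing about the harder question raised above of whether a "non-distorted" resource is possible at all.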


