RE: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Adam Kilgarriff (adam.kilgarriff@itri.brighton.ac.uk)
Date: Mon Sep 30 2002 - 14:14:12 MET DST

    All,

    I'll second Martin's point about the hazards of specialised text.
    Once you start getting a largish quantity of specialised material in
    what aspires to be a general-purpose corpus, it rapidly gets
    distorted. My bugbear, in relation to the BNC, is GUT (Journal of
    Gastroenterology and Hepatology), of which there are 600,000 words (ie
    just 0.6%). 0.6% might not sound too large, but it is a very
    specialised text type, and means that words like

    gastric a
    mucosa n
    colitis n

    leap up into the 8000 most frequent words of English (a list that
    doesn't include

    pad v
    regulator n
    wavelength n
    prejudice v
    iron v
    voting a
    escort n
    dynasty n

    )
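
    A minimal sketch of the arithmetic behind this effect. The share is
    the 600,000-in-100-million figure above; the per-million rate inside
    the specialised slice is a purely hypothetical number chosen for
    illustration, not a measured BNC count, and the function name is
    likewise an assumption.

```python
def corpus_rate_per_million(slice_share, rate_in_slice, rate_elsewhere=0.0):
    """Rate of a word in the whole corpus (per million words), given its
    rate inside one specialised slice and its rate in everything else."""
    return slice_share * rate_in_slice + (1.0 - slice_share) * rate_elsewhere


# 600,000 words out of 100 million, as in the GUT example.
share = 600_000 / 100_000_000

# HYPOTHETICAL: suppose a term occurs 2,000 times per million words inside
# the specialised slice and essentially never anywhere else.
overall = corpus_rate_per_million(share, rate_in_slice=2_000)
print(f"share of corpus: {share:.1%}")
print(f"overall rate: {overall:.1f} per million words")
```

    Under that assumption the word ends up at roughly 12 occurrences per
    million across the whole corpus, from a single source. Whether that
    clears the threshold for a given frequency list depends on the list.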

    This sort of thing can only be avoided by not having too much
    data of any single specialised text type. I would recommend a limit of
    0.5% for linguistics papers in general, with no subspecialism (eg
    computational linguistics, or, worse still, parsing) taking more than
    a quarter of that, and a limit of 10,000 words from any single
    document.
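
    The limits above can be turned into word budgets. This is a sketch
    only: the 100-million-word corpus size is an assumption (it is the
    size implied by the BNC figures earlier in this message), and the
    function and key names are hypothetical.

```python
def word_budgets(corpus_words, domain_share=0.005,
                 subfield_fraction=0.25, per_document_cap=10_000):
    """Translate the proposed limits into absolute word counts."""
    domain_cap = round(corpus_words * domain_share)        # 0.5% of the corpus
    subfield_cap = round(domain_cap * subfield_fraction)   # a quarter of that
    return {
        "domain_cap": domain_cap,
        "subfield_cap": subfield_cap,
        "per_document_cap": per_document_cap,
        # How many maximum-length samples fit in one subspecialism.
        "full_samples_per_subfield": subfield_cap // per_document_cap,
    }


# ASSUMPTION: a 100-million-word corpus.
budgets = word_budgets(100_000_000)
for name, value in budgets.items():
    print(f"{name}: {value:,}")
```

    For a corpus of that size this gives 500,000 words for linguistics
    papers overall, 125,000 for any one subspecialism, and at most twelve
    full 10,000-word samples within a subspecialism.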

    BNC used a sample size from a single document of 40,000 words as its
    default. However, most of these documents weren't too specialised, so
    it didn't cause too many problems. It's the combination of
    substantial samples with narrow text-types that is invidious.

    I've only referred to distorted frequency lists in the above. They
    are the easiest effect of distortion to describe. There will also be
    distortions of all sorts of other language-model components (bigrams,
    trigrams, grammars, induced lexicons etc). The problem is that
    it's hard to describe what is distorted or how, and the distortions
    will usually go unnoticed, or even feature as "interesting discoveries
    about the language". That's why it's important to beware these balance
    issues when building a corpus in the first place.

    (And of course, "distortion" is a problem term here as it implies
    there is the possibility of a non-distorted resource. But I won't get
    into that one here...)

    > One way of avoiding this, and many other potential problems which can be
    > found in specialised language, would be to apply a criterion for inclusion
    > of texts in the corpus that they should not be too technical in nature.
    >

    I'm not sure I agree here. Technical language is an important part of
    language, and we are undermining an open-minded view of language if we
    exclude technical language wholesale. Maybe the corpus just needs to
    be much (MUCH) bigger so it can include substantial quantities of lots
    of different specialist text types, with none making up more than 0.1%
    of the whole (hey, I know a corpus like that, it's called the web ;-) )

         adam

    NEW!! MSc and Short Courses in Lexical Computing and Lexicography
    Info at

    http://www.itri.brighton.ac.uk/lexicom

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Adam Kilgarriff
    Senior Research Fellow tel: (44) 1273 642919
    Information Technology Research Institute (44) 1273 642900
    University of Brighton fax: (44) 1273 642908
    Lewes Road
    Brighton BN2 4GJ email: Adam.Kilgarriff@itri.bton.ac.uk
    UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    Martin Wynne writes:
    > Nancy's posting set off some very different alarm bells for me. I would like
    > to draw attention to what I think would be another problem with the
    > inclusion of texts from ACL proceedings in the American National Corpus.
    >
    > Let me start with an interesting case which I came across some years ago.
    > After hearing someone repeat the well-known fact that people don't say
    > 'powerful tea' in English, I thought it would be worth checking for
    > empirical evidence for this. I searched for the phrase in the BNC, and got 3
    > hits. All are from a text source listed as follows:
    >
    > "Large vocabulary semantic analysis for text recognition.
    > Rose, Tony Gerard, u.p.. Sample containing about 42217 words of unpublished
    > miscellanea (domain: applied science)"
    >
    > and they are discussions of exactly the same point, i.e. the fact that you
    > don't say 'powerful tea'.
    >
    > (Incidentally, I also searched in the whole Bank of English and found no
    > hits for "powerful tea", and 39 hits for "weak tea", so the original point
    > is not disproven.)
    >
    > In ACL articles you will also get citations of made-up examples like this,
    > plus listings of 'ungrammatical' sentences. Basically, this problem seems to
    > boil down to the fact that you get a lot of 'mention' rather than 'use' of
    > words and phrases in academic linguistic literature, and this could have a
    > fairly significant effect on the results of linguistic analysis of the
    > corpus. If one of the main reasons for building the corpus is to enable
    > researchers to analyse naturally occurring American English, in order to see
    > what does occur and what doesn't, then letting in lots of made-up example
    > sentences and phrases would make it less fit for the proposed purpose.
    >
    > One way of avoiding this, and many other potential problems which can be
    > found in specialised language, would be to apply a criterion for inclusion
    > of texts in the corpus that they should not be too technical in nature.
    >
    > __
    > Martin Wynne
    > martin.wynne@ota.ahds.ac.uk
    > Linguistics Officer
    > Oxford Text Archive
    >
    > Oxford University Computing Services
    > 13 Banbury Road
    > Oxford
    > UK - OX2 6NN
    > Tel: +44 1865 283299
    > Fax: +44 1865 273275
    >

    This archive was generated by hypermail 2b29 : Mon Sep 30 2002 - 14:24:12 MET DST