Corpora: Corpus representativeness: A "summary" of the query

From: Sampo Nevalainen (samponev@cc.joensuu.fi)
Date: Thu Aug 30 2001 - 15:07:03 MET DST

    First, I want to apologize for not writing this summary sooner. I sent my
    query to the corpus list in November 2000, that is, almost a year ago! I
    have been busy with other things, but I admit that the longish delay is
    partly due to my sloppiness. I would like to thank all those nice people
    who used (wasted?) their valuable time to answer my questions. Although I
    did not get many answers, they were all very interesting and important for
    me. I am grateful to the following people who kindly assisted me (in
    alphabetical order; no preference ;-)):

    Eric Atwell
    Eleanor Batchelder
    Pascual Cantos
    Florence Duclaye
    Bill Fisher
    Shlomo Izre'el
    Ramesh Krishnamurthy
    Petek Kurtboke
    Uta Lausberg de Morales
    Geoffrey Williams

    I apologize for my unintentional negligence if I have failed to mention
    someone I should have. Since some of the respondents wished to remain anonymous, I
    shall generally not refer to the author in the following compilation of
    e-mails, even though I take advantage of straight citations. (Consequently,
    it is (un)fortunately pretty easy for people involved to deduce "who said
    what"…) However, if you are interested in particular citations, you may ask
    me for the author to be contacted for further information, but only if
    (s)he did not wish to remain anonymous.

    I emphasize that the ideas presented below are not my personal thoughts
    (although I mostly agree with them). In general, the respondents seem
    to have a fairly good consensus on what representativeness is or SHOULD be
    in corpus linguistics, but as we all know, practice is often different from
    theory. Unsurprisingly, we'll see that there are several approaches to this
    issue, depending on the field of interest. While summing up the answers I
    have received so far, I am still willing to hear from people who have (any
    kind of) ideas about representativeness in corpus linguistics. (hint hint ;-))
    Feel free to contact me.

    The "summary" (read: a messy compilation of citations) is divided into
    three parts:
    1) Towards the concept of representativeness
            - short citations about representativeness as a concept
    2) Considerations and methods in the pursuit of representativeness
            - some general questions arising from the material
            - longer citations, for those who want more context :-)
    3) References and links

    Clarifying additions are presented in [square brackets], while (…) indicates
    that some text fragments were left out. Note that some of the citations in
    the first part also appear in the more extensive citations of the
    second part to ensure readability.

    ----------------------------------------------------------------------------------------------------------------

    1. TOWARDS THE CONCEPT OF REPRESENTATIVENESS

    " (…) "representativeness" depends on the application, there can be no such
    thing as a generically representative corpus."

    "We don't tackle the issue of representativeness directly but via
    predictability."

    "What is the corpus to be "representative" of?"

    "Representativeness depends on the purpose of the corpus."

    "For me, representativeness is without compromise: it includes sampling of
    both demographic varieties and contextual varieties."

    " "Representativeness" to me in that arena [speech recognition evaluation]
    means "How well is the test set represented by the training set?" "

    "The Brown corpus (1960s, Kucera and Francis) seems to be generally
    considered to be a "representative" corpus (…)"

    "A lot depends on your corpus, if you are building a reference corpus then
    you have to follow Atkins & Clear, Biber etc to have 'balanced' samples of
    different genre. If (…) you are concerned with special languages then you
    must change your criteria. (…) --- This is still not really representative,
    personally I don't believe that really exists. We replace this by
    justification."

    "Representativeness of a corpus implies that you are working on a
    particular theme, and you are trying to give people a general overview of
    it. (…) the keywords behind representativeness are : main subjects of a
    theme, brief information on these subjects, and links to know more if
    desired. (…) a representative corpus must remain as neutral as possible, so
    that the readers get an objective point of view of the subject. Or, if the
    theme requires to give an opinion, then it should give all the opinions
    existing on the same subject."

    2. CONSIDERATIONS AND METHODS IN THE PURSUIT OF REPRESENTATIVENESS

    general questions:
    - what is the corpus to be "representative" of?
    - how to measure representativeness?
    - how to define the structure of the corpus (categories of texts)?
    - what about variety? should we use language "production" or "consumption"
    as a criterion? how to judge "correctness" and "incorrectness"? is
    "vintage" a matter of date of production or date of consumption? what is
    the relationship between "ideal" and "actual"?
    - how to ensure comparability?

    " (…) "representativeness" depends on the application, there can be no such
    thing as a generically representative corpus. (…) for this [grammatical
    analysis and part-of-speech tagging], the genre of the text is less
    important than for, say, dialog-act modelling, since grammar varies less
    between genres (…). On the other hand, if every researcher is free to
    select their own "representative" text-set for their own application, how
    can we comparatively evaluate across research grounded on different
    corpora? --- (…) The original taggers for LOB, UPenn, ICE etc
    corpus-annotation schemes started from different "representative" corpora,
    so accuracy rates reported by these projects, in terms of their own
    "representative" corpora, may not be directly comparable."

    " "Representativeness" to me in that arena [speech recognition evaluation]
    means "How well is the test set represented by the training set?". (The
    usual paradigm is for a large sample of transcribed speech to be made
    available to sites being evaluated, for their use in automatically training
    their recognizers; then a smaller sample of similar material is presented
    to their recognizers for a test and the output hypothesized by the
    recognizers is scored against human-derived reference
    transcriptions.) It's widely regarded as an unfair test if the test data
    is not represented well by the training data. --- When the training set is
    explicitly defined, the representativeness of the test set can be estimated
    pretty well by the test set perplexity of the test set relative to a
    statistical language model derived solely from the training set. (…)"
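
    [Clarifying addition: the test-set perplexity mentioned above is
    conventionally defined as below. The notation is an assumption made here
    and not the respondent's own: w_1 ... w_N are the N tokens of the test
    set, and P_train is an n-gram language model estimated solely from the
    training set. A lower value suggests that the test set is better
    represented by the training set.]

```latex
\mathrm{PP}(w_1 \dots w_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N}
      \ln P_{\text{train}}\bigl(w_i \mid w_{i-n+1}, \dots, w_{i-1}\bigr) \right)
```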

    "Last year I worked on the question of whether two test sets drawn from
    telephone speech recorded at different times were equally difficult for
    recognizers to recognize. Since the training data was not a specific set,
    I tried to get at it by assuming that one factor of difficulty was the
    homogeneity of the test set; that is, a set of utterances that are more
    alike is inherently easier to recognize. This follows, I think, if you
    assume that the training data is drawn from a sample space typified by the
    test set. I then estimated the homogeneity of each test set by averaging
    the results of a number of randomized experiments, in each of which I
    measured the representativeness of a randomly-chosen tenth of the utterances
    relative to the rest, computing representativeness as the perplexity of the
    chosen utterances using an ngram language model trained up solely on the
    other nine-tenths of the utterances. In other words, homogeneity = average
    representativeness of one fraction of the set relative to the other. I
    made scripts and programs to do these calculations, but the project kind of
    bogged down at that point because the actual test results, which I would
    have used to validate my method, were in fact produced by sites all using
    the same arbitrary language model rather than ones trained up on different
    training sets. Also, I discovered that my work had been foreshadowed by
    Adam Kilgarriff and Tony Rose: check out their paper "Measures for Corpus
    Similarity and Homogeneity". "
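
    [Clarifying addition: a minimal sketch of the homogeneity estimate
    described in the citation above, not the respondent's actual scripts.
    Several details are assumptions: a bigram model with add-one smoothing
    stands in for the unspecified ngram model, utterances are lists of
    words, and the function names, the number of trials and the fixed random
    seed are all hypothetical. Lower average perplexity is read here as
    higher homogeneity.]

```python
import math
import random
from collections import Counter


def train_bigram_model(utterances):
    """Count bigrams and their histories over tokenised utterances."""
    histories, bigrams = Counter(), Counter()
    for words in utterances:
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Tokens the model may predict: seen words, end-of-utterance, unknown word.
    vocab = {w for words in utterances for w in words} | {"</s>", "<unk>"}
    return histories, bigrams, vocab


def perplexity(model, utterances):
    """Perplexity of utterances under an add-one smoothed bigram model."""
    histories, bigrams, vocab = model
    v = len(vocab)
    log_prob, n = 0.0, 0
    for words in utterances:
        known = [w if w in vocab else "<unk>" for w in words]
        tokens = ["<s>"] + known + ["</s>"]
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (histories[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


def homogeneity(utterances, trials=20, held_out_fraction=0.1, seed=0):
    """Average perplexity of a random tenth relative to the other nine-tenths."""
    if len(utterances) < 2:
        raise ValueError("need at least two utterances to split the set")
    rng = random.Random(seed)
    k = max(1, round(len(utterances) * held_out_fraction))
    scores = []
    for _ in range(trials):
        shuffled = list(utterances)
        rng.shuffle(shuffled)
        held_out, rest = shuffled[:k], shuffled[k:]
        scores.append(perplexity(train_bigram_model(rest), held_out))
    return sum(scores) / len(scores)
```

    [Under these assumptions, a test set made of near-identical utterances
    should score a lower average perplexity than one in which every
    utterance uses different words, which is the sense of "more alike is
    easier to recognize" in the citation.]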

    "The Brown corpus (1960s, Kucera and Francis) seems to be generally
    considered to be a "representative" corpus, and LOB, SEU and ICE corpora
    are designed in a very similar way: the corpus consists of 500 texts of
    2000 words each (to make a 1 million word corpus). 300 spoken and 200
    written texts. Spoken consists of 180 Dialogue texts and 120 Monologues.
    Written consists of 150 Printed and 50 Non-printed texts. Each of these
    categories is then subdivided, and so on. My objections to this "a priori"
    design are: a) some categories of texts are very difficult to obtain (e.g.
    business transactions, because of commercial confidentiality) b) many
    categories of texts are omitted (e.g. email) c) there is no justification
    for the proportions: I do not know of any sociolinguistic research which
    says that the average person consumes/produces 3/5 spoken language and 2/5
    written language (just to take the first main categorial division). The
    proportions for sub-categories are even more questionable."
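
    [Clarifying addition: the figures quoted above can be checked with a few
    lines of arithmetic. This is only a sketch using the numbers as given in
    the citation; the variable names are hypothetical.]

```python
from fractions import Fraction

# Text counts per category, exactly as quoted in the citation above.
design = {
    "spoken": {"dialogue": 180, "monologue": 120},
    "written": {"printed": 150, "non-printed": 50},
}
words_per_text = 2000

# Texts per main category, total number of texts, and total corpus size.
totals = {mode: sum(parts.values()) for mode, parts in design.items()}
n_texts = sum(totals.values())
corpus_words = n_texts * words_per_text

# Share of each main category as an exact fraction of all texts.
shares = {mode: Fraction(count, n_texts) for mode, count in totals.items()}

print(n_texts, corpus_words, shares)
```

    [The totals come to 500 texts and 1 million words, with spoken and
    written shares of 3/5 and 2/5, which are the proportions the respondent
    calls unjustified.]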

    - "What is the corpus to be "representative" of? Current estimates
    (Crystal, British Council, etc) suggest there are 1500 million speakers of
    English, 750m EFL speakers/users, 350m ESL, and 350m "native-speakers".
    Should a corpus of "contemporary English" include all of these?
    Representativeness depends on the purpose of the corpus. If we want to know
    what "modern English" is like, we should certainly include all types of
    speakers/users."
    - "What about "variety"? Some Thai users of English may favour American
    English, others British English, others Australian English. Most probably
    use a mixture."
    - "Should language "production" or "consumption" be the criterion? Most of
    us consume more than we produce in an average day, I suspect."
    - " "Correct" and "Incorrect": how are we to judge? Should this be a
    criterion? (Certainly it is for EFL dictionary compilers: what models of
    English should we be a) describing and b) recommending?"
    - " "Vintage": if we are collecting a corpus of "modern English", when does
    "modern" begin? Some texts written a long time ago are still popular (on
    reading lists, or e.g. Agatha Christie crime thrillers, P.G. Wodehouse,
    etc) - again, is it a matter of date of production or date of consumption?"
    - " "Ideal" vs "Actual": 50% of humans are men, 50% women. But what is the
    ratio of published books, newspaper articles, broadcast items, etc? Are men
    and women equally disseminated? I suspect not. So should the corpus reflect
    the actual reality/inequality, or the ideal? The former may reinforce
    stereotypes, the latter may just create new ones."

    "(…) If 1500 million people are using English every day, how can we ever
    capture more than an infinitesimal sample? Cobuild's Bank of English corpus
    is now 418 million words, and various people (Stubbs, Church and Lieberman,
    Gottlieb) have tried to estimate the amount of language an average human
    experiences in a lifetime, and end up with figures around the 500 million
    word mark. --- These are just a few of the problems relating to
    "representativeness" (…). But I have only been thinking of "modern
    English", not diachronic, not other languages, and the corpus only as
    written (…), not as audio or even video data (…) - because as linguists we
    ought to deal with pronunciation, intonation, etc and also with
    extra-linguistic aspects such as gesture (…) and who or what we are looking
    at when we're speaking, etc."

    "A lot depends on your corpus, if you are building a reference corpus then
    you have to follow Atkins & Clear, Biber etc to have 'balanced' samples of
    different genre. If, like me, you are concerned with special languages then
    you must change your criteria. I have always thrown out the idea of
    sublanguages as defined by Harris, and used in much NLP and IA research.
    This is a generative approach, and like all generative approaches tends to
    ignore reality. The classical sublanguage approach views science languages
    as realisations of bibliographical systems, such as Dewey. They go deeper
    into the Dewey system and then try to justify boundaries that delimit one
    group from another. This is not very useful (…) in that they ignore
    multidisciplinarity which is the basis of all research, for instance in
    medicine you call upon biology, chemistry, statistics, if you remove all of
    these you have nothing left. (…) Outside of humans, language does not
    exist, there is no Platonic cave of reality out there. --- If language is
    essentially human, it would seem more intelligent to approach
    representativeness from the point of view of the language users, anathema
    to a generative linguist. To do this rather than think in terms of
    disciplines we think in terms of discourse communities and define
    representative in terms of the language they produce. This is still not
    really representative, personally I don't believe that really exists. We
    replace this by justification."

    3. REFERENCES AND LINKS

    "Check the archive of corpora-list, as I'm sure, as you yourself state,
    that this topic has been discussed before. Biber, Biber and Finegan, Leech,
    Sinclair, Stubbs, Atkins and Clear and Ostler, and many others have
    certainly written about this topic."

    "for representativeness of oral corpora you can read introduction books to
    quantitative sociology, as well as literature about Latin American language
    atlases. Next year I [Dr. Uta Lausberg de Morales] will publish an article
    in the journal "neue romania" (Berlin) about an oral corpus of Guatemalan
    Spanish, and there I will allude to the problem of representativeness as
    well (the article will be in German)."

    Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S.
    2000. A comparative evaluation of modern English corpus grammatical
    annotation schemes. ICAME Journal, volume 24, pages 7-23, International
    Computer Archive of Modern and Medieval English, HIT Centre, Bergen
    University. ISSN: 0801-5775

    Bowker, L. Towards a methodology for exploiting specialised target language
    corpora as translation resources. International Journal of Corpus
    Linguistics. Vol.5/1: 17-52.

    Aquilino Sánchez and Pascual Cantos (1997) "Predictability of Word Forms
    (Types) and Lemmas in Linguistic Corpora. A Case Study Based on the
    Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary
    Spanish". International Journal of Corpus Linguistics 2/2: 259-280. (See
    abstract http://solaris3.ids-mannheim.de/~ijcl/ijcl-2-2.html).

    Sánchez, A. and P. Cantos (1998) "El ritmo incremental de palabras nuevas
    en los repertorios de textos. Estudio experimental y comparativo basado en
    dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las
    lenguas inglesa y española y en cinco autores de ambas lenguas". ATLANTIS,
    19/2: 205-223.

    Meyer, I., Mackintosh, K., The Corpus from a Terminographer's viewpoint.
    International Journal of Corpus Linguistics. Vol.1/2: 257-285.

    Williams, G. 1998. Collocational Networks: Interlocking Patterns of Lexis
    in a Corpus of Plant Biology Research Articles. International Journal of
    Corpus Linguistics. Vol.3/1: 151-171.

    Williams, G. 1999. Looking in before looking out: Internal selection
    criteria in a corpus of plant biology. Papers in Computational
    Lexicography. Complex '99. Hungary: Budapest.: 195-204.

    Yang, Dan-Hee, Cantos, P. and Song, Mansuk (2000) "An Algorithm for
    Predicting the Relationship between Lemmas and Corpus Size", ETRI Journal,
    22/2: 20-31 (http://etlars.etri.re.kr/etrij/index.html)

    The Corpus of Spoken Israeli Hebrew:
    http://spinoza.tau.ac.il/hci/dep/semitic/maamad.html (Hebrew text)
    http://spinoza.tau.ac.il/hci/dep/semitic/cosih.html (English text)

    Have a look at
    http://www.vicnet.net.au/~petek/thesis/

    Try the archives at http://www.hit.uib.no/corpora/

    ( : ============================================= : )

    Sampo Nevalainen, M.A.
    Researcher
    University of Joensuu
    Savonlinna School of Translation Studies
    P.O.Box 48
    FIN-57101 Savonlinna
    FINLAND

    tel +358-15-511 70 (operator)
            +358-15-511 7704
    fax +358-15-515 096
    email samponev@cc.joensuu.fi


