Corpora: Corpus representativeness: A "summary" of the query

From: Sampo Nevalainen (samponev@cc.joensuu.fi)
Date: Thu Aug 30 2001 - 15:07:03 MET DST

First, I want to apologize for not writing this summary sooner. I sent my
query to the corpus list in November 2000, that is, almost a year ago! I
have been busy with other things, but I admit that the longish delay is
partly due to my sloppiness. I would like to thank all those nice people
who used (wasted?) their valuable time to answer my questions. Although I
did not get many answers, they were all very interesting and important for
me. I am grateful to the following people who kindly assisted me (in
alphabetical order; no preference ;-)):

Eric Atwell
Eleanor Batchelder
Pascual Cantos
Florence Duclaye
Bill Fisher
Shlomo Izre'el
Ramesh Krishnamurthy
Petek Kurtboke
Uta Lausberg de Morales
Geoffrey Williams

I apologize for my unintentional negligence if I have failed to mention
someone I should have. Since some of the respondents wished to remain
anonymous, I shall generally not name the author in the following
compilation of e-mails, even though I quote them directly. (Consequently,
it is (un)fortunately pretty easy for the people involved to deduce "who
said what"…) However, if you are interested in a particular citation, you
may ask me to put you in touch with its author for further information, but
only if (s)he did not wish to remain anonymous.

I stress that the ideas presented below are not my personal thoughts
(although I mostly agree with them). In general, the respondents seem
to have reached a fair consensus on what representativeness is or SHOULD be
in corpus linguistics, but as we all know, practice often differs from
theory. Unsurprisingly, we'll see that there are several approaches to this
issue, depending on the field of interest. Although I am summing up the
answers I have received so far, I am still willing to hear from people who
have (any kind of) ideas about representativeness in corpus linguistics.
(hint hint ;-)) Feel free to contact me.

The "summary" (read: a messy compilation of citations) is divided into
three parts:
1) Towards the concept of representativeness
        - short citations about representativeness as a concept
2) Considerations and methods in the pursuit of representativeness
        - some general questions arising from the material
        - longer citations, for those who want more context :-)
3) References and links

Clarifying additions are presented in [square brackets], while (…) indicates
that some text fragments were left out. Note that some of the citations in
the first part are repeated within the more extensive citations of the
second part to ensure readability.

----------------------------------------------------------------------------------------------------------------

1. TOWARDS THE CONCEPT OF REPRESENTATIVENESS

" (…) "representativeness" depends on the application, there can be no such
thing as a generically representative corpus."

"We don't tackle the issue of representativeness directly but via
predictability."

"What is the corpus to be "representative" of?"

"Representativeness depends on the purpose of the corpus."

"For me, representativeness is without compromise: it includes sampling of
both demographic varieties and contextual varieties."

" "Representativeness" to me in that arena [speech recognition evaluation]
means "How well is the test set represented by the training set?" "

"The Brown corpus (1960s, Kucera and Francis) seems to be generally
considered to be a "representative" corpus (…)"

"A lot depends on your corpus, if you are building a reference corpus then
you have to follow Atkins & Clear, Biber etc to have 'balanced' samples of
different genre. If (…) you are concerned with special languages then you
must change your criteria. (…) --- This is still not really representative,
personally I don't believe that really exists. We replace this by
justification."

"Representativeness of a corpus implies that you are working on a
particular theme, and you are trying to give people a general overview of
it. (…) the keywords behind representativeness are : main subjects of a
theme, brief information on these subjects, and links to know more if
desired. (…) a representative corpus must remain as neutral as possible, so
that the readers get an objective point of view of the subject. Or, if the
theme requires to give an opinion, then it should give all the opinions
existing on the same subject."

2. CONSIDERATIONS AND METHODS IN THE PURSUIT OF REPRESENTATIVENESS

general questions:
- what is the corpus to be "representative" of?
- how to measure representativeness?
- how to define the structure of the corpus (categories of texts)?
- what about variety? should we use language "production" or "consumption"
as a criterion? how to judge "correctness" and "incorrectness"? is
"vintage" a matter of date of production or date of consumption? what is
the relationship between "ideal" and "actual"?
- how to ensure comparability?

" (…) "representativeness" depends on the application, there can be no such
thing as a generically representative corpus. (…) for this [grammatical
analysis and part-of-speech tagging], the genre of the text is less
important than for, say, dialog-act modelling, since grammar varies less
between genres (…). On the other hand, if every researcher is free to
select their own "representative" text-set for their own application, how
can we comparatively evaluate across research grounded on different
corpora? --- (…) The original taggers for LOB, UPenn, ICE etc
corpus-annotation schemes started from different "representative" corpora,
so accuracy rates reported by these projects, in terms of their own
"representative" corpora, may not be directly comparable."

" "Representativeness" to me in that arena [speech recognition evaluation]
means "How well is the test set represented by the training set?". (The
usual paradigm is for a large sample of transcribed speech to be made
available to sites being evaluated, for their use in automatically training
their recognizers; then a smaller sample of similar material is presented
to their recognizers for a test and the output hypothesized by the
recognizers is scored against human-derived reference
transcriptions.) It's widely regarded as an unfair test if the test data
is not represented well by the training data. --- When the training set is
explicitly defined, the representativeness of the test set can be estimated
pretty well by the test set perplexity of the test set relative to a
statistical language model derived solely from the training set. (…)"
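
[Editorial illustration, not part of the respondent's message.] The
perplexity measure described above can be sketched in a few lines of Python.
This is only a minimal sketch under stated assumptions: it uses a unigram
model with add-one smoothing as a stand-in for whatever statistical language
model a real evaluation would use, and the function names train_unigram and
perplexity, as well as the toy sentences, are invented for the example.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count word frequencies in the training tokens."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(test_tokens, counts, total):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    if not test_tokens:
        raise ValueError("test set must contain at least one token")
    # Vocabulary = training word types plus one slot shared by all unseen words
    # (an assumption made here to keep the probabilities summing to one).
    vocab_size = len(counts) + 1
    log_prob = 0.0
    for token in test_tokens:
        p = (counts.get(token, 0) + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))


# Toy data: the model is derived solely from the training set.
training = "the cat sat on the mat".split()
counts, total = train_unigram(training)

in_domain = perplexity("the cat sat".split(), counts, total)
out_of_domain = perplexity("stocks fell sharply".split(), counts, total)

# Lower perplexity means the test set is better represented by the training set.
print(in_domain, out_of_domain)
```

On this reading, a test set whose perplexity is much higher than that of
comparable material would count as poorly represented by the training data.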

"Last year I worked on the question of whether two test sets drawn from
telephone speech recorded at different times were equally difficult for
recognizers to recognize. Since the training data was not a specific set,
I tried to get at it by assuming that one factor of difficulty was the
homogeneity of the test set; that is, a set of utterances that are more
alike is inherently easier to recognize. This follows, I think, if you
assume that the training data is drawn from a sample space typified by the
test set. I then estimated the homogeneity of each test set by averaging
the results of a number of randomized experiments, in each of which I
measured the representativeness of a randomly-chosen tenth of the utterances
relative to the rest, computing representativeness as the perplexity of the
chosen utterances using an ngram language model trained up solely on the
other nine-tenths of the utterances. In other words, homogeneity = average
representativeness of one fraction of the set relative to the other. I
made scripts and programs to do these calculations, but the project kind of
bogged down at that point because the actual test results, which I would
have used to validate my method, were in fact produced by sites all using
the same arbitrary language model rather than ones trained up on different
training sets. Also, I discovered that my work had been foreshadowed by
Adam Kilgarriff and Tony Rose: check out their paper "Measures for Corpus
Similarity and Homogeneity". "
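
[Editorial illustration, not part of the respondent's message.] The
randomized procedure described above (hold out a random tenth, train on the
other nine-tenths, measure perplexity, and average over repetitions) might
look roughly like the following Python sketch. It is not the respondent's
actual scripts: the unigram model with add-one smoothing replaces the ngram
model mentioned in the text, and the names homogeneity, trials, fraction and
seed are hypothetical. Utterances are assumed to be non-empty strings.

```python
import math
import random
from collections import Counter


def perplexity(held_out, training):
    """Perplexity of held_out tokens under an add-one unigram model of training."""
    counts = Counter(training)
    total = len(training)
    vocab_size = len(counts) + 1  # one extra slot for unseen words (assumption)
    log_prob = sum(
        math.log2((counts.get(tok, 0) + 1) / (total + vocab_size))
        for tok in held_out
    )
    return 2 ** (-log_prob / len(held_out))


def homogeneity(utterances, trials=20, fraction=0.1, seed=0):
    """Average perplexity of a random held-out fraction relative to the rest.

    Lower values indicate a more homogeneous set of utterances.
    """
    rng = random.Random(seed)
    n_held = max(1, round(len(utterances) * fraction))
    scores = []
    for _ in range(trials):
        shuffled = utterances[:]
        rng.shuffle(shuffled)
        held, rest = shuffled[:n_held], shuffled[n_held:]
        held_tokens = [tok for utt in held for tok in utt.split()]
        rest_tokens = [tok for utt in rest for tok in utt.split()]
        scores.append(perplexity(held_tokens, rest_tokens))
    return sum(scores) / len(scores)


# Toy sets: one with identical utterances, one where no word is ever repeated.
uniform_set = ["hello there"] * 20
mixed_set = [f"a{i} b{i}" for i in range(20)]

print(homogeneity(uniform_set), homogeneity(mixed_set))
```

Fixing the seed makes the randomized experiments repeatable, which helps when
two test sets are to be compared under the same conditions.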

"The Brown corpus (1960s, Kucera and Francis) seems to be generally
considered to be a "representative" corpus, and LOB, SEU and ICE corpora
are designed in a very similar way: the corpus consists of 500 texts of
2000 words each (to make a 1 million word corpus). 300 spoken and 200
written texts. Spoken consists of 180 Dialogue texts and 120 Monologues.
Written consist of 150 Printed and 50 Non-printed texts. Each of these
categories are then subdivided, and so on. My objections to this "a priori"
design are: a) some categories of texts are very difficult to obtain (e.g.
business transactions, because of commercial confidentiality) b) many
categories of texts are omitted (e.g. email) c) there is no justification
for the proportions: I do not know of any sociolinguistic research which
says that the average person consumes/produces 3/5 spoken language and 2/5
written language (just to take the first main categorial division). The
proportions for sub-categories are even more questionable."
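
[Editorial illustration, not part of the respondent's message.] The
arithmetic behind the design quoted above can be checked with a short Python
sketch. The figures are taken from the citation as given; the dictionary
layout and the function name summarize are assumptions made for the example.

```python
from fractions import Fraction

WORDS_PER_TEXT = 2000

# Text counts per category, exactly as quoted in the citation above.
design = {
    "spoken": {"dialogue": 180, "monologue": 120},
    "written": {"printed": 150, "non-printed": 50},
}


def summarize(design, words_per_text=WORDS_PER_TEXT):
    """Return total texts, total words, and each main category's share."""
    total_texts = sum(sum(sub.values()) for sub in design.values())
    shares = {
        mode: Fraction(sum(sub.values()), total_texts)
        for mode, sub in design.items()
    }
    return total_texts, total_texts * words_per_text, shares


total_texts, total_words, shares = summarize(design)
print(total_texts, total_words, shares)
```

The shares make the respondent's point explicit: the design fixes spoken
language at 3/5 and written language at 2/5 of the corpus before any text is
collected.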

- "What is the corpus to be "representative" of? Current estimates
(Crystal, British Council, etc) suggest there are 1500 million speakers of
English, 750m EFL speakers/users, 350m ESL, and 350m "native-speakers".
Should a corpus of "contemporary English" include all of these?
Representativeness depends on the purpose of the corpus. If we want to know
what "modern English" is like, we should certainly include all types of
speakers/users."
- "What about "variety"? Some Thai users of English may favour American
English, others British English, others Australian English. Most probably
use a mixture."
- "Should language "production" or "consumption" be the criterion? Most of
us consume more than we produce in an average day, I suspect."
- " "Correct" and "Incorrect": how are we to judge? Should this be a
criterion? (Certainly it is for EFL dictionary compilers: what models of
English should we be a) describing and b) recommending?"
- " "Vintage": if we are collecting a corpus of "modern English", when does
"modern" begin? Some texts written a long time ago are still popular (on
reading lists, or e.g. Agatha Christie crime thrillers, P.G. Wodehouse,
etc) - again, is it a matter of date of production or date of consumption?"
- " "Ideal" vs "Actual": 50% of humans are men, 50% women. But what is the
ratio of published books, newspaper articles, broadcast items, etc? Are men
and women equally disseminated? I suspect not. So should the corpus reflect
the actual reality/inequality, or the ideal? The former may reinforce
stereotypes, the latter may just create new ones."

"(…) If 1500 million people are using English every day, how can we ever
capture more than an infinitesimal sample? Cobuild's Bank of English corpus
is now 418 million words, and various people (Stubbs, Church and Lieberman,
Gottlieb) have tried to estimate the amount of language an average human
experiences in a lifetime, and end up with figures around the 500 million
word mark. --- These are just a few of the problems relating to
"representativeness" (…). But I have only been thinking of "modern
English", not diachronic, not other languages, and the corpus only as
written (…), not as audio or even video data (…) - because as linguists we
ought to deal with pronunciation, intonation, etc and also with
extra-linguistic aspects such as gesture (…) and who or what we are looking
at when we're speaking, etc."

"A lot depends on your corpus, if you are building a reference corpus then
you have to follow Atkins & Clear, Biber etc to have 'balanced' samples of
different genre. If, like me, you are concerned with special languages then
you must change your criteria. I have always thrown out the idea of
sublanguages as defined by Harris, and used in much NLP and IA research.
This is a generative approach, and like all generative approaches tends to
ignore reality. The classical sublanguage approach views science languages
as realisations of bibliographical systems, such as Dewey. They go deeper
into the Dewey system and then try to justify boundaries that delimit one
group from another. This is not very useful (…) in that they ignore
multidisciplinarity which is the basis of all research, for instance in
medicine you call upon biology, chemistry, statistics, if you remove all of
these you have nothing left. (…) Outside of humans, language does not
exist, there is no Platonic cave of reality out there. --- If language is
essentially human, it would seem more intelligent to approach
representativeness from the point of view of the language users, anathema
to a generative linguist. To do this rather than think in terms of
disciplines we think in terms of discourse communities and define
representative in terms of the language they produce. This is still not
really representative, personally I don't believe that really exists. We
replace this by justification."

3. REFERENCES AND LINKS

"Check the archive of corpora-list, as I'm sure, as you yourself state,
that this topic has been discussed before. Biber, Biber and Finegan, Leech,
Sinclair, Stubbs, Atkins and Clear and Ostler, and many others have
certainly written about this topic."

"for representativeness of oral corpora you can read introduction books to
quantitative sociology, as well as literature about Latin American language
atlases. Next year I [Dr. Uta Lausberg de Morales] will publish an article
in the journal "neue romania" (Berlin) about an oral corpus of Guatemalan
Spanish, and there I will allude to the problem of representativeness as
well (the article will be in German)."

Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S.
2000. A comparative evaluation of modern English corpus grammatical
annotation schemes. ICAME Journal, volume 24, pages 7-23, International
Computer Archive of Modern and Medieval English, HIT Centre, Bergen
University. ISSN: 0801-5775

Bowker, L. Towards a methodology for exploiting specialised target language
corpora as translation resources. International Journal of Corpus
Linguistics. Vol.5/1: 17-52.

Aquilino Sánchez and Pascual Cantos (1997) "Predictability of Word Forms
(Types) and Lemmas in Linguistic Corpora. A Case Study Based on the
Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary
Spanish". International Journal of Corpus Linguistics 2/2: 259-280. (See
abstract http://solaris3.ids-mannheim.de/~ijcl/ijcl-2-2.html).

Sánchez, A. and P. Cantos (1998) "El ritmo incremental de palabras nuevas
en los repertorios de textos. Estudio experimental y comparativo basado en
dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las
lenguas inglesa y española y en cinco autores de ambas lenguas". ATLANTIS,
19/2: 205-223.

Meyer, I., Mackintosh, K., The Corpus from a Terminographer's viewpoint.
International Journal of Corpus Linguistics. Vol.1/2: 257-285.

Williams, G. 1998. Collocational Networks: Interlocking Patterns of Lexis
in a Corpus of Plant Biology Research Articles. International Journal of
Corpus Linguistics. Vol.3/1: 151-171.

Williams, G. 1999. Looking in before looking out: Internal selection
criteria in a corpus of plant biology. Papers in Computational
Lexicography. Complex '99. Hungary: Budapest.: 195-204.

Yang, Dan-Hee, Cantos, P. and Song, Mansuk (2000) "An Algorithm for
Predicting the Relationship between Lemmas and Corpus Size", ETRI Journal,
22/2: 20-31 (http://etlars.etri.re.kr/etrij/index.html)

The Corpus of Spoken Israeli Hebrew:
http://spinoza.tau.ac.il/hci/dep/semitic/maamad.html (Hebrew text)
http://spinoza.tau.ac.il/hci/dep/semitic/cosih.html (English text)

Have a look at
http://www.vicnet.net.au/~petek/thesis/

Try the archives at http://www.hit.uib.no/corpora/

( : ============================================= : )

Sampo Nevalainen, M.A.
Researcher
University of Joensuu
Savonlinna School of Translation Studies
P.O.Box 48
FIN-57101 Savonlinna
FINLAND

tel +358-15-511 70 (operator)
+358-15-511 7704
fax +358-15-515 096
email samponev@cc.joensuu.fi
