Corpora: Corpus Linguistics Methodologies. Was: Corpus Linguistics User Needs

Marc Weeber (M.Weeber@farm.rug.nl)
Mon, 3 Aug 1998 15:19:25 +0200 (CEST)

hello Oliver, Ylva and all the other corpus people,

First, I would like to thank Oliver and Ylva for starting this very
interesting thread. I think one of the discussion items is whether
linguists should implement the earlier mentioned `bag of tricks'
themselves, for instance, how to put corpus (frequency) data into an
SPSS-readable format. But I would like to add an extra dimension to the
discussion: in what way should corpus linguists _understand_ the bag of
tricks? More specifically: to what extent do corpus linguists understand
the mathematical/statistical aspects of corpus linguistics?
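
To make the SPSS example concrete, here is a minimal sketch of such a
"trick" in Python. It assumes the corpus is already tokenized and that a
comma-separated text file with a header row is an acceptable import format
for the statistics package. The function name write_frequency_table and the
column names are hypothetical, chosen only for illustration.

```python
import csv
import sys
from collections import Counter


def write_frequency_table(tokens, out):
    """Write a word/frequency table as comma-separated text.

    tokens: an iterable of already-tokenized word forms (assumption).
    out:    any writable text stream, e.g. an open file or sys.stdout.
    """
    counts = Counter(tokens)
    writer = csv.writer(out)
    writer.writerow(["word", "frequency"])  # header row for the import step
    # Most frequent first; ties are broken alphabetically for stable output.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, freq])


if __name__ == "__main__":
    # Toy input; a real corpus would be read and tokenized from files.
    write_frequency_table("the cat saw the dog".split(), sys.stdout)
```

The resulting file can then be read in as delimited text data.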

At the risk of being off-topic to this thread, I'll explain myself. In my
own research, I started with the application of some `standard' corpus
frequency analysis techniques. I looked at the IMS workbench and WordSmith,
and used some of the tools Ted Dunning used for his 1993 Computational
Linguistics article. The outcome of our research showed some unexpected
but interesting phenomena. In the end, we could explain (and predict)
these phenomena by the statistics we used (our paper is in prep.).

The previous paragraph may be a bit cryptic, so here is an example. The
WordSmith toolbox computes "significant words". The statistic used is
Chi^2. Dunning discussed the use of this statistic in his 1993 article. It
has also been discussed in this forum in 1996 (contact Tony Berber Sardinha,
tony4@uol.com.br; he started the "corpus comparison" discussion AFAIK).
Dunning proposed the log-likelihood ratio as a better statistic for word
frequency analysis. He suggested that exact tests may even be better.
Pedersen et al (1996, http://www.seas.smu.edu/~pedersen/aaai96-cmpl.ps.gz)
dug into this suggestion.
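
For readers who want to see the two statistics side by side, here is a
minimal sketch in Python. It is not the code of WordSmith or of Dunning's
tools; it only applies the textbook formulas for Pearson's Chi^2 and the
log-likelihood ratio (G^2) to a 2x2 table for one word in two corpora. The
function names and the toy counts are my own assumptions.

```python
import math


def contingency_table(a, b, n1, n2):
    """Observed 2x2 counts for one word.

    a, b:   occurrences of the word in corpus 1 and corpus 2
    n1, n2: total number of tokens in corpus 1 and corpus 2
    Row 1 holds the word itself, row 2 all other tokens.
    """
    return [[a, b], [n1 - a, n2 - b]]


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Pearson's Chi^2: sum of (O - E)^2 / E over all cells.

    Assumes every row and column total is positive.
    """
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


def log_likelihood_g2(table):
    """Log-likelihood ratio G^2: 2 * sum of O * ln(O / E).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    expected = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, expected)
                     for o, e in zip(obs_row, exp_row)
                     if o > 0)


if __name__ == "__main__":
    # Toy example: a rare word seen 5 times in one corpus and never in the
    # other, both corpora 100,000 tokens long (invented numbers).
    table = contingency_table(5, 0, 100_000, 100_000)
    print("Chi^2:", pearson_chi2(table))
    print("G^2:  ", log_likelihood_g2(table))
```

With counts this small the two statistics already give noticeably different
values, which is the kind of situation Dunning's article is concerned with.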

However, up to now, I see little impact of Dunning's and Pedersen's
findings in Corpus Linguistics. I know that the statistical aspects are
not every linguist's cup of tea, but a mere application of
supposed standards will not do anymore: the researcher needs to really
understand the techniques he/she uses. And because there are not many well
established standards in Corpus Linguistics, as pointed out by Oliver
Mason, the researcher should know the background and should, ideally, be
capable of implementing new insights him/herself. A corpus linguist who
waits for a new version of WordSmith should know that, in the meantime,
he/she is using flawed methodologies.

To wrap things up: I think it is very important for a corpus linguist to
really understand the bag of tricks. And because corpus linguistics has no
bag of standard tricks (not yet?), corpus linguists should have
programming skills. From an educational point of view: corpus linguists
should rather attend courses such as CIMQL (Second Workshop in
Computationally-Intensive Methods in Quantitative Linguistics,
http://www.stats.gla.ac.uk/~cimql/) than yet another C++ programming course.

regards,

Marc

---------------------------------------------------------
Marc Weeber _______ http://www.farm.rug.nl/marc/
marc@farm.rug.nl | ICQ# 13846351
---------------------0-----------------------------------