Corpora: corpora and software

Celia Pereira Caldas (ccaldas@uerj.br)
Wed, 29 Jul 1998 09:04:24 -0300

Let me try to convey some of my experience as a research student at
Sussex, where this problem of writing programs for research purposes
came up a few times at crucial points in my PhD work.

I am a fairly experienced user of computers and I do not suffer from
technophobia, but I am a linguist and I cannot write programs, although
I do manage to handle some 'inventions' combining Unix commands and the
like. As a result, I claimed several victims during my research, that
is, I convinced people at the School of Cognitive and Computing
Sciences, Sussex, to write programs for my research purposes. That
includes, I am afraid, Geoffrey Sampson himself, who was my supervisor
and a victim in more ways than one, having written a program to do
"qualified" word counting, that is, counting that leaves out some
undesirable elements of the London-Lund coding, such as codes like
<2syl> and the line numbers, so as to get a more precise total of
words.
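
To give an idea of what "qualified" counting involves, here is a
minimal sketch in Python. It is not Geoffrey Sampson's program, only an
illustration under assumptions of my own: that codes are anything
between angle brackets, as in <2syl>, and that a line number is a run
of digits at the start of a line. The real London-Lund format has more
conventions than this, and the function name and sample lines are
invented for the example.

```python
import re

# Assumption: a code is anything between angle brackets, e.g. <2syl>.
CODE = re.compile(r"<[^<>]*>")
# Assumption: a line number is a run of digits at the start of a line.
LINE_NUMBER = re.compile(r"^\s*\d+\s+")
# A word is a run of letters, optionally with internal apostrophes or hyphens.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def qualified_word_count(lines):
    """Count words, leaving out line numbers and bracketed codes."""
    total = 0
    for line in lines:
        line = LINE_NUMBER.sub("", line)   # drop the leading line number
        line = CODE.sub(" ", line)         # drop codes such as <2syl>
        total += len(WORD.findall(line))
    return total


# Invented sample lines, for illustration only.
sample = [
    "101 well <2syl> I don't know",
    "102 it was <2syl> a long day",
]
print(qualified_word_count(sample))  # prints 9
```

The patterns are the part to adapt: anything else you want excluded
from the total can be added as one more substitution before the words
are counted.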

Out of this experience, I would suggest a "middle of the road" approach,
although I do think Geoff is quite right in general. It is just that
pressures of the moment, such as the rush to complete your PhD with
funds running out, may make a detour to learn programming extra hard. I
will list some of the needs I had which I believe may be fairly general
as research needs for corpus linguists.

1. Qualified word counting, that is, counting that excludes specified items

2. Interfacing to statistical packages. Miles Dennis wrote a program
that picked my annotation out of the corpora, put it in tabulated
columns and then, through "sed" and "cut" Unix commands, transformed it
all into numbers ready to be fed into SPSS. It works like a dream.

3. Word frequency counts that can do the counting in specified bits of
the corpora, without having to cut and paste the bits into separate files.
Stephen Eglen wrote software that, combined with Berkeley HUM, gives me
frequency counts for every twenty-line bit. This was most helpful for
topic tracking and discourse segmentation.

4. Any software that gives you probabilities for combinations of
categories in a classification. I had four properties to classify
anaphora cases. Luis Gonzalez wrote a piece of Perl software that would
give me, out of aggregate counts, the probability that an anaphoric
demonstrative pronoun had an explicit or implicit antecedent, a
discourse-chunk or noun phrase antecedent, and other details
I need not specify here. I believe there could be a program to do that
for any kind of classification of linguistic phenomena involving three
or more properties.
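
Needs 3 and 4 above can be sketched in a few lines of Python. This is
not the software Stephen Eglen or Luis Gonzalez wrote, only a rough
illustration under my own assumptions: words are whatever is separated
by white space, a "bit" is a fixed number of lines (chunk_size, twenty
by default), and the probabilities are plain relative frequencies
computed from aggregate counts. The function names and the figures in
the usage example are invented.

```python
from collections import Counter


def chunk_frequencies(lines, chunk_size=20):
    """Return one Counter of word frequencies per chunk of chunk_size lines."""
    chunks = []
    for start in range(0, len(lines), chunk_size):
        counts = Counter()
        for line in lines[start:start + chunk_size]:
            # Assumption: words are separated by white space.
            counts.update(line.lower().split())
        chunks.append(counts)
    return chunks


def category_probabilities(counts):
    """Turn aggregate counts per category into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Invented data, for illustration only.
lines = ["the cat"] * 20 + ["the dog"] * 5
chunks = chunk_frequencies(lines)
print(len(chunks))        # prints 2
print(chunks[1]["dog"])   # prints 5

antecedents = {"explicit": 30, "implicit": 10}
print(category_probabilities(antecedents))
# prints {'explicit': 0.75, 'implicit': 0.25}
```

For combinations of properties, the same second function can be fed
counts keyed by pairs, such as ("explicit", "noun phrase"), instead of
single categories.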

Alternatively, you may develop some talent for smiling, buying cups of
coffee, complaining loudly at the PhD lab and other techniques to get
people to write bits of software for you. Of course this is easier at
COGS, where linguists and computer people are inside the same building.

Marco Rocha

Marco A.E. Rocha
ccaldas@uerj.br