Teaching corpus linguistics

Marti Hearst (hearst@parc.xerox.com)
Wed, 17 Jan 1996 10:12:47 PST

During the spring term I am going to teach under the headline "CORPUS
LINGUISTICS", and wondered if any out there had a few up to date articles
on that subject. I have planned to use Sinclair, John: "Corpus,

I've enclosed a relevant book review.
Marti

Using Large Corpora. S. Armstrong (Ed.). MIT Press (A Bradford book),
Cambridge, MA (1994). viii + 349pp, $37.50, ISBN 0-262-51082-0.

Review by Marti A. Hearst, in
Journal of Information Processing and Management, 31 (5), 1995.

Statistical, corpus-based approaches to natural language analysis have
recently become widespread in the computational linguistics community.
Because these empirical algorithms tend to be more robust, have better
coverage, and require much less manual coding than classic natural
language processing algorithms, this shift in the field may prove very
important to the information access community.

Although the merit of intensive language analysis for standard
information retrieval algorithms is still open to question, there is
growing evidence that corpus-based computations, such as
collection-sensitive thesaurus creation, can improve results. In
addition, some more fine-grained tasks, such as question-answering,
information filtering, and multi-lingual retrieval, seem to benefit
from the robust, shallow analysis that corpus-based methods can
provide.

A useful resource for those wishing to learn about this relatively new
field can be found in Using Large Corpora, edited by Susan
Armstrong. Essentially, this book contains the contents of a
two-volume special issue of the journal Computational
Linguistics (published in 1993) which explores the use of
large text collections in empirical language analysis. The book
consists of a preface by the editor, thirteen technical papers, and a
helpful index.

The introductory chapter, by Church and Mercer, is by far
the best published introduction to this burgeoning field. In the
space of 24 pages, it describes the historical precedents (including
the strong influence of the speech analysis community and the
empiricist vs. rationalist debate), provides a useful, coherent, and
correct exposition of the basic statistical assumptions and techniques
popularly used in the field, and discusses general issues such as
corpus size and coverage.

The remaining papers cover a variety of tasks and algorithms, and
provide references for many techniques that are not addressed
explicitly in the book. The papers by Biber, Brent, Marcus et al.,
Pustejovsky et al., and Smadja explore various aspects of lexical
analysis and lexical acquisition. Brown et al. have contributed an
important paper, previously difficult to obtain, about statistical
machine translation at IBM Watson. Papers by Kay and Roscheisen, and
Gale and Church, describe algorithms for multi-lingual sentence
alignment (that is, determining the correspondences between sentences
in a document and sentences in translations of that document).
Briscoe and Carroll describe algorithms for probabilistic
unification-based LR parsing. Disambiguation is the topic of two
papers: Hindle and Rooth describe an approach to prepositional phrase
attachment using an ingenious method for bootstrapping the statistics
from unlabeled training data, and Weischedel et al. tackle several
different disambiguation problems. Finally, the paper by Dunning
shows that the standard assumption of an underlying normal
distribution for frequencies of term co-occurrences is flawed when the
frequencies are small, and suggests using likelihood ratios with
underlying multinomial distributions as an alternative.
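
To make the last point concrete, here is a minimal sketch of a
log-likelihood ratio (G^2) score for a word pair, computed from a 2x2
table of co-occurrence counts. It is an illustration in the form the
statistic is commonly written, not code from the book or from Dunning's
paper; the function name and the argument names (k11, k12, k21, k22) are
my own assumptions.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of co-occurrence counts.

    Hypothetical argument layout (an assumption for this sketch):
      k11: times word A occurs with word B
      k12: times A occurs without B
      k21: times B occurs without A
      k22: times neither occurs
    """
    def xlogx_sum(*counts):
        # Sum of k * ln(k), treating 0 * ln(0) as 0.
        return sum(k * math.log(k) for k in counts if k > 0)

    n = k11 + k12 + k21 + k22
    cells = xlogx_sum(k11, k12, k21, k22)
    rows = xlogx_sum(k11 + k12, k21 + k22)
    cols = xlogx_sum(k11 + k21, k12 + k22)
    # Equivalent to 2 * sum over cells of k * ln(k * n / (row * col)).
    return 2.0 * (cells - rows - cols + xlogx_sum(n))


if __name__ == "__main__":
    # Counts that are exactly independent score (numerically) zero.
    print(log_likelihood_ratio(10, 10, 10, 10))
    # A pair that always co-occurs scores well above zero.
    print(log_likelihood_ratio(10, 0, 0, 10))
```

The score stays meaningful when some cells hold very small counts, which
is the case where a normal approximation is least trustworthy. Larger
values indicate stronger evidence that the two words are not independent.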

The papers in this collection are well-written, well-edited, and
provide useful background information. Most of the algorithms are
implemented and evaluated on real text collections. Many of the
papers address difficult issues, such as the smoothing of estimates to
account for small frequencies. However, a small point of criticism
can be made: the papers refer to the two-volume special issue, and
there is no acknowledgement of the transformation from journal to book
form, not even in the preface.

This collection should be required reading for anyone interested in
working in computational linguistics today. It should also be a
valuable reference for a graduate course in the field, although some
of the papers are technically challenging. The information access
researcher who wants to gain an understanding of current developments
in the field should definitely read the introductory chapter, and at
least leaf through the other chapters in order to get an impression of
current techniques and capabilities.