Corpora: Book: NLP Using Very Large Corpora

Jean Veronis (Jean.Veronis@newsup.univ-mrs.fr)
Sun, 14 Nov 1999 11:40:47 +0100

**** NEW BOOK *** NEW BOOK *** NEW BOOK *** NEW BOOK *** NEW BOOK ****

KLUWER ACADEMIC PUBLISHERS
TEXT, SPEECH AND LANGUAGE TECHNOLOGY
Volume 11
Series editors: Nancy Ide and Jean Véronis

Natural Language Processing Using Very Large Corpora

edited by

Susan Armstrong
Kenneth Ward Church
Pierre Isabelle
Sandra Manzi
Evelyne Tzoukermann
David Yarowsky

The 1990s have been an exciting time for researchers working with large
collections of text. Text is available like never before. It was not all
that long ago that researchers referred to the Brown Corpus as a `large'
corpus. The Brown Corpus, a `mere' million words collected at Brown
University in the 1960s, is about the same size as a dozen novels, the
complete works of William Shakespeare, the Bible, a collegiate dictionary
or a week of a newswire service. Today, one can easily surf the web and
download millions of words in no time at all.

What can we do with all this data? It is better to do something simple than
nothing at all. Researchers working with large corpora are using essentially
brute-force methods to make progress on some of the hardest problems in natural
language processing, including part-of-speech tagging, word sense
disambiguation, parsing, machine translation, information retrieval, and
discourse analysis. They are overcoming the so-called knowledge-acquisition
bottleneck by processing vast quantities of data, more text than anyone
could possibly read in a lifetime, and estimating all sorts of `central and
typical' facts that any speaker of the language would be expected to know,
e.g. word frequencies, word associations and typical predicate-argument
relations.

Much of this work has been reported at a series of annual meetings, known
as the Workshop on Very Large Corpora (WVLC) and related meetings sponsored
by ACL/SIGDAT (Association for Computational Linguistics' special interest
group on data). Subsequent meetings have been held in Asia (1994, 1997),
America (1995, 1996, 1997) and Europe (1995, 1996). The papers in this book
represent much of the best of the first three years of this
workshop/conference as selected by a competitive review process.

Kluwer Academic Publishers, Dordrecht

Hardbound
ISBN 0-7923-6055-9
November 1999
324 pp.
NLG 240.00 / USD 128.00 / GBP 79.00

---------------------------------------------------------------------
Contents

Introduction.

Implementation and Evaluation of a German HMM for POS Disambiguation; H.
Feldweg.

Improvements in Part-of-Speech Tagging with an Application to German; H.
Schmid.

Unsupervised Learning of Disambiguation Rules for Part-of-Speech Tagging;
E. Brill, M. Pop.

Tagging French without Lexical Probabilities - Combining Linguistic
Knowledge and Statistical Learning; E. Tzoukermann, et al.

Example-Based Sense Tagging of Running Chinese Text; X. Tong, et al.

Disambiguating Noun Groupings with Respect to WordNet Senses; P. Resnik.

A Comparison of Corpus-based Techniques for Restoring Accents in Spanish
and French Text; D. Yarowsky.

Beyond Word N-Grams; F. Pereira, et al.

Statistical Augmentation of a Chinese Machine-Readable Dictionary; P. Fung,
D. Wu.

Text Chunking Using Transformation-based Learning; L. Ramshaw, M.P. Marcus.

Prepositional Phrase Attachment through a Backed-off Model; M. Collins, J.
Brooks.

On the Unsupervised Induction of Phrase-Structure Grammars; C. de Marcken.

Robust Bilingual Word Alignment for Machine Aided Translation; I. Dagan, et
al.

Iterative Alignment of Syntactic Structures for a Bilingual Corpus; R.
Grishman.

Trainable Coarse Bilingual Grammars for Parallel Text Bracketing; D. Wu.

Comparative Discourse Analysis of Parallel Texts; P. van der Eijk.

Comparing the Retrieval Performance of English and Japanese Text Databases;
H. Fujii, W.B. Croft.

Inverse Document Frequency (IDF): A Measure of Deviations from Poisson; K.
Church, W. Gale.

List of Authors.

Subject Index.

---------------------------------------------------------------------

PREVIOUS VOLUMES

Volume 1: Recent Advances in Parsing Technology
Harry Bunt, Masaru Tomita (Eds.)
Hardbound, ISBN 0-7923-4152-X, 1996

Volume 2: Corpus-Based Methods in Language and Speech Processing
Steve Young, Gerrit Bloothooft (Eds.)
Hardbound, ISBN 0-7923-4463-4, 1997

Volume 3: An Introduction to Text-to-Speech Synthesis
Thierry Dutoit
Hardbound, ISBN 0-7923-4498-7, 1997

Volume 4: Exploring Textual Data
Ludovic Lebart, André Salem and Lisette Berry
Hardbound, ISBN 0-7923-4840-0, December 1997

Volume 5: Time Map Phonology: Finite State Models and Event Logics in
Speech Recognition
Julie Carson-Berndsen
Hardbound, ISBN 0-7923-4883-4, 1997

Volume 6: Predicative Forms in Natural Language and in Lexical Knowledge
Bases
Patrick Saint-Dizier (Ed.)
Hardbound, ISBN 0-7923-5499-0, December 1998

Volume 7: Natural Language Information Retrieval
Tomek Strzalkowski (Ed.)
Hardbound, ISBN 0-7923-5685-3, April 1999

Volume 8: Techniques in Speech Acoustics
Jonathan Harrington, Steve Cassidy
Hardbound, ISBN 0-7923-5731-0, July 1999

Volume 9: Syntactic Wordclass Tagging
Hans van Halteren (Ed.)
Hardbound, ISBN 0-7923-5896-1, August 1999

Check the series Web page for order information:

http://www.wkap.nl/series.htm/TLTB

---------------------------------------------------------------------