Corpora: New book in French : les linguistiques de corpus

Adeline Nazarenko (adeline.nazarenko@lipn.univ-paris13.fr)
Thu, 15 Jan 1998 13:05:12 GMT

Apologies to those of you who receive multiple copies

New book
........

LES LINGUISTIQUES DE CORPUS

Benoît Habert
Adeline Nazarenko
André Salem

Armand Colin/Masson, Paris 1997, 240p.
ISBN 2-200-01775-8

This book presents a wide range of recent work in the field of automatic
text processing. It describes the main types of computerized resources that
are currently available: text corpora with morphological, syntactic or
semantic annotations, dictionary-based resources, and procedures for
automatically or semi-automatically enriching the texts gathered in a
corpus. The combined use of these resources is illustrated by examples taken
from actual research conducted in very different fields. Beyond linguists
and the natural language processing community, the book is of interest to
lexicographers, specialists in language teaching, content analysts ... as
well as all those whose work involves the study of language, discourse or
texts.

METHODOLOGY
===========

The book is divided into three sections. The first focuses on tagged corpora
and other available textual resources. The second deals with other aspects
of corpus linguistics: the study of meaning, diachrony and aligned texts.
The book ends with methodological and technical issues, the former being
more abstract and the latter more short-lived.

The numerous bibliographical references bear witness to the intensive
research and development activity centered around electronic corpora. They
include transcripts of lectures and even technical reports.

1. Tagged corpora and their use
-------------------------------

The first chapter deals with tagged corpora, in which morphosyntactic tags
are associated with words. Chapter II concerns parsed corpora (treebanks),
in which sentences are enriched with syntactic representations.

Within each of these chapters, we first briefly present the type of
annotation in question. The corpora presented at the end of that
introduction supply the examples, which we reproduce with their existing
annotations, awkward as these may be. At the same time, we try to give a
unified representation for each type of annotation, so that the formats
actually used, of which there is a great variety, can be compared. Indeed,
differences in notation often obscure the real divergences or convergences.
Second, we develop some examples of linguistic research that was made
possible by that type of annotation and that seems particularly promising.
Through these examples, we want to show in a straightforward manner what the
different types of annotation can contribute, without letting technical
problems obscure what is at stake.

Chapter III describes other important textual resources, namely lexical
resources in electronic form.

2. Cross-cutting dimensions
---------------------------

The fourth chapter, devoted to semantic approaches, shows how to extract
lexicographical knowledge and how to disambiguate the meaning of words in
context.

Chapter V explains how corpora can be used from a diachronic perspective,
over long or, on the contrary, short periods of time. It points out the
difficulties involved in building historical corpora and the methodological
precautions that are necessary when using them.

Chapter VI describes aligned texts, i.e. texts written in one language and
presented in parallel with their translations into one or more other
languages.

3. Methodologies and techniques
-------------------------------

The last section groups together methodological questions and technical
information.

Familiarity with corpus-based studies first helps one to better grasp what
is at stake in building a corpus and the methodological choices it requires,
in particular concerning the standards used to facilitate the exchange and
reuse of textual data (SGML, TEI). That is the main subject of Chapter VII.

Trying to avoid obscure jargon, and well aware that this is probably the
area where change is fastest and hardest to anticipate, we present in
Chapter VIII techniques for tagging and syntactic analysis, for semantic
annotation, and for cleaning and segmenting textual data.

Chapter IX briefly presents the quantification of linguistic facts.

AUTHORS
=======

Benoît HABERT
ENS de Fontenay St Cloud
31 avenue Lombart
F-92260 Fontenay-aux-Roses
bh@ens-fcl.fr

Adeline NAZARENKO
LIPN
Institut Galilée
Université Paris-Nord
avenue Jean-Baptiste Clément
F-93430 Villetaneuse
nazarenko@lipn.univ-paris13.fr

André SALEM
ILPGA
Sorbonne nouvelle - Paris 3
19 rue des Bernardins
F-75005 Paris
salem@msh-paris.fr

TO ORDER
========

Send your order to
Armand Colin
BP 130
F-75223 Paris Cedex 05
tel. (33) 01 40 46 60 59
fax. (33) 01 40 46 60 19

Include the following information:
Title    Les linguistiques de corpus
Authors  Benoît Habert, Adeline Nazarenko, André Salem
ISBN     2-200-01775-8
Price    125,00 FF (taxes included)

Enclose a check made payable to Masson. For shipping and handling, please
add:
domestic: 20 FF for the first volume
          10 FF for each additional copy
others:   30 FF for the first volume
          10 FF for each additional copy