Corpora: Summary: Corpus Linguistics and NLP

Tony Berber Sardinha (tony4@uol.com.br)
Wed, 9 Dec 1998 10:11:17 -0200

Hi,

below is a summary of the responses to my query about the relationship
between corpus linguistics, NLP, and humanities computing.

Thanks to everyone who responded.

Cheers
tony.

-- Query:
I am looking for references describing the differences and similarities
between the kinds of research carried out under the label 'Corpus
Linguistics' and research being done in other related fields such as NLP
and Humanities Computing. Questions one might ask about this: As these
areas co-exist, what are the characteristics that set them apart? Which
historical developments if any led to the establishment of these separate
areas? Are they going to merge and become one? Will there come a time when
there will be no more Corpus Linguistics as all linguistic research will be
corpus-based?

-- Replies:

****** Geoffrey Sampson (http://www.grs.u-net.com):

In a review of a book by Yngve which I contributed recently to
_Computational Linguistics_, I argued (what I believe to be true) that
"corpus linguistics" came to be an identified special subfield because
of the strange hostility of Noam Chomsky, and of those whom he influenced,
towards empirical data, and their belief that linguistic description can
and should be founded on speakers' "intuitions". Because Chomsky became
so massively influential, linguists who still wanted to work with
empirical data had to invent a "special-interest" title to shelter under;
but in the longer term it does not seem to me natural for "corpus
linguistics" to be separate from linguistics in the wider sense.

****** Marco Antonio da Rocha (marcor@cce.ufsc.br):

I would say the ultimate research goals set them apart. Researchers in
linguistics are interested in investigating language phenomena per se,
often, as one may easily notice in theoretical linguistics approaches,
with little regard for everyday language. Corpus linguistics is certainly
different from theoretical linguistics in that respect, but still research
goals do not usually include ways of solving specific processing problems
in machines.
(...)
I have more trouble specifying what I understand by "Computers and the
Humanities", to use the name of the periodical, but I would generally
expect a larger number of papers dealing with literary theory and
literary analysis based on the computational analysis of texts, authorship
investigations, including forensic linguistics, and a somewhat broader
variety of subjects within the so-called humanities, including, for
instance, anthropology and cultural studies (see McEnery and Wilson,
Corpus Linguistics, for a discussion on applications in various fields of
research).

I would summarise these differences by saying that corpus linguistics
provides the basic principles and techniques for the investigation of
language facts using corpora. In NLP, these principles and techniques are
used to solve problems for the processing of natural language in machines.
The fact that NLP may get rather theoretical and thus contribute to the
formulation of principles does not change the fact that the study of
language still belongs in linguistics and not quite in NLP.

As to the use of corpus-based approaches to humanities, it seems even
clearer to me that this is an application of principles and techniques
developed within corpus linguistics, as these approaches analyse basically
the use of language in texts of any kind to support investigations with
implications in areas other than linguistics stricto sensu.

> Which historical developments if any led to the establishment of these
> separate areas?

A precise answer to this question would require a more thorough historical
survey, but, looking, for instance, at the seminal work developed by
Francis and Kucera, again it appears to be the work of linguists who
happened to be heavy users of computers (that famous joke with Kucera), so
much so that IBM had to support them with machines and information at that
time. NLP, for a long time, was mainly logic- or rule-based, although the
speech recognition people were already into Hidden Markov Models because
of the very nature of their work, which gave them little choice (see
Sampson, G. Evolutionary Natural Language Processing, for an interesting
discussion on that; see also Francis, Corpus Linguistics B.C., in
Svartvik, Directions in Corpus Linguistics, for some history).
(...)
Therefore, in spite of a number of intersections and gray areas, corpus
linguistics is concerned with a core of principles and techniques to
analyse language scientifically using the computer, whereas corpus-based
approaches to NLP use these principles and techniques to solve natural
language processing problems in computers, often reaching conclusions
which are useful for further development of principles and techniques. The
intensive use of computers in corpus linguistics reinforces a tendency
towards merger, but the research goals still seem different to me. The use of
these approaches in a broad range of applications, including for the
purpose of producing theory within each branch of humanities concerned,
could then be grouped under the "computers and the humanities" title.
(...)

****** Lou Burnard (http://users.ox.ac.uk/~lou):

Funnily enough, I gave a presentation on this very topic at a recent
conference in Norway, but I haven't written it up yet. Maybe I should.

The gist of my argument was that Humanities Computing is not a whole
lot more than a ragbag of tricks and techniques without a recognition
of the importance of modelling and abstraction techniques. And that
corpus linguistics in particular enables and facilitates the
development of better models of language use.

I also quoted Michael Hoey's splendid remark at the end of his
presentation at TALC: "corpus linguistics is not a branch of
linguistics, but the route into linguistics".

-------------------------------
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
tony4@uol.com.br
http://sites.uol.com.br/tony4/homepage.html
http://homepages.infoseek.com/~corpuslinguistics/homepage.html
-------------------------------