Re: Corpus theory

Robert Luk (csrluk@comp.polyu.edu.hk)
Sat, 12 Oct 1996 11:13:34 +0800

> I should read some literature on the philosophy or theory of
> corpus linguistics for my thesis. However, most publications
> seem to concern merely the more practical or methodological
> aspects of corpus linguistics. Could anyone recommend me some
> books, articles or papers on the philosophy of corpus linguistics?

There is a lot of work on statistical approaches in corpus linguistics,
such as the extraction of collocations or n-grams, and I couldn't help thinking that
there should be a probabilistic theory or something similar behind it. As someone
has remarked, the corpus is a sample, and there is a statistical
theory of sampling.

When I was working in the Department of Chinese, Translation and
Linguistics at CityU Hong Kong, I had a model but
seldom articulated it. I felt I was being misunderstood as a self-made
linguist and a computer scientist or technician working in a language
department. I was thinking that one day I could write something about a model
of linguistics as a scientific study of language (which my colleagues told me
is about spoken language). Since I am no longer supported in that
department, my desire in this direction has dwindled a bit. Anyway, it is Saturday today,
and it seems nice to throw this informal model out to the public:

Def 1
Theory -> principles for explaining things (Newtonian mechanics - three laws)

Def 2
Model -> a simplified and formal description (a model of a projectile)

Linguistics is a scientific study of language rather than of spoken language only, because the
way messages get generated and interpreted depends on the context of communication. Speech in dialogue and speech in reading aloud are
very different. It follows that if linguistics is the scientific study of spoken language, it precludes observing language production and analysis in other environments, which might require the speaker and listener to employ "different methods" of production and analysis from those used for spoken language.

A scientific study demands a formal description (model) of the subject and empirical data to
support the formal description. If there are underlying principles that explain the construction of various models, we call that set of principles the theory of the subject under investigation. The formal description makes our discussion more precise and helps us to formulate
hypotheses. To test a hypothesis, we need to sample the population and check empirically
whether the hypothesis is supported or not (i.e. a balancing act between rationalist and empiricist linguists).

A corpus is a sample S of the recorded messages that occurred in a language community L
(which can be a set of smaller communities). The underlying population P is all the messages generated by L over all valid intervals of time T (note that language changes). The corpus
linguist is the observer who records these messages (not necessarily within T).

+-<-----+             observation    +------+  analysis
|       | messages --->--------------|Sample|-----------> Corpus Linguist
|       |             by sampling    +------+
+-- L --+                               S
Figure 1: A model of linguistic investigation

Some of the literature addresses the problem of the balance or representativeness of
corpora. This is the problem of sampling (whether or not bias is introduced).
There is seldom work on justifying the sample size, and there are different
sampling strategies depending on the methods and the a priori structure of L.
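As a rough illustration of what justifying a sample size could look like, here is a minimal Python sketch. It is not taken from any corpus study. It assumes that each token is an independent draw (a binomial model, which real text violates because words cluster), and it uses the usual normal approximation for estimating a proportion. The function name `tokens_needed` and its parameters are made up for this example.

```python
import math
from statistics import NormalDist


def tokens_needed(p, rel_error=0.1, confidence=0.95):
    """Tokens needed to estimate a relative frequency near p to within
    +/- rel_error * p at the given confidence level.

    Assumption: tokens are independent draws, so the count of a word
    is binomial. Real text is burstier, so treat this as optimistic.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # n = z^2 * p * (1 - p) / e^2 with absolute error e = rel_error * p.
    return math.ceil(z * z * (1 - p) / (p * rel_error ** 2))


# A word occurring about 10 times per million tokens.
print(tokens_needed(1e-5))
```

Under these assumptions the required size grows roughly as 1/p, so rare words (and rare word pairs even more so) drive the size of the corpus.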

Our observations are not the actual messages. They are the different physical
realisations of these messages that carry them. For example, in speech, we
register the acoustic signal not the actual message inside the head of the
speaker. Likewise, the message that gets received may not be decoded as the
one that was intended! So, as in nuclear physics, we do not have the apparatus
in the experiment to get deeper or more precise than the phenomenon that
we are interested in. There is an inherent limit in our observations.

In Shannon's theory of communication, we take the messages generated by
members in L as a source. We also disregard the process of sampling for
simplification. We then take messages to be the recorded signals (text or
acoustic speech), which become a stream of objects. Since Shannon's paper was
more about text, each object is drawn from a finite alphabet.
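To make the "stream of objects from a finite alphabet" view concrete, here is a small Python sketch that estimates the zeroth-order (unigram) entropy of such a stream from its symbol frequencies. It is only an illustration: the function name `unigram_entropy` is made up, the symbols are taken to be characters, and treating the symbols as independent ignores all the context that a fuller source model would use.

```python
import math
from collections import Counter


def unigram_entropy(symbols):
    """Entropy in bits per symbol, estimated from relative frequencies.

    Assumption: symbols are treated as independent draws from a
    finite alphabet (no context is modelled).
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(unigram_entropy("abab"))   # two equally likely symbols
print(unigram_entropy("the corpus is a sample"))
```

The same function works on a list of words instead of a string of characters, which amounts to choosing a different alphabet for the same recorded signal.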

In the above model, unlike in Shannon's model, the messages are hidden. The boundaries
of messages are clear-cut to the speaker and, most of the time, to the listener.
But the observer does not even know where the boundaries of messages are. This
gives rise to the problem of segmentation, whether at the sentence level or
at the word level. For paragraphs, boundaries might be more apparent, but in spoken
interaction the boundaries may not be, and the messages may even get formed during
the interactions!

There could be multiple messages encoded into the observed signal. It would
be more complete to specify what type of messages we are interested in during an
investigation.

There is an underlying variability in language because the messages are generated
by a group of people in L. These messages are generated not from some void but
from the experience of the people, their thoughts, feelings, etc., and these messages
could in turn affect their language production (e.g. language learning, code mixing, etc.).
These factors can be outside the control of the observation and should therefore be
treated as underlying variability. Variability also exists across different
people using different methods of generating and interpreting messages, as well as
different methods of realising messages as physical signals using different physical
systems. Despite these variabilities, people can still understand the messages sent, indicating
that some structure is carried by the physical signal. A statistical model provides a means
of describing this variability and structure at the same time. It is one type of
model for the scientific study of language.

Let's get specific: collocations of two words, after Church and Hanks

- Population P: all messages realised as text strings and produced by members of L
- Messages: assume messages (semantics) are encoded as sequences of words. The smallest
  message is a word. Messages are generated independently of each
  other, so that p(M_t, M_{t+1}) = p(M_t) . p(M_{t+1})
- Sample S: estimate the sample size needed;
  collect an unbiased sample
- Hypothesis testing:
    H0: the pair (W_a, W_b) spans two different messages, because
        M_a => ... W_a and M_b => W_b ...
    H1: the pair (W_a, W_b) belongs to the same message, because M_a => ... W_a W_b ...

This can be generalised to collocations within a bounded context.
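The comparison between the two hypotheses can be sketched with the association ratio that Church and Hanks use, log2 of p(W_a, W_b) / (p(W_a) . p(W_b)). If the two words were generated independently, the ratio would be near 0, and a clearly positive value is evidence that they co-occur more often than independence predicts. The Python below is a simplified illustration rather than their exact procedure: the function name, the `span` parameter (how many following tokens count as the bounded context) and the toy token list are assumptions made for this example, and probabilities are plain relative frequencies with no correction for the span width.

```python
import math
from collections import Counter


def association_ratios(tokens, span=1):
    """Map each ordered pair (a, b), where b occurs within `span`
    tokens after a, to log2(p(a, b) / (p(a) * p(b))).

    span=1 means adjacent words only; a larger span gives a wider
    bounded context.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + span]:
            pair_counts[(a, b)] += 1

    scores = {}
    for (a, b), f_ab in pair_counts.items():
        p_ab = f_ab / n                 # observed joint relative frequency
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy sample, far too small for any real conclusion.
tokens = "strong tea strong tea weak coffee strong tea".split()
scores = association_ratios(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Scores based on very small counts are unstable, so in practice one would discard pairs below some frequency threshold before interpreting them, and a formal test would still be needed to decide between the two hypotheses.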

I regard this view as more akin to the later Wittgenstein's idea of language games.
The earlier Wittgenstein regards language as a model.

Best,

Robert Luk
Dept. Computing
Hong Kong Polytechnic University