(no subject)

Ronald J. Flintham (flintham@paris7.jussieu.fr)
Sun, 13 Oct 1996 15:25:24 +0200

Robert LUK {csrluk@comp.polyu.edu.hk (Robert Luk (COMP staff))} said
recently (12th October 96):

When I was working in the Department of Chinese, Translation and
Linguistics at CityU Hong Kong, I had a model but seldom articulated it.
I felt I was being misunderstood as a self-made linguist and a computer
scientist or technician working in a language department. I thought that
one day I could write something about a model of linguistics as a
scientific study of language (which my colleagues told me is about spoken
language). Since I am no longer supported in that department, my desire in
this direction has dwindled a bit. Anyway, it is Saturday today, and it
seems nice to throw this informal model out to the public:

Def 1
Theory -> principles of explaining things (Newtonian mechanics - three laws)

Def 2
Model -> a simplified and formal description (a model of a projectile)

Linguistics is a scientific study of language rather than of spoken language,
because the way messages are generated and interpreted depends on the
context of communication. Speech in dialogue and speech in reading aloud are
very different. It follows that if linguistics were the scientific study of
spoken language only, it would preclude observing language production and
analysis in other environments, which might require the speaker and listener
to employ "different methods" of language production and analysis from
those of spoken language.

A scientific study demands a formal description (model) of the subject and
empirical data to support the formal description. If there are underlying
principles that explain the construction of various models, we call the set
of principles the theory of the particular subject under investigation. The
formal description makes our discussion more precise and helps us to
formulate hypotheses. To test a hypothesis, we need to sample the population
and test empirically whether the hypothesis is supported or not (i.e. the
balancing act between rationalist and empirical linguists).

A corpus is a sample S of the recorded messages that occurred in a language
community L (which can be a set of smaller communities). The underlying
population P is all the messages generated by L over all valid intervals of
time T (note language change). The corpus linguist is the observer who
records these messages (not necessarily in T).

+-<-----+            observation          +------+   analysis
|       | messages --->-------------------|Sample|-----------> Corpus Linguist
|       |            by sampling          +------+
+-- L --+                                    S

Figure 1: A model of linguistic investigation

Some of the literature addresses the problem of the balance or
representativeness of corpora. This is the problem of sampling (whether
bias is introduced or not). There is seldom work on justifying the sample
size, and there are different sampling strategies depending on the methods
and the a priori structure of L.
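
As a rough illustration of what justifying a sample size could look like,
here is a minimal Python sketch. It assumes something the text above does
not claim: that tokens are drawn by simple random sampling, so that the
relative frequency p of a word can be treated as a binomial proportion and
the normal approximation n = z^2 p (1 - p) / e^2 applies. The function name
and parameters are hypothetical, chosen for this example only.

```python
import math


def sample_size_for_proportion(p, margin, z=1.96):
    """Tokens needed to estimate a relative frequency p to within +/- margin.

    Assumption (not from the original post): simple random sampling and the
    normal approximation to the binomial; z = 1.96 corresponds to roughly
    95% confidence.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if margin <= 0.0:
        raise ValueError("margin must be positive")
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


if __name__ == "__main__":
    # A word expected about once per thousand tokens, estimated to within
    # +/- 10% of that frequency.
    print(sample_size_for_proportion(0.001, 0.0001))
```

Real text is not a simple random sample (words cluster within documents and
topics), so a figure obtained this way is better read as a lower bound than
as a recommendation.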

Our observations are not the actual messages. They are the different physical
realisations that carry these messages. For example, in speech, we
register the acoustic signal, not the actual message inside the head of the
speaker. Likewise, the message that is received may not be decoded as
intended! So, as in nuclear physics, we do not have the apparatus
in the experiment to get deeper or more precise than the phenomenon that
we are interested in. There is an inherent limit to our observations.

In Shannon's theory of communication, we take the messages generated by
members of L as a source. We also disregard the process of sampling for
simplicity. We then take messages to be the recorded signals (text or
acoustic speech), which become a stream of objects. Since Shannon's paper was
more about text, each object is drawn from a finite alphabet.
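
To make the "stream of objects from a finite alphabet" view concrete, the
following Python sketch estimates the entropy of such a stream in bits per
symbol. It is only an illustration under a strong simplifying assumption:
symbols are treated as independent draws (a unigram model), whereas
Shannon's paper also considers higher-order approximations. The function
name is hypothetical.

```python
import math
from collections import Counter


def unigram_entropy(stream):
    """Entropy in bits per symbol of a sequence over a finite alphabet.

    Assumption: symbols are independent, so only their relative frequencies
    matter. `stream` may be a string or any sequence of hashable objects.
    """
    counts = Counter(stream)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("stream must not be empty")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(unigram_entropy("abab"))           # two equally likely symbols
    print(unigram_entropy("to be or not"))   # a short character stream
```

The same function works at the word level if the stream is a list of word
tokens rather than characters, which is where the segmentation problem
discussed in this post comes in.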

In the above model, unlike Shannon's model, the messages are hidden. There is
a clear-cut boundary between messages for the speaker and, most of the time,
for the listener. But the observer does not even know where the boundaries
of messages are. This gives rise to the problem of segmentation, whether at
the sentence level or at the word level. For paragraphs, boundaries might be
more apparent, but in spoken interaction the boundaries may not be, and the
messages may even be formed during the interaction!

There could be multiple messages encoded in the observed signal. It would
be more complete to specify what type of messages we are interested in
during an investigation.

There is an underlying variability in language because the messages are
generated by a group of people in L. These messages are generated not from
some void but from the experience of the people, their thoughts, feelings,
etc., and these messages could affect their language production (e.g.
language learning, code mixing, etc.). These factors can be beyond the
control of the observer and therefore should be treated as underlying
variability. Variability also exists across different people using
different methods of generating and interpreting messages, as well as
different methods of realising messages as physical signals using different
physical systems. Despite this variability, people can still understand the
messages sent, which indicates that some structure is carried by the
physical signal. A statistical model provides a means to describe this
variability and structure at the same time. It is one type of model for the
scientific study of language.

Let's get specific: collocations of two words, following Church and Hanks

- Population P: all messages realised as text strings and produced by
  members of L
- Messages: Assume messages (semantics) are encoded as a sequence of words.
  The smallest message is a word. Messages are generated independently
  of each other, so that p(M_t, M_t+1) = p(M_t) . p(M_t+1)
- Sample S: Estimate the sample size needed
  Collect an unbiased sample
- Hypothesis testing:
  Ho: the pair (W_a, W_b) belongs to different messages because
      M_a => ... W_a and M_b => W_b ...
  H1: the pair (W_a, W_b) belongs to the same message because
      M_a => ... W_a W_b ...

This can be generalised to collocations within a bounded context.
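
A small Python sketch may help to show how the two hypotheses above can be
compared on data. It computes the association ratio log2 of
p(W_a, W_b) / (p(W_a) . p(W_b)), the mutual-information measure used by
Church and Hanks, for ordered word pairs within a bounded window. A score
near zero is what independence (Ho) would predict; a large positive score
points towards H1. The function name, the toy corpus and the choice to
normalise pair counts by the corpus size are assumptions made for this
example, and the score is a descriptive measure, not a significance test.

```python
import math
from collections import Counter


def association_ratio(tokens, window=1):
    """Score each ordered pair (a, b) where b follows a within `window` tokens.

    Returns a dict mapping (a, b) to log2(p(a, b) / (p(a) * p(b))), with all
    probabilities estimated as counts divided by the number of tokens
    (an assumption of this sketch).
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, a in enumerate(tokens):
        # Words that follow position i within the bounded context.
        for b in tokens[i + 1 : i + 1 + window]:
            pairs[(a, b)] += 1
    scores = {}
    for (a, b), f_ab in pairs.items():
        p_ab = f_ab / n
        p_a = unigrams[a] / n
        p_b = unigrams[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


if __name__ == "__main__":
    # Hypothetical toy corpus, far too small for real conclusions.
    toy = "strong tea strong tea weak coffee".split()
    for pair, score in sorted(association_ratio(toy).items()):
        print(pair, round(score, 3))
```

Setting window to a value larger than 1 gives the bounded-context
generalisation mentioned above. On a corpus this small the scores are
dominated by chance, which is one more reason the sample-size question
raised earlier matters.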

I regard this view as more akin to the later Wittgenstein's idea of
language games.
The earlier Wittgenstein regards language as a model.

Best,

Robert Luk
Dept. Computing
Hong Kong Polytechnic University

*****************************************************************************
It occurred to me that anyone interested in similar theoretical
preoccupations about the nature of language and "communication" might like
to read the work of the French linguist Antoine Culioli, in particular some
of the articles in English in "Pour une linguistique de l'énonciation",
Tome 1 (Ophrys, Paris, 1990): for example "Representation, referential
processes and regulation: language activity as form production and
recognition".
Ronald Flintham.