Re: Corpora: T-score in collocational analysis

Jem Clear (jem@cobuild.collins.co.uk)
Sun, 12 Dec 1999 17:19:23 GMT

Let me put Tony Berber Sardinha and Gordon and Pam Cain out
of their misery! The URL that Gordon and Pam Cain referred to
won't bring up the bit of text they were hoping to find. But
I reproduce it here below in its entirety. It's very short
and it was written off the top of my head as an aid to
people who use the CobuildDirect corpus facilities and who
used to email me asking "Er.... what are those MI and t-score
numbers that appear on the screen when I ask for collocations?"

None of what follows is my own: it is a potted summary of what I
understood from Ken Church, from whom I got the t-score formula.
Ken Church and Bill Gale (et al) used this statistic over a decade
ago and published papers on its use. Annoyingly (in this new
cyber world) I routinely refer people to

Church, K.W., W. Gale, P.W. Hanks, D. Hindle, "Using Statistics in
Lexical Analysis", in Uri Zernik (ed.), Lexical Acquisition:
Using On-line Resources to Build a Lexicon. Hillsdale:
Lawrence Erlbaum, 1991.

and I routinely get a reply that goes like "Oh, I can't seem to find
this paper -- do you know anywhere else I can read up about this
MI and t-score stuff?". Partly for this reason I wrote the
lightweight (and possibly inaccurate) "quick guide" below. I am
*not* a statistician.

Cheers

Jem Clear
Electronic Development Director    phone: +44 (0)121-414-3926
Collins Dictionaries               fax:   +44 (0)121-414-6203
Westmere, 50 Edgbaston Park Road   email: jem@cobuild.collins.co.uk
Birmingham, B15 2RX, UK            WWW:   www.cobuild.collins.co.uk

----------

The two statistical measures of significance which are used by the
collocations feature of the CobuildDirect service are explained below
in layman's terms. It is not really possible to explain the complete
statistical background to the use of Mutual Information and t-scores
here.

The output you see will be in four columns. The first column lists
each collocate. The second column shows the total independent
frequency of that collocate in the corpus. The third column shows the
frequency with which the node and the collocate appear together
(i.e. within the specified span) in the corpus. The fourth column
shows the statistical significance score (either Mutual Information or
t-score as selected by the user). The collocates are listed in
descending order of significance.

-----------------------

Let us work through some example data (taken from a 20m word corpus)
for the word "post".

It co-occurs with many words, among which are "the", "office"
and "mortem".

The observable facts are that "post" has an overall corpus freq of
2579 (let's refer to this as f(post)=2579) and also

f(office) = 5237
f(the) = 1019262
f(mortem) = 51

We also observe the number of times these words co-occurred with
"post" (for shorthand I'll write j(the) = 1583 to mean that "the"
occurred with "post" 1583 times: this is the "joint" frequency). So

j(the) = 1583
j(office) = 297
j(mortem) = 51

Now if we were to list the collocates of "post" by raw frequency of
co-occurrence we would order them according to j(x), as above. Of
course, a full collocation listing of "post" in this form would have
many other words with intermediate frequencies -- we are just
focussing on these three words for the moment. But the ordering shown
above doesn't tell us anything much about the strength of association
between "post" and these other words: it is simply a reflection of the
basic overall frequency of the collocating words (i.e. "the" is much
more frequent than "office" which is much more frequent than
"mortem"). We just showed that in the f(x) list! This is true in
general: ordering collocates by j(x) simply places words like "the",
"a", "of", "to" at the top of every collocate list. What we would like
to know is

------------------------------------------------------------------------
IMPORTANT QUESTION: to what extent does the word "post" condition
its lexical environment by selecting particular words with which it
will co-occur?
------------------------------------------------------------------------

We can compare the relative frequencies of what we observed with what
we would expect under the null hypothesis:

------------------------------------------------------------------------
NULL HYPOTHESIS: the word "post" has no effect whatsoever on its
lexical environment and the frequencies of words surrounding "post"
will be exactly (give or take random fluctuation) the same as they
would be if "post" were present or not.
------------------------------------------------------------------------

That is, if "the" has an overall relative frequency of 1 in 20 (about
1m occurrences in a 20m word corpus -- see f(the) above) then we can
expect "the" to occur with the same relative frequency in a subset of
the corpus which is the 4 words either side of "post": hence under the
null hypothesis we would expect j(the) to be

( f(post) * span ) * relative_freq(the)

which is

(2579 * 8) * (1 / 20) = 20632 / 20 = 1031
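
The sum above can be checked with a few lines of Python. This is only
a sketch: the function name expected_joint and its parameter names are
made up for illustration, and the span of 8 assumes 4 words either
side of the node, as in the text.

```python
# Illustrative sketch; names are hypothetical, figures are from the text.
CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of the node


def expected_joint(f_node, f_collocate, span=SPAN, corpus_size=CORPUS_SIZE):
    """Expected joint frequency under the null hypothesis."""
    window_positions = f_node * span            # slots around the node word
    relative_freq = f_collocate / corpus_size   # overall relative frequency
    return window_positions * relative_freq


# Using the rounded relative frequency of 1 in 20 for "the":
rounded = expected_joint(2579, 1_000_000)
# Using the exact count f(the) = 1019262 instead:
exact = expected_joint(2579, 1_019_262)
print(round(rounded, 1), round(exact, 1))
```

With the rounded 1-in-20 figure this comes to about 1031.6, which the
text truncates to 1031; the exact f(the) gives roughly 1051. Either
way the observed 1583 is clearly higher.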

So under the null hypothesis we would expect j(the) to be 1031. We
actually observed j(the) to be 1583, which is rather higher, and we
could simply express the difference as a ratio (of observed to expected
joint frequency) thus:

1583/1031

This ratio is what the Mutual Information score is based on (MI is
usually reported as its base-2 logarithm). It expresses the extent to
which observed frequency of co-occurrence differs from expected (where
we mean "expected under the null hypothesis"). Of course, big
differences indicate massive divergence from the null hypothesis and
indicate that "post" is exerting a strong influence over its lexical
environment.
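
As a rough illustration, the ratio for "the" can be computed as
follows. Mutual Information is conventionally reported as the base-2
logarithm of the observed/expected ratio, whereas the guide works
with the plain ratio, so both are shown. The function names are
invented for this sketch.

```python
import math


def oe_ratio(observed, expected):
    """Ratio of observed to expected joint frequency."""
    return observed / expected


def mi_score(observed, expected):
    """Base-2 log of the ratio (the usual way MI is reported)."""
    return math.log2(observed / expected)


# j(the) = 1583 against the expected value of 1031 worked out in the text
ratio = oe_ratio(1583, 1031)
mi = mi_score(1583, 1031)
print(f"ratio = {ratio:.2f}, MI = {mi:.2f}")
```

A ratio of about 1.5 means "the" turns up near "post" only half as
often again as chance would predict, which is a weak association.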

BUT BUT BUT! there is a Big Problem with Mutual Information: suppose
the word "egregious" appears just once with "post" (not an
unreasonable event) in the corpus. And "egregious" may have a very low
overall freq:

f(egregious) = 3

Now we carry out the sums to calculate the expected j(egregious)
figure. I can assure you it will be a small number! It is:

( f(post) * span ) * relative_freq(egregious)

(2579 * 8) * ( 3 / 20000000)

= 0.0030948

Now you'll see that even if "egregious" occurs just once in the
vicinity of "post" the observed j(egregious) will be 323 times more
than the expected joint frequency, and the mutual information value
will be high. Common sense tells us that since words cannot appear
0.0030948 times -- they either occur zero or one times, nothing in
between -- that claiming that "post"+"egregious" is a significant
collocation is rather dubious.
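
The instability is easy to see in a small sketch. The variable names
are hypothetical; the figures are those given above, and the observed
counts of 0, 1 and 2 are simply tried in turn to show how far the
ratio swings on a single extra occurrence.

```python
CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of the node

f_post = 2579
f_egregious = 3

# Expected joint frequency under the null hypothesis
expected = (f_post * SPAN) * (f_egregious / CORPUS_SIZE)

# Observed/expected ratio for a few tiny observed counts
ratios = {observed: observed / expected for observed in (0, 1, 2)}
for observed, ratio in ratios.items():
    print(f"j(egregious) = {observed}: ratio = {ratio:.0f}")
```

One co-occurrence gives a ratio of about 323 and two give about 646,
while none gives zero: the score is driven almost entirely by chance
at these counts.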

In general, the comparison of observed j(x) and expected j(x) will be
very unreliable when values of j(x) are low; this is common sense,
too. Just because I've seen these two words together once in 20m words
doesn't give me much confidence that they are strongly associated: I'd
need to see them together several times at least before I could start
to feel at all secure in claiming that they have some sort of
significant association.

Now here comes T-score. We can calculate a second-order statistic
which is, crudely, this:

------------------------------------------------------------------------
IMPORTANT QUESTION: how confident can I be that the association that
I've measured between "post" and "egregious" is true and not due to
the vagaries of chance?
------------------------------------------------------------------------

T-score answers this question. It takes account of the size of j(x)
and weights its value accordingly. A high T-score says: it is safe
(very safe/pretty safe/extremely secure etc according to value) to
claim that there is some non-random association between these two
words.

So t-scores are higher when the figure j(x) is higher. In the case of
"egregious" we would get a very low t-score. In the case of "the" the
t-score might be quite high, but not huge because "the" doesn't have
that strong an association with "post". "office" gets a really high
t-score because not only is the observed j(office) way higher than
expected, but we have seen a goodly number of such co-occurrences, enough
to be pretty damn sure that this can't be due to some freak of chance.
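
The guide does not spell out the t-score formula, so the sketch below
ASSUMES a commonly used approximation, (observed - expected) divided
by the square root of observed. Treat it as an illustration of the
behaviour described above, not as the exact CobuildDirect
calculation. All names are invented, and j(egregious) = 1 is the
hypothetical case from the text.

```python
import math

CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of the node
F_NODE = 2579             # f(post)


def expected_joint(f_collocate):
    """Expected joint frequency under the null hypothesis."""
    return (F_NODE * SPAN) * (f_collocate / CORPUS_SIZE)


def t_score(observed, expected):
    # ASSUMPTION: (O - E) / sqrt(O); the formula is not given in the text.
    return (observed - expected) / math.sqrt(observed)


# collocate: (f(x), j(x))
data = {"the": (1019262, 1583), "office": (5237, 297), "egregious": (3, 1)}
t = {w: t_score(j, expected_joint(f)) for w, (f, j) in data.items()}
for word, score in t.items():
    print(f"{word}: t = {score:.1f}")
```

Under this assumed formula "office" scores highest (around 17), "the"
comes out somewhat lower, and "egregious" stays just below 1, which
matches the ordering described above.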

In practical terms, raw frequency or j(x) won't tell you much at all
about collocation: you'll simply discover what you already knew, that
"the" is a *very* frequent word and seems to co-occur with just about
everything. MI is the proper measure of strength of association: if
the MI score is high, then observed j(x) is massively greater than
expected, BUT you've got to watch out for the low j(x) frequencies
because these are very likely to be freaks of chance, not consistent
trends. t-score is best of the lot, because it highlights those
collocations where j(x) is high enough not to be unreliable and where
the strength of association is distinctly measurable.

Try the different measures: you'll soon see the difference. Raw freq
often picks out the obvious collocates ("post office" "side effect")
but you have no way of distinguishing these objectively from
frequent non-collocations (like "the effect" "an effect" "effect is"
"effect it" etc). MI will highlight the technical terms, oddities,
weirdos, totally fixed phrases, etc ("post mortem" "Laurens van der
Post" "post-menopausal" "prepaid post"/"post prepaid" "post-grad").
T-score will get you significant collocates which have occurred
frequently ("post office" "Washington Post" "post-war", "by post"
"the post").

If a collocate appears in the top of both MI and t-score lists it is
clearly a humdinger of a collocate, rock-solid, typical, frequent,
strongly associated with its node word, recurrent, reliable, etc etc
etc.
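
To see the three orderings side by side, here is a small sketch using
the figures quoted earlier for "post". It ASSUMES MI is the base-2 log
of observed over expected and that the t-score is
(observed - expected) / sqrt(observed); neither formula is stated in
the guide, and all names are hypothetical. The entry for "egregious"
uses the imagined single co-occurrence.

```python
import math

CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of the node
F_NODE = 2579             # f(post)

# collocate: (f(x), j(x))
collocates = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),  # hypothetical single co-occurrence
}


def measures(f, j):
    """Raw frequency, MI and t-score for one collocate (assumed formulas)."""
    expected = (F_NODE * SPAN) * (f / CORPUS_SIZE)
    return {
        "raw": j,
        "mi": math.log2(j / expected),
        "t": (j - expected) / math.sqrt(j),
    }


scores = {w: measures(f, j) for w, (f, j) in collocates.items()}


def ranked(measure):
    """Collocates in descending order of the chosen measure."""
    return sorted(scores, key=lambda w: scores[w][measure], reverse=True)


for m in ("raw", "mi", "t"):
    print(m, ranked(m))
```

On these four words, raw frequency puts "the" first, MI puts "mortem"
and the one-off "egregious" first, and the t-score puts "office"
first.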

Jem Clear
June 1995