Re: Corpora: MWUs and frequency; try Relative Frequency

Andrew Harley (aharley@cup.cam.ac.uk)
Mon, 12 Oct 1998 11:11:55 +0100

At 02:15 PM 09/10/1998 -0700, Ted Dunning wrote:
>What Ms. Jones is advocating is an association ratio. The log of this
>quantity has (misleadingly) been referred to in the literature as
>mutual information. Church and Hanks investigated this measure and
>found that using it directly to investigate word association was
>unsatisfactory. Much of the problem with association ratios is that
>rare words which appear in proximity are given anomalously large
>scores.

I am interested to hear that "mutual information" is the wrong term for
this. What statistic would "mutual information" actually refer to in this
context, if any?

At Cambridge University Press, we have found a modification of what we
thought was MI to be very useful in lexicography for highlighting
significant collocates. Roughly speaking, we multiply the association ratio
("mutual information") again by the frequency of the co-occurrence, thus:

         frequency(words together)^2
    ---------------------------------
    frequency(wordA)*frequency(wordB)

Other factors also come into the equation we use: subcorpus size, average
gap or distance between the words, fixedness of position.

Multiplying again by the co-occurrence frequency is useful in our
application, as we only want frequent co-occurrences for inclusion in our
dictionaries. We typically exclude co-occurrences that occur fewer than 3
times, but there can still be a small count problem, though trained
lexicographers can dismiss such cases easily.
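
To make the comparison concrete, here is a minimal Python sketch of the two
scores discussed above. It is only an illustration, not the code we actually
use: the function name, the `corpus_size` parameter and the example counts
are assumptions made up for this sketch, and the other factors mentioned
above (subcorpus size, gap between the words, fixedness of position) are
left out.

```python
def collocation_scores(f_ab, f_a, f_b, corpus_size, min_cooccurrence=3):
    """Return (association_ratio, squared_score) for a word pair.

    f_ab        -- number of times the two words occur together
    f_a, f_b    -- individual frequencies of word A and word B
    corpus_size -- total number of tokens (assumed normaliser for the ratio)

    Returns None when the pair co-occurs fewer than min_cooccurrence times.
    """
    if f_ab < min_cooccurrence:
        return None
    # Association ratio: observed co-occurrence over the count expected
    # if the two words were independent.
    association_ratio = (f_ab * corpus_size) / (f_a * f_b)
    # Variant from this message: co-occurrence frequency squared,
    # divided by the product of the individual frequencies.
    squared_score = f_ab ** 2 / (f_a * f_b)
    return association_ratio, squared_score


# Hypothetical counts in a corpus of one million tokens.
rare_pair = collocation_scores(3, 30, 30, 1_000_000)
frequent_pair = collocation_scores(300, 1000, 3000, 1_000_000)
too_rare = collocation_scores(2, 30, 30, 1_000_000)
```

With these made-up counts the plain association ratio ranks the rare pair
above the frequent one, while the squared variant reverses that order. A
pair seen only twice is dropped by the threshold.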

>I describe in my 1993 CL paper a different (but related) measure which
>avoids most of the small count problems that are encountered in corpus
>analysis. The generalized log-likelihood measure that I advocated
>performs more effectively (subjectively) for finding multi-word units.

We find our measure works well too, subjectively. It would be great to
devise a more objective test for comparing these various statistics.
Obviously, the test would have to have some application in mind, like
lexicography.

>In unpublished work with Paul Mineiro here at Aptex, I have extended
>this measure with results which appear to be better even than the
>original log-likelihood measure.
>Others have advocated measures such as the Dice coefficient. In a
>comparative study, Beatrice Daille compared a variety of association
>measures against a hand-annotated list and found that the
>log-likelihood ratio test came out closest to human judges'
>performance. She evaluated quite a number of different measures in her
>work, but exactly which ones I cannot recall.

I've also heard people talk about chi-squared measures for this.

Are there any web sites which describe all these different measures with
references to Web papers rather than printed papers?

Andrew Harley
Systems Manager - ELT Reference
Cambridge University Press

Direct line: (01223)325880

http://www.cup.cam.ac.uk/elt