Re: Corpora: MWUs and frequency; try Relative Frequency

Philip Resnik (resnik@umiacs.umd.edu)
Mon, 12 Oct 1998 10:26:18 -0400 (EDT)

Andrew Harley <aharley@cup.cam.ac.uk> wrote:
> I am interested to hear that "mutual information" is the wrong term for
> this. What statistic would "mutual information" actually refer to in this
> context, if any?

One possible source of terminological confusion (though I'm not sure
if this is what Ted meant) is the fact that "mutual information" in
information theory relates two random variables X and Y (e.g., see
Cover, T. and J. Thomas, Elements of Information Theory, Wiley, 1991):

I(X;Y) = Sum_{x,y} Pr(x,y) log [ Pr(x,y) / (Pr(x)Pr(y)) ]

This is equal to the expected value of the "misleadingly" termed
quantity:

I(x,y) = log [ Pr(x,y) / (Pr(x)Pr(y)) ]

One solution to this confusion that I've seen is to refer to the
former quantity as "average mutual information" and the latter as
"pointwise mutual information", which seems as good a way to go as
any, since completely renaming either quantity is a practical
impossibility.
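As a toy illustration of the distinction (a sketch only; the joint counts and variable names below are hypothetical, not from any corpus), the "pointwise" quantity is computed per outcome pair, and the "average" quantity is its expectation under the joint distribution:

```python
# Sketch: pointwise vs. average mutual information for two binary
# variables, using hypothetical joint counts.
from math import log2

# Hypothetical joint counts for outcomes of X in {0,1} and Y in {0,1}
joint = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
total = sum(joint.values())

p_xy = {k: v / total for k, v in joint.items()}            # Pr(x,y)
p_x = {x: sum(p for (a, _), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in p_xy.items() if b == y) for y in (0, 1)}

def pmi(x, y):
    """Pointwise MI: I(x,y) = log [ Pr(x,y) / (Pr(x)Pr(y)) ]."""
    return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# Average MI: I(X;Y) = Sum_{x,y} Pr(x,y) * I(x,y) -- the expected
# value of the pointwise quantity, and always non-negative.
avg_mi = sum(p_xy[(x, y)] * pmi(x, y) for (x, y) in p_xy)
```

Note that individual pointwise values can be negative (when a pair co-occurs less often than chance predicts), while their expectation, the average mutual information, cannot.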

Incidentally, the modification of their MI-like association ratio,

                                frequency(words together)
    frequency(words together) * --------------------------------- ,
                                frequency(wordA)*frequency(wordB)

seems related in spirit to a measure of association that I proposed,
which I have called "selectional association" (because it was
developed in the context of a model of selectional preferences,
e.g. of verbs for their arguments). It can be written:

                                     Prob(x and y together)
A(x,y) = (1/Norm) Prob(x|y) log  ---------------------------
                                 Prob(x alone)*Prob(y alone)

where Norm(alization) is the sum of A(x,y) over all x. Like Andrew, I
multiplied the association ratio I was using (in my case, pointwise
mutual information) by an additional application of frequency (here,
the conditional probability of x given y). This had better behavior
than pointwise mutual information for similar reasons (avoiding
problems associated with low-frequency values of x, given y; note the
asymmetry). [P. Resnik, (1996) "Selectional constraints: an
information-theoretic model and its computational realization",
Cognition 61, pp. 127-159.]
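A minimal sketch of the measure as written above (the counts and class names are made-up toy data, and the setup assumes each corpus token falls in exactly one argument class -- this is purely illustrative, not the implementation from the paper):

```python
# Sketch: selectional association A(x,y) for argument classes x of a
# fixed verb y, with hypothetical counts.
from math import log2

# Hypothetical co-occurrence counts of argument classes x with verb y
counts_xy = {"food": 30, "tool": 5, "person": 15}    # freq(x and y together)
counts_x  = {"food": 40, "tool": 40, "person": 120}  # freq(x alone)
total   = sum(counts_x.values())   # toy corpus size
count_y = sum(counts_xy.values())  # freq(y alone), assuming y always
                                   # takes one of these classes
p_y = count_y / total

def score(x):
    """Prob(x|y) * log [ Prob(x,y) / (Prob(x)Prob(y)) ], unnormalized."""
    p_x  = counts_x[x] / total
    p_xy = counts_xy[x] / total
    p_x_given_y = counts_xy[x] / count_y
    return p_x_given_y * log2(p_xy / (p_x * p_y))

norm = sum(score(x) for x in counts_x)      # Norm: sum over all x
A = {x: score(x) / norm for x in counts_x}  # A(x,y); sums to 1 over x
```

The conditional-probability factor is what gives the asymmetry noted above: a class x that is rare given y contributes little even if its pointwise MI with y is large.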

Philip

----------------------------------------------------------------
Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies

1401 Marie Mount Hall            UMIACS phone:      (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax:               (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail:            resnik@umiacs.umd.edu