Re: Corpora: log likelihood statistic

Ted E. Dunning (ted@aptex.com)
Mon, 21 Jul 1997 16:09:58 -0700

Daniel,

actually, there are simpler formulations for the log likelihood ratio
statistic that i have become aware of since writing that article.

one formulation that is good because it is relatively mnemonic is the
following:

-2 \log \lambda = 2 N \left[ H(row sums) + H(column sums) - H(table) \right]

here, H(table), H(row sums) and H(column sums) are the entropies of
the entire contingency table and the row sums and column sums
respectively as estimated using maximum likelihood. note that you
should measure entropy in nats rather than bits (use log rather than
log_2) to make the units come out right with this form. another
formulation which is nice is

-2 \log \lambda = 2 \sum_{ij} k_{ij} \log {\pi_{ij} / \mu_i}

where k_{ij} is the count in table cell ij, and \pi_{ij} and \mu_i are
the maximum likelihood estimates of the probability of i given j and
of the overall probability of i,

\pi_{ij} = k_{ij} / \sum_i k_{ij}

and

\mu_i = \sum_j k_{ij} / \sum_{ij} k_{ij}

if this last formula is expanded in terms of row and column sums
R_j = \sum_i k_{ij} and C_i = \sum_j k_{ij}, and the total number of
observations is written as N = \sum_{ij} k_{ij}, you get

- 2 \log \lambda = 2 \sum_{ij} k_{ij} \log {\frac {k_{ij} N} {C_i R_j}}

(rendering this last formula for non-TeXites):

                    ----            k  N
                    \                ij
- 2 log lambda = 2  /     k   log  -------
                    ----   ij       C  R
                     ij              i  j

      --
R  =  >   k
 j    --   ij
      i

      --
C  =  >   k
 i    --   ij
      j

      --
N  =  >   k
      --   ij
      ij
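
as a rough illustration of the general formula, here is a minimal sketch
in python. the function name log_likelihood_ratio and the list-of-rows
table layout are assumptions made for this example, as is the convention
that a cell with a zero count contributes nothing to the sum (0 log 0
taken as 0). C and R follow the definitions above, and math.log is the
natural log, so the result is in nats.

```python
import math


def log_likelihood_ratio(table):
    """Return -2 log lambda for a rectangular table of counts.

    table is a list of rows, so table[i][j] plays the role of k_ij.
    """
    # C_i = sum over j of k_ij, R_j = sum over i of k_ij, N = grand total
    C = [sum(row) for row in table]
    R = [sum(col) for col in zip(*table)]
    N = sum(C)

    total = 0.0
    for i, row in enumerate(table):
        for j, k in enumerate(row):
            if k > 0:  # assumption: empty cells are skipped (0 log 0 = 0)
                total += k * math.log(k * N / (C[i] * R[j]))
    return 2.0 * total
```

calling log_likelihood_ratio([[10, 20], [20, 40]]) on a table whose cells
are exactly proportional to the products of their marginal totals gives
zero, since every ratio inside the log is then 1.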

for the 2x2 case, this is

                                k  N                 k  N
                                 11                   12
- 2 log lambda = 2 [  k   log  -------  +  k   log  -------
                       11       C  R        12       C  R
                                 1  1                 1  2

                                k  N                 k  N
                                 21                   22
                   +  k   log  -------  +  k   log  -------  ]
                       21       C  R        22       C  R
                                 2  1                 2  2

where

C  = k   + k
 1    11    12

C  = k   + k
 2    21    22

R  = k   + k
 1    11    21

R  = k   + k
 2    12    22

and

N = k   + k   + k   + k
     11    12    21    22
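
as a sanity check on the 2x2 case, the sketch below computes the
statistic both from the counts directly and from the entropy
formulation, and the two should agree. it assumes H is the usual
non-negative Shannon entropy, -sum p log p in nats, in which case the
bracket is H of one set of marginal totals plus H of the other minus
H of the whole table. all function and variable names are hypothetical,
and the marginal totals are named by which index is summed out so that
nothing depends on which set is called C and which R. the counts in the
usage note are invented for illustration and are not taken from the
article.

```python
import math


def entropy(counts):
    """Maximum likelihood Shannon entropy, in nats, of a list of counts."""
    n = sum(counts)
    # assumption: zero counts contribute nothing (0 log 0 = 0)
    return -sum(k / n * math.log(k / n) for k in counts if k > 0)


def llr_2x2_counts(k11, k12, k21, k22):
    """-2 log lambda from the four cell counts and their marginal totals."""
    tot_1x, tot_2x = k11 + k12, k21 + k22  # totals over the second index
    tot_x1, tot_x2 = k11 + k21, k12 + k22  # totals over the first index
    n = k11 + k12 + k21 + k22

    def term(k, a, b):
        # one cell's contribution: k log (k N / (a b)), zero for empty cells
        return k * math.log(k * n / (a * b)) if k > 0 else 0.0

    return 2.0 * (term(k11, tot_1x, tot_x1) + term(k12, tot_1x, tot_x2)
                  + term(k21, tot_2x, tot_x1) + term(k22, tot_2x, tot_x2))


def llr_2x2_entropy(k11, k12, k21, k22):
    """-2 log lambda as 2 N [H(marginals) + H(marginals) - H(table)]."""
    n = k11 + k12 + k21 + k22
    h_table = entropy([k11, k12, k21, k22])
    h_first = entropy([k11 + k12, k21 + k22])
    h_second = entropy([k11 + k21, k12 + k22])
    return 2.0 * n * (h_first + h_second - h_table)
```

for example, llr_2x2_counts(12, 88, 45, 855) and
llr_2x2_entropy(12, 88, 45, 855) should match to within floating point
rounding.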

i hope this helps. if there are other questions, please feel free to
ask. i can supply code in C to implement this statistic if it will
help.

>>>>> "dr" == Daniel Ridings <ridings@svenska.gu.se> writes:

dr> Ted Dunning writes about this in "Accurate Methods for the
dr> Statistics of Surprise and Coincidence" in Computational
dr> Linguistics Vol. 19, 1993.

dr> Could some kind soul walk me through the formula at the very
dr> end of the article, using the first bigram in Table 2 for
dr> illustration? The copy I have in front of me is a reprint in
dr> "Using Large Corpora" and there appears to be a misprint, so I
dr> won't repeat the formula here. (The misprint is in no way
dr> responsible for my dim wit).

dr> Daniel Ridings Gothenburg, Sweden