Re: scaling/norming

Adam Kilgarriff
Wed, 6 Dec 95 13:50:05 GMT


Hope this helps for how to apply the log-likelihood test to
your problem.

I found the formulation of Ted Dunning's test in Daille:95 (ref below)
easier to handle than the version in the CL paper (they are
equivalent). Daille uses the test to give a statistic which measures
the strength of bond between 2 words (as in a collocation).

First, take the contingency table for words x and y occurring with
each other or not

          |   y     not-y
    ---------------------
    x     |   a       b
    not-x |   c       d

Here a+b = no. of occurrences of x, a+c = no. of occurrences of y, and
a+b+c+d = number of words in the corpus.

then the log-likelihood stat is

    2 [ a log a + b log b + c log c + d log d
        - (a+b) log (a+b) - (a+c) log (a+c)
        - (b+d) log (b+d) - (c+d) log (c+d)
        + (a+b+c+d) log (a+b+c+d) ]

(log is the natural logarithm, as needed for the chi-square look-up.)

This is a version of the formulae in Dunning (93) p 71. (Daille does
some interesting empirical checks to see which test for "termhood"
works best against a hand-annotated "gold-standard", and
log-likelihood comes out best, so there's empirical vindication for
the theoretically correct answer, for those of us who need such
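
For anyone who would rather check the formula than take it on trust, here is a minimal Python sketch. It computes the stat in the form given above and also in the familiar "observed over expected" form, 2 * sum(O * ln(O/E)), so you can see that the two agree. The function names are my own, and treating 0 * log(0) as 0 is the usual convention rather than anything stated in Daille or Dunning.

```python
import math

def xlogx(n):
    """n * ln(n), with the usual convention that 0 * ln(0) = 0."""
    return n * math.log(n) if n > 0 else 0.0

def log_likelihood(a, b, c, d):
    """Log-likelihood stat for the 2x2 table, in the expanded form above."""
    return 2 * (
        xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
        - xlogx(a + b) - xlogx(a + c)
        - xlogx(b + d) - xlogx(c + d)
        + xlogx(a + b + c + d)
    )

def log_likelihood_obs_exp(a, b, c, d):
    """The same stat written as 2 * sum(O * ln(O / E))."""
    n = a + b + c + d
    # (observed count, its row total, its column total) for each cell
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    total = 0.0
    for obs, row, col in cells:
        if obs > 0:  # empty cells contribute nothing
            total += obs * math.log(obs * n / (row * col))
    return 2 * total

print(log_likelihood(10, 1000, 20, 1000))
print(log_likelihood_obs_exp(10, 1000, 20, 1000))
```

Both calls should print a value close to the 3.34872 reported for the input line "10 1000 20 1000" in the sample runs further down.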

To adapt the contingency table to your situation, eg, comparing two
corpora, x becomes "presence of feature", not-x becomes "absence of
feature" (and you'd need to establish how many non-occurrences of a
feature there were, so if the feature was a clausal feature, this
would be the number of *clauses* where it did not occur). y would
then be corpus-1 and not-y would be corpus-2.
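
The mapping just described can be sketched in a few lines of Python. This assumes you already know, for each corpus, how many units (clauses, say) contain the feature and how many units there are in total. The function name, argument names and figures are invented for illustration.

```python
def contingency_from_counts(hits_1, units_1, hits_2, units_2):
    """Return (a, b, c, d) with x = feature present and y = corpus-1.

    hits_N  = number of units in corpus N where the feature occurs
    units_N = total number of units (e.g. clauses) in corpus N
    """
    if not (0 <= hits_1 <= units_1 and 0 <= hits_2 <= units_2):
        raise ValueError("hits must lie between 0 and the number of units")
    a = hits_1              # feature present, corpus-1
    b = hits_2              # feature present, corpus-2
    c = units_1 - hits_1    # feature absent,  corpus-1
    d = units_2 - hits_2    # feature absent,  corpus-2
    return a, b, c, d

# Invented figures: feature in 10 of 1010 clauses in corpus-1
# and in 20 of 1020 clauses in corpus-2.
print(contingency_from_counts(10, 1010, 20, 1020))
```

Note that it does not matter whether the corpora are laid out as columns (as here) or as rows: swapping b and c leaves the stat unchanged, because the formula treats row totals and column totals alike.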

Compute the stat and then look up in a set of chi-square tables (one
degree of freedom) to see whether the null hypothesis - that the
feature is equally common in each corpus - is rejected at, eg, 97.5% or
99.5% confidence level.
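
If you have no tables to hand, the one-degree-of-freedom case is easy to compute directly, since a chi-square variable with 1 DF is just the square of a standard normal one. Here is a small sketch using only the Python standard library; the function name is my own, and it is only valid for 1 DF.

```python
import math

def chi2_upper_tail_1df(stat):
    """P(chi-square with 1 DF > stat).

    Uses the identity P(Z*Z > s) = erfc(sqrt(s / 2)) for standard normal Z.
    """
    if stat < 0:
        raise ValueError("the stat cannot be negative")
    return math.erfc(math.sqrt(stat / 2.0))

# A stat near 5.02 leaves about 2.5% in the upper tail,
# and a stat near 7.88 leaves about 0.5%.
for s in (5.02, 7.88):
    print(s, chi2_upper_tail_1df(s))
```

A tail probability below 0.025 corresponds to rejecting at the 97.5% level, and one below 0.005 to rejecting at the 99.5% level.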

Here's a noddy awk program which computes the stat:

-----------cut here--------------

# for contingency table
#       |   y     not-y
# ---------------------
# x     |   a       b
# not-x |   c       d
# form of input to this prog is a line with
# a b c d
# (all four counts must be greater than zero, since log(0) is undefined)

{
a = $1; b = $2; c = $3; d = $4
stat = 2*(\
a*log(a) + b*log(b) + c*log(c) + d*log(d)\
- (a+b)*log(a+b) - (a+c)*log(a+c)\
- (b+d)*log(b+d) - (c+d)*log(c+d)\
+ (a+b+c+d)*log(a+b+c+d)\
)

print "log-likelihood is ", stat
}

----------end here------------

and here's what the program does (input line followed by output line)

10 1000 20 1000
log-likelihood is 3.34872
10 1000 30 1000
log-likelihood is 10.2689
1 1000 10 1000
log-likelihood is 8.50697
1 1000 1 10000
log-likelihood is 2.21309
1 1000 1 100000
log-likelihood is 6.47658
1 1000 6 1000
log-likelihood is 3.94998
1 1000 7 1000
log-likelihood is 5.0441

The critical chi-square value at the 97.5% confidence level (1 DF) is 5.02,
so, if we are in hypothesis-testing mode, we reject the null hypothesis (and
conclude that the feature does have different probabilities in the two
language-varieties of which the 2 corpora are samples) where
the stat is over that (or over 7.88 if we want to use the 99.5% level).
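
To make the decision rule concrete, here is a small Python sketch that applies the two critical values to the sample outputs listed above. The stats are copied from those runs rather than recomputed, and the function name and verdict strings are my own.

```python
CRIT_975 = 5.02   # chi-square critical value, 1 DF, 97.5% level
CRIT_995 = 7.88   # chi-square critical value, 1 DF, 99.5% level

def verdict(stat):
    """Say at which of the two levels, if any, the null hypothesis is rejected."""
    if stat > CRIT_995:
        return "reject at 99.5%"
    if stat > CRIT_975:
        return "reject at 97.5%"
    return "do not reject"

# (input line, stat) pairs copied from the sample runs above
samples = [
    ("10 1000 20 1000", 3.34872),
    ("10 1000 30 1000", 10.2689),
    ("1 1000 10 1000", 8.50697),
    ("1 1000 1 10000", 2.21309),
    ("1 1000 1 100000", 6.47658),
    ("1 1000 6 1000", 3.94998),
    ("1 1000 7 1000", 5.0441),
]

for line, stat in samples:
    print(line, "->", verdict(stat))
```

The last two lines are the instructive ones: 1 occurrence against 6 does not clear the 97.5% threshold, while 1 against 7 only just does.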


@techreport{Daille:95,
  author = "B\'{e}atrice Daille",
  title = "Combined Approach for Terminology Extraction:
           lexical statistics and linguistic filtering",
  institution = "{UCREL}, Lancaster University",
  year = 1995,
  number = 5
}

Adam Kilgarriff tel: (44) 1273 642919
Research Fellow (44) 1273 642900
Information Technology Research Institute fax: (44) 1273 606653
University of Brighton
Lewes Road email:
Brighton BN2 4AT