Re: scaling/norming

Adam Kilgarriff (ak28@it-research-institute.brighton.ac.uk)
Wed, 6 Dec 95 13:50:05 GMT

John,

I hope this helps with applying the log-likelihood test to your
problem.

I found the formulation of Ted Dunning's test in Daille:95 (ref below)
easier to handle than the version in the CL paper (the two are
equivalent). Daille uses the test to give a statistic which measures
the strength of the bond between two words (as in a collocation).

First, take the contingency table for words x and y occurring with
each other or not:

              |   y   not-y
        ------+------------
          x   |   a     b
        not-x |   c     d

a+b = no. of occurrences of x, a+c = no. of occurrences of y, and
a+b+c+d = number of words in the corpus.

then the log-likelihood stat is

    2 [ a log a + b log b + c log c + d log d
        - (a+b) log(a+b) - (a+c) log(a+c)
        - (b+d) log(b+d) - (c+d) log(c+d)
        + (a+b+c+d) log(a+b+c+d) ]

This is a version of the formula in Dunning (93), p. 71. (Daille does
some interesting empirical checks to see which test for "termhood"
works best against a hand-annotated gold standard, and log-likelihood
comes out best, so there's empirical vindication for the theoretically
correct answer, for those of us who need such reassurance;-).)
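
To make the formula concrete, here's a worked evaluation (natural
logs throughout, which is what awk's log() computes) for a=10,
b=1000, c=20, d=1000 - the first input line in the program examples
further down:

    a log a + b log b + c log c + d log d   =  13898.4511
    (a+b)log(a+b) + (a+c)log(a+c)
        + (b+d)log(b+d) + (c+d)log(c+d)     =  29356.8326
    (a+b+c+d)log(a+b+c+d)                   =  15460.0559

    stat = 2 * (13898.4511 - 29356.8326 + 15460.0559)
         = 2 * 1.6744
         = 3.3487 (approx.)

which agrees with the program output (3.34872) below.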

To adapt the contingency table to your situation, e.g. comparing two
corpora: x becomes "presence of feature" and not-x becomes "absence
of feature" (you'd need to establish how many non-occurrences of the
feature there were, so if it is a clausal feature, this would be the
number of *clauses* in which it did not occur). y is then corpus-1
and not-y is corpus-2; a made-up example follows.
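
For instance (counts invented purely for illustration), suppose the
feature occurs in 10 clauses of corpus-1 and 20 clauses of corpus-2,
and is absent from 1000 clauses of each:

                  | corpus-1  corpus-2
        ----------+-------------------
          present |    10        20
          absent  |  1000      1000

so a=10, b=20, c=1000, d=1000, fed to the program below as the line
"10 20 1000 1000". This gives the same value (3.3487) as the first
example line "10 1000 20 1000", since the statistic is unchanged when
the table is transposed.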

Compute the stat, then look the result up in chi-square tables (one
degree of freedom) to see whether the null hypothesis - that the
feature is equally common in the two corpora - is rejected at, e.g.,
the 97.5% or 99.5% confidence level.

Here's a noddy awk program which computes the stat:

-----------cut here--------------

# for contingency table
#
#           |   y   not-y
#     ------+------------
#       x   |   a     b
#     not-x |   c     d
#
# form of input to this prog is a line with
# a b c d

# n*log(n), with 0*log(0) taken as 0, so a zero count doesn't
# produce a log(0) error
function xlogx(n) { return (n > 0) ? n * log(n) : 0 }

{
    a = $1; b = $2; c = $3; d = $4

    # log-likelihood statistic (natural logs)
    stat = 2 * ( xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) \
               - xlogx(a+b) - xlogx(a+c) \
               - xlogx(b+d) - xlogx(c+d) \
               + xlogx(a+b+c+d) )

    print "log-likelihood is ", stat
}

----------end here------------
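
To run it, save the program as, say, ll.awk (any filename will do)
and pipe the four counts in on one line:

    echo "10 1000 20 1000" | awk -f ll.awk

which should print the first output line shown below.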

and here's what the program does (input line followed by output line)

10 1000 20 1000
log-likelihood is 3.34872
10 1000 30 1000
log-likelihood is 10.2689
1 1000 10 1000
log-likelihood is 8.50697
1 1000 1 10000
log-likelihood is 2.21309
1 1000 1 100000
log-likelihood is 6.47658
1 1000 6 1000
log-likelihood is 3.94998
1 1000 7 1000
log-likelihood is 5.0441

The critical chi-square value at the 97.5% confidence level (1 DF) is
5.02, so, if we are in hypothesis-testing mode, we reject the null
hypothesis (and conclude that the feature does have different
probabilities in the two language-varieties of which the 2 corpora
are samples) where the stat is over that (or over 7.88 if we want the
99.5% level). A sketch that builds this decision into the program
follows.
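
If you want the program to print the verdict too, here's a sketch
(same statistic as above; the critical values are the ones just
quoted, and the wording of the verdict is my own):

# input is a line with: a b c d
function xlogx(n) { return (n > 0) ? n * log(n) : 0 }

{
    a = $1; b = $2; c = $3; d = $4
    stat = 2 * ( xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) \
               - xlogx(a+b) - xlogx(a+c) \
               - xlogx(b+d) - xlogx(c+d) \
               + xlogx(a+b+c+d) )

    # compare against the chi-square critical values (1 DF) above
    if      (stat > 7.88) verdict = "significant at the 99.5% level"
    else if (stat > 5.02) verdict = "significant at the 97.5% level"
    else                  verdict = "not significant at the 97.5% level"

    print "log-likelihood is", stat, "(" verdict ")"
}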

ref:

@TechReport{Daille:95,
  author      = "B\'{e}atrice Daille",
  title       = "Combined Approach for Terminology Extraction:
                 lexical statistics and linguistic filtering",
  institution = "{UCREL}, Lancaster University",
  year        = 1995,
  number      = 5
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff                             tel: (44) 1273 642919
Research Fellow                                  (44) 1273 642900
Information Technology Research Institute   fax: (44) 1273 606653
University of Brighton
Lewes Road                                  email:
Brighton BN2 4AT                            ak28@itri.bton.ac.uk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%