Cameron Smart's question is very interesting. My immediate reaction is
that if one is interested in the frequency of a bigram (sequence of two
words), the comparison would be with other bigrams, i.e. all other pairings
of immediately successive words in the corpora. The trouble is, though,
that the occurrences are not independent: successive bigrams overlap, so a
case of bigram X Y means that the next bigram must begin with Y, which makes
a case of Y Z more likely.
Is this the kind of failure of independence which can in practice be
ignored? My feeling for statistics is not strong enough to give an answer.
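
To make the overlap concrete, here is a minimal sketch in Python. It assumes a text that is already split into words; the function names and the toy sentence are invented for illustration and come from no particular corpus or toolkit. It computes the frequency of a bigram relative to all other pairings of successive words, and shows that each bigram shares a word with the one that follows it. It does not settle whether the dependence can be ignored in practice.

```python
from collections import Counter

def bigrams(tokens):
    """Return every pairing of immediately successive words."""
    return list(zip(tokens, tokens[1:]))

def relative_frequency(tokens, pair):
    """Frequency of `pair` relative to all bigram occurrences in the text."""
    pairs = bigrams(tokens)
    if not pairs:
        return 0.0
    return Counter(pairs)[pair] / len(pairs)

def overlap_share(tokens):
    """Share of bigrams whose second word is the first word of the next bigram."""
    pairs = bigrams(tokens)
    if len(pairs) < 2:
        return 0.0
    linked = sum(1 for a, b in zip(pairs, pairs[1:]) if a[1] == b[0])
    return linked / (len(pairs) - 1)

# Hypothetical toy text, invented for illustration only.
toy = "the cat sat on the mat and the cat slept".split()

print(relative_frequency(toy, ("the", "cat")))
print(overlap_share(toy))
```

In running text the overlap share is always 1: every bigram X Y is followed by a bigram beginning with Y, so the bigram occurrences are not independent draws.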
G.R. Sampson, Professor of Natural Language Computing
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, GB
e-mail geoffs@cogs.susx.ac.uk
tel. +44 1273 678525
fax +44 1273 671320
web http://www.grsampson.net