Corpora: Cameron Smart's q about chi-squared test for bigrams

From: Geoffrey Sampson (geoffs@cogs.susx.ac.uk)
Date: Thu Dec 20 2001 - 12:45:29 MET

Cameron Smart's question is very interesting. My immediate reaction is
that if one is interested in the frequency of a bigram (sequence of two
words), the comparison would be with other bigrams, i.e. all other pairings
of immediately successive words in the corpora. The trouble is, though,
that the observations are not independent: if there is an occurrence of
bigram X Y, then that makes it more likely that there will be an
occurrence of Y Z, since the very next bigram must begin with Y.
Is this the kind of failure of independence which can in practice be
ignored? My feeling for statistics is not strong enough to give an answer.
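
To make the comparison concrete, here is a minimal sketch of one common
way of setting it up: a 2x2 contingency table that sets the bigram of
interest against all other bigrams in the text, scored with Pearson's
chi-squared statistic. The function name bigram_chi_squared and the toy
token list are hypothetical, introduced only for illustration. The
calculation treats every bigram as an independent observation, which is
exactly the assumption questioned above, so it does not settle that
question.

```python
from collections import Counter


def bigram_chi_squared(tokens, first, second):
    """Pearson chi-squared for (first, second) against all other bigrams.

    Assumption: each pair of adjacent tokens is counted as one independent
    observation, even though neighbouring bigrams overlap by one word.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    counts = Counter(bigrams)

    o11 = counts[(first, second)]                      # first then second
    first_total = sum(1 for a, _ in bigrams if a == first)
    second_total = sum(1 for _, b in bigrams if b == second)
    o12 = first_total - o11                            # first then other
    o21 = second_total - o11                           # other then second
    o22 = n - o11 - o12 - o21                          # other then other

    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        # A marginal total is zero, so the statistic is undefined;
        # this sketch returns 0.0 rather than raising.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# Toy data, invented for illustration only.
tokens = "a b a b a b c d c d".split()
score = bigram_chi_squared(tokens, "a", "b")
```

With one degree of freedom, the score would conventionally be compared
with a critical value such as 3.84 (p = 0.05), though that threshold
also presupposes independent observations.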

G.R. Sampson, Professor of Natural Language Computing

School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, GB

e-mail geoffs@cogs.susx.ac.uk
tel. +44 1273 678525
fax +44 1273 671320
web http://www.grsampson.net
