Corpora: Cameron Smart's q about chi-squared test for bigrams

From: Geoffrey Sampson (geoffs@cogs.susx.ac.uk)
Date: Thu Dec 20 2001 - 12:45:29 MET

    Cameron Smart's question is very interesting. My immediate reaction is
    that if one is interested in the frequency of a bigram (sequence of two
    words), the comparison would be with other bigrams, i.e. all other pairings
    of immediately successive words in the corpora. The trouble is, though,
    that the probabilities are not independent; if there is a case of
    bigram X Y, then that makes it more likely that there will be a case of Y Z.
    Is this the kind of failure of independence which can in practice be
    ignored? My feeling for statistics is not strong enough to give an answer.
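
To make the comparison concrete, here is a minimal sketch of the usual 2x2 chi-squared calculation for one bigram against all other bigrams in a corpus. It is not taken from Cameron Smart's question. The function name `bigram_chi_squared` and the toy token list are illustrative assumptions. The calculation treats every bigram token as an independent observation, which is exactly the assumption questioned above, so the score should be read with that caveat.

```python
from collections import Counter


def bigram_chi_squared(tokens, x, y):
    """Pearson chi-squared for bigram (x, y) versus all other bigrams.

    Assumes, questionably, that successive bigram tokens are independent.
    """
    # All pairs of immediately successive words.
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    counts = Counter(bigrams)

    # 2x2 contingency table of observed counts.
    o11 = counts[(x, y)]                                      # x followed by y
    o12 = sum(1 for a, _ in bigrams if a == x) - o11          # x followed by non-y
    o21 = sum(1 for _, b in bigrams if b == y) - o11          # non-x followed by y
    o22 = n - o11 - o12 - o21                                 # neither

    # Product of the four marginal totals.
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        # A marginal total is zero (e.g. x or y never occurs); no test possible.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# Hypothetical toy corpus, for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
score = bigram_chi_squared(tokens, "the", "cat")
```

Note that in the toy corpus every occurrence of "the cat" forces the next bigram to begin with "cat", so the nine bigram tokens are not nine independent draws. The sketch does nothing to correct for that.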

    G.R. Sampson, Professor of Natural Language Computing

    School of Cognitive & Computing Sciences
    University of Sussex
    Falmer, Brighton BN1 9QH, GB

    e-mail geoffs@cogs.susx.ac.uk
    tel. +44 1273 678525
    fax +44 1273 671320
    web http://www.grsampson.net


