Corpora: MI and n-grams

John Milton (lcjohn@uxmail.ust.hk)
Sat, 25 Apr 1998 18:23:22 +0800 (HKT)

Almost two months ago I asked for pointers to work on applying Mutual
Information or similar concepts to the collocational properties of n-grams
longer than 2. Thanks to those who replied; I hope the following
references are useful to others:

Michael Barlow:
Kita et al. have a paper in the TALC 94 proceedings. They refer to Jelinek
for a suggestion about MI for n > 2.
Jelinek, F. 1990. Self-Organized Language Modeling for Speech Recognition.
In A. Waibel and K.F. Lee (eds.) Readings in Speech Recognition.
...............
Tony Rose:
TG Rose, LJ Evett & MJ Lee (1994) "Contextual analysis for text
recognition: a comparison with human performance", AISB Quarterly,
Summer 1994.
-- In this we investigate a number of statistical measures including the
association ratio, which is derived from mutual information but
applied so that word order constraints are observed. All the measures
are applied across a 'window' of four words, thus including n-grams
of length 2, 3 and 4.
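
As a rough illustration of the kind of measure described above, here is a
minimal Python sketch of an order-sensitive association ratio computed over
a window of words. It is not the implementation from the paper: the function
name, the whitespace-tokenised input and the (window - 1) normalisation are
assumptions made for this example only.

```python
import math
from collections import Counter


def association_ratios(tokens, window=4):
    """Score ordered word pairs with log2(P(x, y) / (P(x) * P(y))).

    A pair (x, y) is counted only when y follows x inside the window,
    so (x, y) and (y, x) get separate scores.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        # A window of `window` words is x plus the next window - 1 words.
        for y in tokens[i + 1 : i + window]:
            pair[(x, y)] += 1

    scores = {}
    for (x, y), f_xy in pair.items():
        # Assumed normalisation: each position yields up to window - 1 pairs.
        p_xy = f_xy / (n * (window - 1))
        p_x = unigram[x] / n
        p_y = unigram[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    text = "strong tea is strong tea".split()
    for pair_, score in sorted(association_ratios(text, window=2).items()):
        print(pair_, round(score, 3))
```

With window=2 this reduces to adjacent bigrams; larger windows also pick up
pairs separated by intervening words, while still keeping word order.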
...............
Lluis Padro:
In my PhD thesis I used MI computed for any context pattern: bigrams,
trigrams, or any other context relation. You can get it from cmp-lg or
from the "research" link on my web page
http://www.lsi.upc.es/~padro
...............
Nicolas Turenne:
Mikheev,A Finch,S "Towards a workbench for acquisition of domain
knowledge" 1992
Lelu,A Halleb,M Delprat,B "Recherche d'information et cartographie dans
des corpus textuels à partir des fréquences de n-grammes" to appear in
JADT'98 1998
................
Philip Resnik:
David Magerman and Mitch Marcus (1990, "Parsing a Natural Language
Using Mutual Information Statistics", Proceedings of AAAI) define
"Generalized Mutual Information" over n-grams.
.....................
Dan Melamed:
Dan Melamed, Automatic Discovery of Non-Compositional Compounds
in Parallel Data, 2nd Conference on Empirical Methods in Natural Language
Processing (EMNLP'97), Providence, RI, 1997.
( download from http://www.cis.upenn.edu/~melamed/)
....................
Paul Rayson:
UCREL technical paper number 5: Combined Approach for Terminology
Extraction: Lexical Statistics and Linguistic Filtering. Beatrice Daille.
....................
Ole Norling-Christensen:
In John Sinclair's "Corpus, Concordance, Collocation" (OUP 1991),
pages 105-106, a method was outlined. It was described in more
detail by JS in the deliverable D of the EC funded project MECOLB
(MLAP93-21) and a program called TYPICAL was presented. Based on this
description I made a similar program for Danish, which is used on a
regular basis by the Danish Dictionary project and is very useful. A
regular paper is pending:
John Sinclair, Oliver Mason, Jackie Ball and Geoff Barnbrook
(1997) ``Language Independent Statistical Software for Corpus Exploration''
[to appear in CompHum]
.......................
Kenneth W. Church:
Mikio Yamamoto, Residual IDF: like Mutual Information, but Different
(paper submitted to the upcoming Coling)