Re: Query: Letter frequencies for text identification

Ted Dunning (ted@crl.nmsu.edu)
Fri, 18 Aug 1995 16:02:42 -0600

Date: Fri, 18 Aug 1995 16:23:53 -0500
From: gfowler@indiana.edu (George Fowler)

Greetings!
I am posting this inquiry for Sergei Atamas [ who is
interested in applying letter counting schemes from DNA to
text ]

Specifically, Sergei wonders if any Corpora subscribers could help
steer him to recent literature concerning text identification based on
letter frequencies.

actually, i think that the technology transfer has gone mostly
the other direction (from text processing to genetic sequence
analysis). a good example is the article that Owen White and I did
using software that I originally developed for the analysis of texts.
the reference is

White, Owen, Ted Dunning, Granger Sutton, Mark Adams, J.Craig Venter
and Chris Fields (1993). A quality control algorithm for DNA
sequencing projects. Nucleic Acids Research, 1993, Vol. 21, No. 16.

the quick description was that we were able to determine with 85-90%
accuracy whether very short sequences (300 base pairs) came from human
or yeast dna. as it turned out, a large french lab had used a
contaminated library to produce a large number of sequences that they
purported to be human. the effort using my software was the first to
show this contamination, but as the standard databases grew, a german
group was able to demonstrate the contamination by more conventional
database search techniques.

This work was also described on 19 March 1993 in Science, Volume 259
Number 5102 in the squib entitled

"News and Comment" Genome shortcut leads to problems
Genome databases worry about yeast (and other) infections

(unfortunately science didn't mention the origin of the software).

a more detailed description of algorithms used as applied to the
problem of language identification can be found in my crl tech report

MCCS-94-273 - Statistical Identification of Language. Dunning,
Ted (1994)

i can provide this last article as well as the one referenced in the
NAR article in postscript form.
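
for readers who want the flavor of this kind of classifier, here is a
minimal sketch in python. it is not the code from the tech report or
the NAR article: it simply scores a sample against add-one smoothed
character bigram counts for each candidate source and picks the best
score. the function names, the tiny training strings and the
alphabet_size of 27 are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(text, n=2):
    """Count the n-grams in a training sample."""
    return Counter(ngrams(text, n))


def log_likelihood(sample, counts, n=2, alphabet_size=27):
    """Log-probability of the sample's n-grams under an add-one smoothed model.

    alphabet_size is an assumed symbol count (26 letters plus space).
    """
    total = sum(counts.values())
    vocab = alphabet_size ** n  # number of possible n-grams
    return sum(math.log((counts[g] + 1) / (total + vocab))
               for g in ngrams(sample, n))


def identify(sample, models, n=2):
    """Return the label whose model gives the sample the highest score."""
    return max(models, key=lambda label: log_likelihood(sample, models[label], n))


# Toy training data, far too small for real use.
models = {
    "english": train("the quick brown fox jumps over the lazy dog and then the cat"),
    "spanish": train("el rapido zorro marron salta sobre el perro perezoso y luego el gato"),
}
print(identify("the dog and the fox", models))
```

the same scoring works on any alphabet, so a dna version would only
need training sequences over A, C, G, T and alphabet_size=4.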

an interesting effort along these lines was done by a group in
michigan and reported in the second TREC proceedings and at SDAIR:

@inproceedings{cavnar,
author = {Cavnar, William B. and Trenkle, John M.},
title = {N-Gram-Based Text Categorization},
booktitle = {1994 Symposium on Document Analysis and Information Retrieval in Las Vegas},
year = {1994}
}
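
the idea usually associated with this paper is to compare rank-ordered
n-gram profiles with an "out-of-place" distance. the python sketch
below is a simplified illustration of that idea, not the authors'
code: the profile size, the range of n-gram lengths, the tie-breaking
rule and the penalty for missing n-grams are assumptions, and the
function names are hypothetical.

```python
from collections import Counter


def profile(text, max_n=3, size=300):
    """Rank the most frequent character n-grams (n = 1..max_n) of text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams absent from the category get a fixed penalty."""
    penalty = len(cat_profile)
    return sum(abs(rank - cat_profile[g]) if g in cat_profile else penalty
               for g, rank in doc_profile.items())


def categorize(text, cat_profiles):
    """Return the category whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(cat_profiles, key=lambda label: out_of_place(doc, cat_profiles[label]))
```

in use, one profile is built per category from training text, and a
new document is assigned to whichever category gives the smallest
distance.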

more recently, a similar method has been reinvented by marc damashek
of the department of defense. this work was reported recently in
science.

one method that did travel from the molecular biology community to
language processing is the dotplot diagram that ken church and
others have been using lately for parallel text alignment. this
method was reported in the molecular biology literature as far back as
the early 80's, but has only recently been noticed in the NLP
community.
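
a dotplot is easy to sketch: put one sequence along each axis and mark
every cell where the two agree. the python below is a minimal
illustration that matches k-character windows between two strings; the
window length k=3, the function names and the text rendering are
assumptions, and it is not the alignment software mentioned above.

```python
def dotplot(a, b, k=3):
    """Return (i, j) pairs where the k-gram starting at a[i] equals the one at b[j]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(a) - k + 1)
            for j in index.get(a[i:i + k], [])]


def render(a, b, k=3):
    """Draw the dotplot as text: rows follow a, columns follow b."""
    dots = set(dotplot(a, b, k))
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(b) - k + 1))
        for i in range(len(a) - k + 1))


print(render("ACGTACGT", "TTACGTAA"))
```

shared stretches show up as diagonal runs of marks, which is what
makes the picture useful both for sequence comparison and for lining
up a text with its translation.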