Query: Letter frequencies for text identification

George Fowler (gfowler@indiana.edu)
Fri, 18 Aug 1995 16:23:53 -0500

I am posting this inquiry for Sergei Atamas
(satamas@umabnet.ab.umd.edu), a research associate at the University of
Maryland at Baltimore. His field is molecular biology, and his work
involves comparing DNA strings using various algorithms. I don't understand
the details well enough to pass them along. At any rate, one such algorithm
relies upon the frequencies with which the letters G, A, T, and C occur in the
DNA strings. He would like to explore the analogous use of letter (sound)
frequencies in natural language texts. Hence this posting.

Specifically, Sergei wonders if any Corpora subscribers could help
steer him to recent literature concerning text identification based on
letter frequencies. Any suggestions could be sent directly to him at the
above address, or to me and I'll pass them along. He would also be
interested in collaborative work if this research connects with the work of
any linguists or text-processing specialists. He observes that very often
work in one field would actually help work in a far-removed field, if only
people knew what was going on over there.
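
For readers unfamiliar with the idea, here is a minimal sketch of what a letter-frequency comparison might look like. It is only an illustration, not Sergei's algorithm, whose details are not given in this posting. The function names (letter_frequencies, profile_distance) and the choice of a sum-of-absolute-differences measure are assumptions made for the example.

```python
from collections import Counter
import string


def letter_frequencies(text, alphabet=string.ascii_lowercase):
    """Return the relative frequency of each alphabet symbol in text.

    Characters outside the alphabet (spaces, punctuation, digits) are ignored.
    """
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        # No usable symbols: return an all-zero profile instead of dividing by zero.
        return {ch: 0.0 for ch in alphabet}
    return {ch: counts[ch] / total for ch in alphabet}


def profile_distance(p, q):
    """Sum of absolute differences between two frequency profiles.

    This measure is an assumption chosen for illustration: it is 0 for
    identical profiles and 2 for profiles that share no symbols.
    """
    keys = set(p) | set(q)
    return sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in keys)


if __name__ == "__main__":
    # The same function covers the DNA case by restricting the alphabet.
    dna = letter_frequencies("GATTACA", alphabet="acgt")
    print(dna)

    # Compare two short natural-language samples.
    a = letter_frequencies("The quick brown fox jumps over the lazy dog.")
    b = letter_frequencies("Pack my box with five dozen liquor jugs.")
    print(profile_distance(a, b))
```

The same two functions serve both cases: passing alphabet="acgt" gives a G/A/T/C profile for a DNA string, while the default alphabet gives a letter profile for English text. A smaller distance between two profiles would suggest more similar sources, though any real identification method would need far longer samples than these.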
