Re: authorship testing

Ted Dunning (ted@crl.nmsu.edu)
Fri, 2 Feb 1996 12:49:48 -0700 (MST)

Last year I received an article describing a technique for categorizing
documents by style and/or language. I think the same approach should work
for authorship. The title of the paper is:

N-Gram-Based Text Categorization
William B. Cavnar and John M. Trenkle
Environmental Research Institute of Michigan
P.O. Box 134001
Ann Arbor MI 48113-4001

i don't think that this will work well for authorship determination.
the trick in author id is that you have to look only at the items
which have nothing to do with content (and everybody assumes that the
language is constant). this generally requires a considerable amount
of human involvement in choosing the features of interest. with
n-gram approaches, it is sometimes a bit difficult to determine
*exactly* why a particular n-gram is highly represented; direct
examination and intuition are a bit misleading.
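
as a rough illustration of what hand-chosen, content-free features can
look like, here is a minimal sketch that computes rates per 1000 words
for a small list of function words. the word list, the function name and
the sample sentence are all assumptions made for the example; they are
not taken from the ERIM paper or from mosteller and wallace, and a real
study would choose the list with far more care.

```python
import re
from collections import Counter

# Hypothetical, hand-picked list of function words (illustrative only).
FUNCTION_WORDS = ("upon", "while", "whilst", "on", "by", "to", "of", "the")


def function_word_rates(text, vocabulary=FUNCTION_WORDS):
    """Return occurrences per 1000 words for each word in `vocabulary`."""
    # Crude tokenizer: lowercase runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {w: 0.0 for w in vocabulary}
    counts = Counter(words)
    return {w: 1000.0 * counts[w] / len(words) for w in vocabulary}


# Made-up sample text, ten words long.
sample_text = "Upon the hill the old fox sat by the road"
rates = function_word_rates(sample_text)
```

comparing such rate vectors between a disputed text and known samples
from each candidate author is the general idea; how the comparison is
scored is a separate statistical question.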

of course, it should be recognized that word and phrase frequencies
are simply frequencies of variable-length character n-grams, so in
that sense the n-gram methods are essentially equivalent to the
mosteller and wallace style methods.
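
the equivalence can be checked with a small sketch. it assumes a crude
tokenizer (lowercase runs of ascii letters) and text normalized to single
spaces with one space of padding at each end; under those assumptions the
count of a word equals the count of the character n-gram made of that
word plus its two delimiting spaces. the function names and the sample
sentence are made up for the example.

```python
import re
from collections import Counter


def normalize(text):
    """Return (word list, space-padded string with single-space separators)."""
    words = re.findall(r"[a-z]+", text.lower())
    return words, " " + " ".join(words) + " "


def char_ngram_counts(padded, n):
    """Count every character n-gram of length n with a sliding window."""
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def word_count_via_ngrams(padded, word):
    """Count `word` as the character n-gram ' word ' of length len(word) + 2."""
    return char_ngram_counts(padded, len(word) + 2)[" " + word + " "]


# Made-up sample; "other" and "theme" contain "the" but are not counted as it.
sample = "The cat saw the dog near the other theme."
words, padded = normalize(sample)
word_counts = Counter(words)
the_via_ngrams = word_count_via_ngrams(padded, "the")
```

the n-gram length differs from word to word, which is the "variable
length" part; a fixed-n method sees only fragments of longer words and
several short words at once.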

i also (as everybody must know by now) feel that there are better
statistical approaches than the ones used by the ERIM group. but the
best proof is in trying, and trying lots of approaches is a good thing.

there is an extensive literature regarding authorship determination by
statistical or pseudo-statistical methods. most of it is garbage. if
you read mosteller and wallace and invent a little bit beyond that,
you will be as far along as most anybody.