Re: authorship testing

Adam.Kilgarriff@itri.brighton.ac.uk
Fri, 2 Feb 1996 21:03:11 GMT

Ted says:

> the trick in author id is that you have to look only at the items
> which have nothing to do with content (and everybody assumes that the
> language is constant). this generally requires a considerable amount
> of human involvement in choosing the features of interest.

This seems rather pessimistic. How about working out which words have
the most stable frequency (low variance) across a lot of documents,
and then seeing which low variance words have different frequencies in
the texts of the two authors (you could weight the
difference-between-authors by the inverse of variance, or similar).
I've produced variance figures for word-POS pairs from the BNC, which
could be bodged into use for the task. Let me know if anyone's
interested. (If there's much interest I'll tidy them up and put them
on the net.)
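
The weighting suggested above could be sketched roughly as follows in
Python. This is an illustration only, not the method used to produce
the BNC figures: the function names, the restriction to words that
occur in every background document (so that rare words do not look
"stable" simply by being absent), the use of plain words rather than
word-POS pairs, and the EPS constant are all assumptions.

```python
from collections import Counter
from statistics import pvariance

EPS = 1e-12  # assumed guard against division by zero for zero-variance words


def relative_freqs(tokens):
    """Map each word to its relative frequency in one tokenised text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def frequency_variance(background_docs):
    """Variance of each word's relative frequency across background docs.

    Only words present in every document are kept (an assumption made
    here so that very rare words are not mistaken for stable ones).
    """
    per_doc = [relative_freqs(doc) for doc in background_docs]
    common = set(per_doc[0]).intersection(*per_doc[1:])
    return {w: pvariance([freqs[w] for freqs in per_doc]) for w in common}


def discriminating_words(variances, author_a_tokens, author_b_tokens, top_n=10):
    """Rank stable words by between-author difference over variance."""
    freq_a = relative_freqs(author_a_tokens)
    freq_b = relative_freqs(author_b_tokens)
    scores = {}
    for word, var in variances.items():
        diff = abs(freq_a.get(word, 0.0) - freq_b.get(word, 0.0))
        scores[word] = diff / (var + EPS)  # inverse-variance weighting
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

In use, frequency_variance would be run once over a large general
corpus, and discriminating_words then applied to the undisputed texts
of the two candidate authors. The highest-scoring words are the ones
whose frequency is normally steady but differs between the two.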

By the way, Peter Fairley,

(1) how much disputed text do you have? (in electronic form?)
(2) how many candidate authors?
(3) how much undisputed text, of a similar genre to the
disputed text, in e-form, do you have for each candidate author?


Adam

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow                              tel: (44) 1273 642919
Information Technology Research Institute         (44) 1273 642900
University of Brighton                       fax: (44) 1273 606653
Lewes Road
Brighton BN2 4AT             email: Adam.Kilgarriff@itri.bton.ac.uk
UK                      http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%