Re: Authorship Testing

Paul Holmes-Higgin (paul@inke.com)
Fri, 16 Feb 1996 15:54:46 +-100

Sorry for the time lag from the thread, I've been out
and about a bit...

There is a whole discipline that has been looking at
authorship attribution, called Stylometrics (which
charts its origins back to the end of the last century).
A large number of metrics have been proposed, many very
relevant for corpus analysis. A chunk of references
follows, blatantly including my thesis. Recent work has
used neural networks to "evaluate" authorship, albeit with
rather simple networks, looking at the Federalist Papers
as well as Shakespeare. The approach here has been to
compare the relative frequencies of particular words,
using ratios such as did/(did+do). Some work (Ledger's on
Plato) has also considered chronological change in an
author's style (he used multivariate analysis).
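
As a rough illustration of the did/(did+do) style of measure, here is a
minimal Python sketch. It is not taken from any of the cited work or from
System Quirk: the tokeniser, the function names (word_counts, ratio) and
the sample sentence are all assumptions made up for the example.

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count alphabetic word tokens (assumed tokeniser)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def ratio(counts, a, b):
    """Return count(a) / (count(a) + count(b)), or None if neither word occurs."""
    total = counts[a] + counts[b]
    return counts[a] / total if total else None

# Hypothetical sample text: "did" occurs 3 times, "do" twice.
sample = "I did not do it. She did what she could do, and they did too."
counts = word_counts(sample)
print(ratio(counts, "did", "do"))
```

A set of such ratios, one per word pair, gives a small numeric profile of
a text that can be compared across authors or fed to a classifier.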

As for tools, we use our own System Quirk to generate the
required numbers: in the current incarnation of the text
analysis tool, KonText, you can ask for quite complex
equations to be calculated (e.g. for Honore's vocabulary
richness function =100*log($tokens)/(1-$v1/$vocabulary) ).
System Quirk also allows add-on functions/services to be
defined, and we hooked in a neural net component so that
the net could be trained and queried directly from KonText.
The add-on lets you define a set of did/(did+do) type
patterns; it then gets their frequencies, builds an
appropriately sized neural network and trains it. In
testing mode the network can be used to say whether
the text being analysed is by the same author (or matches
whatever criterion the training texts were selected on).
We are planning a "teaching" version of System Quirk
under MS Windows soon - details haven't been finalised,
but it will probably comprise the Virtual Corpus Manager,
KonText and Browser/Refiner (the term bank/lexicon editor).
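
For readers without KonText, the Honore richness function quoted above
can be sketched in a few lines of Python. This is an illustration, not
the KonText implementation. It assumes $tokens is the number of word
tokens, $vocabulary the number of distinct word types, and $v1 the number
of types occurring exactly once, and it assumes a natural logarithm and a
very simple tokeniser. The function name honore_r is made up for the
example.

```python
import math
import re
from collections import Counter

def honore_r(text):
    """Compute 100*log(tokens)/(1 - v1/vocabulary) for a text.

    Returns None when the measure is undefined: an empty text, or a
    text in which every type occurs exactly once (v1 == vocabulary).
    """
    tokens = re.findall(r"[a-z']+", text.lower())   # assumed tokeniser
    counts = Counter(tokens)
    n_tokens = len(tokens)                           # $tokens
    vocabulary = len(counts)                         # $vocabulary
    v1 = sum(1 for c in counts.values() if c == 1)   # $v1: types seen once
    if n_tokens == 0 or v1 == vocabulary:
        return None
    return 100 * math.log(n_tokens) / (1 - v1 / vocabulary)

print(honore_r("the cat sat on the mat and the dog sat too"))
```

Note that the measure is undefined for very short texts in which no word
repeats, which is why the sketch returns None in that case rather than
dividing by zero.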

Paul.

---
Paul Holmes-Higgin             paul@inke.com
Language Engineering Manager
Information and Knowledge Environments (InKE) Ltd
40 Occam Road
Surrey Research Park
GUILDFORD  GU2 5YG
England        [ Tel: +44 1483 259744     Fax: +44 1483 259745 ]

====================================================== References:

Bailey, Richard W. (1979). Authorship attribution in a forensic setting. D.E. Ager, F.E. Knowles & J. Smith (Eds.). Advances in Computer-Aided Literary and Linguistic Research. Proc. 5th Intl. Symposium on Computers in Literary and Linguistic Research, Birmingham, 1978.

Brainerd, B. (1974). Weighing evidence in language and literature: a statistical approach. University of Toronto Press. Toronto.

Brainerd, B. (1988). Two models for the type-token relation with time dependent vocabulary reservoir. P. Thoiron, D. Serant & D. Labbe (Eds.). Vocabulary structure and lexical richness. Champion-Slatkine. Paris.

de Morgan, Sophia, E. (1882). Memoir of Augustus de Morgan by his wife Sophia Elizabeth de Morgan with selections from his letters. Longmans, Green, and Co. London.

Ellegard, A.A. (1962). A Statistical Method for Determining Authorship: The Junius Letters, 1769-1772. University of Gothenburg. Gothenburg.

Fucks, W. (1952). On the mathematical analysis of style. Biometrika 39. pp. 122-129.

Grayston, K. & Herdan, G. (1959). The authorship of the Pastorals in the light of statistical linguistics. New Testament Studies 6. pp. 1-15.

Holmes, David, I. (1994). Authorship attribution. Computers and the Humanities 28. pp. 87-106.

Holmes-Higgin, P.R. (1995). Text Knowledge: the Quirk Experiments. Ph.D. Thesis, Dept. Mathematical and Computing Sciences, University of Surrey, Guildford, England.

Honore, A. (1979). Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin 7. pp. 172-177.

Hubert, P. & Labbe, D. (1988). A model of vocabulary partition. Journal of the Association for Literary and Linguistic Computing 3. pp. 223-225.

Ledger, Gerard R. (1989). Re-counting Plato: A Computer Analysis of Plato's Style. Clarendon Press. Oxford.

Matthews, Robert & Merriam, Thomas V.N. (1993). Neural computation in stylometry I: an application to the works of Shakespeare and Fletcher. Literary and Linguistic Computing 8 (4). pp. 203-209.

Matthews, Robert & Merriam, Thomas V.N. (1994). Neural computation in stylometry II: an application to the works of Shakespeare and Marlowe. Literary and Linguistic Computing 9 (1). pp. 1-6.

Mendenhall, Thomas C. (1887). The Characteristic Curves of Composition. Science IX. pp. 237-249.

Ratkowsky, D.A. & Hantrais, L. (1975). Tables for comparing the richness and structure of vocabulary in texts of different lengths. Computers and the Humanities 9. pp. 69-75.

Sichel, H.S. (1974). On a distribution representing sentence-length in written prose. Journal of the Royal Statistical Society (A) 137. pp. 25-34.

Sichel, H.S. (1986). Word frequency distributions and type-token characteristics. Mathematical Scientist 11. pp. 45-72.

Tweedie, Fiona J., Singh, S. & Holmes, David I. (1994). Neural Network Applications in Stylometry: The Federalist Papers. Proceedings of the 3rd Conference on the Cognitive Science of Natural Language Processing, Dublin City University, Dublin, 7-8 July 1994.

Yule, George U. (1938). On sentence-length as a statistical characteristic of style in prose, with application to two cases of disputed authorship. Biometrika 30. pp. 363-390.

Yule, George U. (1944). The statistical study of literary vocabulary. Cambridge University Press. Cambridge.