Re: Corpora: Bilogarithmic type/token ratio

eric@scs.leeds.ac.uk
Mon, 15 Sep 1997 11:43:58 +0100

Alice,
your query caught my attention, because we are looking for similar metrics
for a very different application: assessing an unknown signal (e.g. from
outer space, of extra-terrestrial origin) to determine whether it constitutes
`language', and trying to delimit and recognise equivalents of characters,
words, phrases/sentences. At the simplest level, to determine whether
a binary sequence encodes characters, and how many bits are used for each
`character', it seems useful to look for a Zipfian type-token
distribution: we know this holds in English text for characters,
graphemes, stem-forms or lemmas, parts-of-speech, and grammatical substructures.
I suggest Zipfian type-token distributions should also hold at these levels
for other languages, EXCEPT that for highly-inflected languages the grapheme
curve will be shallower. So, to compare the complexity of English and
Swedish text directly, you should compare stemmed or lemmatised text samples
rather than `raw' grapheme sequences.
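
To make the bit-width idea concrete, here is a rough sketch only, not a
description of any existing tool. It assumes the signal is already available
as a string of '0' and '1' characters, and it takes the least-squares slope
of log(frequency) against log(rank) as a crude indicator of how Zipfian a
segmentation looks (a slope near -1 being the classic Zipf case). The
function names chunk_bits, rank_frequency, zipf_slope and profile are made up
for illustration.

```python
import math
from collections import Counter


def chunk_bits(bits, width):
    """Split a '0'/'1' string into consecutive width-bit symbols.

    Trailing bits that do not fill a whole symbol are dropped.
    """
    usable = len(bits) - len(bits) % width
    return [bits[i:i + width] for i in range(0, usable, width)]


def rank_frequency(tokens):
    """Return the token counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Returns None when there are fewer than two types, since no line
    can be fitted in that case.
    """
    if len(counts) < 2:
        return None
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def profile(bits, widths):
    """Map each candidate width to (types, tokens, fitted slope)."""
    result = {}
    for width in widths:
        counts = rank_frequency(chunk_bits(bits, width))
        result[width] = (len(counts), sum(counts), zipf_slope(counts))
    return result
```

Calling profile(bits, range(4, 17)) on a long enough sample would give one
(types, tokens, slope) triple per candidate width. Whether the `right' width
really stands out by its slope is an assumption to be tested on known text,
not something this sketch establishes. The same rank_frequency and zipf_slope
functions could equally be applied to lists of lemmas or part-of-speech tags.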

Is there anyone else out there using corpus-linguistics in the search for
extra-terrestrial language? If so, please CONTACT me ...

Eric Atwell, Centre for Computer Analysis of Language And Speech
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335761 FAX: (44)113-2335468
WWW: http://agora.leeds.ac.uk/scs/public/staff/eric.html