Corpora: Summary: Bilog type/token ratio

Alice Carlberger (alice@speech.kth.se)
Tue, 21 Oct 1997 13:57:34 +0200

Dear Corpora List Members:

Here is a summary of responses to the following query (from Sept 12)
regarding the bilogarithmic type/token ratio. Many thanks to Patrick Juola,
Alex Chengyu Fang, Donghoon Van Uytsel, Adam Kilgarriff, Bill Fisher,
Marie E. Helt, Bob Krovetz, and Eric Atwell.

Best regards,

Alice Carlberger

ORIGINAL QUERY:
>As part of an effort to standardize the cross-product and cross-linguistic
>testing of word predictors, we are trying to build a multi-lingual test text
>corpus with texts of the same degree of complexity for each language. It seems
>that one possible measure of complexity would be the bilogarithmic type/token
>ratio, described in G. Herdan's "Type-Token Mathematics" and Henry Kucera and
>W. Nelson Francis' "Computational Analysis of Present-Day American English".
>And now I am wondering whether anyone could help us to figure out how (if
>possible) to use this ratio for cross-linguistic comparison, in our case
>especially the comparison between languages of different degrees of inflection,
>e.g., English (little inflection) and Swedish (relatively high degree of
>inflection). Or could anyone suggest other measures of complexity, i.e., style,
>that are more appropriate for cross-linguistic use? Any help in this matter
>would be greatly appreciated.
>
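
For concreteness, here is a minimal sketch of the bilogarithmic type/token
ratio, i.e. log(number of types) / log(number of tokens); the crude
word-character tokenisation and lower-casing are illustrative assumptions,
not part of Herdan's or Kucera & Francis' definitions.

import math
import re

def bilog_type_token_ratio(text):
    """Bilogarithmic type/token ratio: log(types) / log(tokens).

    Tokenisation is a crude regex split plus lower-casing, an
    assumption made for illustration only.
    """
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return math.log(len(set(tokens))) / math.log(len(tokens))

if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(round(bilog_type_token_ratio(sample), 3))

Because both counts are logged, the ratio drifts less with sample size than
the raw type/token ratio, which is what makes it attractive as a rough
complexity measure.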

SUMMARY OF RESPONSES:

Patrick Juola <patrick.juola@psy.ox.ac.uk>

Used Kolmogorov complexity as the basis for his metric.

Juola, Patrick. 1997. Measuring Linguistic Complexity: The
Morphological Tier. In Proceedings of the Third International
Conference on Quantitative Linguistics (QUALICO-97), Aug. 26-29,
Helsinki, Finland. pp. 98-99.

He should have a longer version out within a month for the Journal
of Quantitative Linguistics.
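
Kolmogorov complexity itself is uncomputable, so metrics like Juola's
approximate it in practice. A common stand-in (used here purely for
illustration, not as Juola's published formulation) is the size of the text
under a general-purpose compressor:

import zlib

def compression_ratio(text, encoding="utf-8"):
    """Compressed size over raw size as a rough complexity proxy.

    zlib stands in for the uncomputable Kolmogorov complexity;
    this is an illustrative approximation, not Juola's metric.
    """
    raw = text.encode(encoding)
    return len(zlib.compress(raw, 9)) / len(raw)

if __name__ == "__main__":
    print(round(compression_ratio("the cat sat on the mat " * 40), 3))

Highly repetitive or predictable text compresses well and scores low;
text with many distinct forms tends to compress less and scores higher.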

-----------
Alex Chengyu Fang <alex@phonetics.ucl.ac.uk>

During the development of an efficient multi-lingual text alignment
algorithm, they found that the number of verbs seemed to be a fairly
reliable constant across European-language sentence pairs that are
mutual translations. The indication came from English and Portuguese. See

J. Campbell, N. Chatterjee, A.C. Fang, and M. Manela. 1996. Improving
Automated Alignment in Parallel Corpora. In Language, Information and
Computation: PACLIC 11, ed. by B-S. Park and J-B. Kim. Language Education
and Research Institute, Kyung Hee University, Seoul, Korea. pp. 63-72.

"What could be interesting is to investigate whether variations in the=
number=20
of verbs across two different texts (in two different European languages)=20
could be used as a pointer towards text complexity, though we do know that=
=20
this parameter indicates different genres of speech and writing in English.
See, for instance,

Fang, A.C. 1995. The distribution of infinitives in Contemporary British=20
English - a study based on the British ICE Corpus. In Oxford Literary &=20
Linguistic Computing, 10:4. pp 247-257.

Fang, A.C. Forthcoming. Verb Forms and Subcategorisations. In Oxford=20
Literary and Linguistic Computing, 12:4.

There are, of course, many other better measures but automatic processing=20
for part-of-speech information (tagging) is now relatively easy and=20
inexpensive to achieve. "
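
As a rough illustration of the tagging-based verb counts Fang mentions,
here is a sketch using the NLTK tagger; the tagger choice and the Penn
Treebank "VB*" tag test are assumptions made for illustration, not part
of the cited work.

import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

def verb_count(sentence):
    """Count tokens whose Penn Treebank tag starts with 'VB'."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return sum(1 for _, tag in tagged if tag.startswith("VB"))

if __name__ == "__main__":
    print(verb_count("The alignment algorithm seemed to work fairly well."))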

--------

"D.H. Van Uytsel" <Donghoon.VanUytsel@esat.kuleuven.ac.be>
D.H. Van Uytsel (016)32.1859 http://www.esat.kuleuven.ac.be/~donghoon

Is planning to do a comparative study between written English and Dutch,
from the (engineering) viewpoint of stochastic modeling for speech
recognition, starting in 1998.

"So far, I've not yet browsed the literature about this carefully. I am
convinced that there is no easy measure for "text complexity". Most
straightforward would be to take humans (native speakers for each
language), and record scores of a repeated "guess-which-word-follows"
game.

In large-vocabulary speech-recognition research, the measure "perplexity"*
is quite common. It is a good indicator of how difficult a certain
domain (language/application/vocabulary) is to recognize automatically. The
disadvantage is that we don't know yet how it relates to the genuine
perplexity (I mean, for humans), and which factors influence its
behaviour. Also, it is always a measure with respect to a certain language
model (e.g. a back-off word trigram)."

*F. Jelinek and R.L. Mercer, Interpolated Estimation of Markov Source
Parameters from Sparse Data, in E.S. Gelsema and L.N. Kanal (eds),
"Pattern Recognition in Practice", North-Holland, Amsterdam, 1980.

------

Adam.Kilgarriff@itri.brighton.ac.uk (Adam Kilgarriff)
http://www.itri.bton.ac.uk/~Adam.Kilgarriff

Kilgarriff's review of Doug Biber's 1995 book would serve as a good basis
for finding equivalent genres.

"Using word frequency lists to measure corpus homogeneity and
similarity between corpora" has just appeared in Proceedings of the
Workshop on Very Large Corpora, China, Aug 97. See his homepage.

"One candidate approach would be to use perplexity (there's a set of
tools at CMU that do all the sums for you and look quite easy to
install). I don't have a clear intuition about whether this would
address the inflections issue, it might well."

He also mentioned the Oslo/Bergen project on parallel and
'stylistically matched' corpora.

His own work on measures for corpus similarity:
"I've only looked at things monolingually but I can envisage ways of
adapting it to a cross-lingual measure."
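
Here is a sketch in the spirit of comparing word frequency lists; the
chi-square-style statistic over the most frequent shared word forms is an
illustrative choice, not a reimplementation of the method in the paper
above.

import re
from collections import Counter

def chi2_distance(text_a, text_b, n=500):
    """Chi-square-style statistic over the combined top-n word forms.

    Lower values mean the two samples have more similar word
    frequency profiles.  Illustrative sketch only.
    """
    fa = Counter(re.findall(r"\w+", text_a.lower()))
    fb = Counter(re.findall(r"\w+", text_b.lower()))
    na, nb = sum(fa.values()), sum(fb.values())
    words = {w for w, _ in (fa + fb).most_common(n)}
    chi2 = 0.0
    for w in words:
        oa, ob = fa[w], fb[w]
        ea = (oa + ob) * na / (na + nb)  # expected count in sample A
        eb = (oa + ob) * nb / (na + nb)  # expected count in sample B
        chi2 += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
    return chi2

if __name__ == "__main__":
    print(chi2_distance("the cat sat on the mat", "a dog lay on a rug"))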

------

Bill Fisher <william.fisher@nist.gov>

Suggests using the "corpus perplexity" (aka "test set perplexity"),
[which is] "a measure of complexity that is very popular among
researchers in automatic speech recognition, since it's pretty
straightforward to calculate and correlates strongly with the
percentage of errors that speech recognizers generally make.
While it's usually used to measure how good a statistical
language model is at predicting the word strings in a test
set of sentences (a corpus), if you hold the language model
constant, it can also be used to calibrate the complexity of
the corpus."

"Roughly speaking, it's the average number of word choices
the language model allows you when recognizing (or building)
the sentences of a corpus, modeling your actions as "first
pick the first word; then, given that, pick the second; then,
given that, pick the third ...". It's always calculated
relative to a given language model, which is typically a
statistical 2-gram or 3-gram Markovian one."
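
A toy version of that calculation, assuming an add-one-smoothed bigram
model trained on a few whitespace-tokenised sentences (the CMU/Cambridge
toolkit mentioned below does this properly at scale); everything here is
illustrative only.

import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts for an add-one-smoothed bigram LM."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams, len(vocab)

def perplexity(sentences, model):
    """Test-set perplexity: exp of the average negative log word probability."""
    unigrams, bigrams, v = model
    log_prob, n = 0.0, 0
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(words[:-1], words[1:]):
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + v)  # add-one smoothing
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

if __name__ == "__main__":
    model = train_bigram(["the cat sat on the mat", "the dog sat on the rug"])
    print(round(perplexity(["the cat sat on the rug"], model), 2))

Holding the trained model fixed and swapping in different test corpora,
as Fisher describes, turns the same number into a rough calibration of
corpus complexity.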

"There's a discussion of it in the recent book "Corpus-Based
Methods in Language and Speech Processing", ed. Steve Young
and Gerrit Bloothooft, Kluwer, 1997, ISBN 0-7923-4463-4,
p. 178 ff. And a handy toolkit to calculate it (and the
statistical LM that it needs) is available from CMU and
Cambridge; see http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html."

"One disadvantage of it is that it requires a large corpus
of sentences to train up the language model that is used
in its calculation. But on the other hand, many such
corpora have recently been made available by the LDC.
For doing cross-language work, you would have to try to
derive statistical language models that represent the
languages about equally well. You might think that another
disadvantage is the fact that the brain-dead Markovian
language modeling can't get at the real essence of the
language, but you'd be surprised how much of the syntactic,
semantic, and pragmatic constraint is captured in a few
words of immediate context."

-------
Marie E. Helt <meh2@dana.ucc.nau.edu>

Biber (1995) "does a factor analysis on the frequency of a wide range of=20
linguistic features in Somali, Korean, Tuvaluan and English. Differences=20
among the languages vary, and are shown along five dimensions of variation.=
=20
This approach might help you identify which linguistic features co-occur=20
in more complex texts across languages."

-----
Bob Krovetz <krovetz@research.nj.nec.com>

Gustav Herdan (Lecturer in Statistics, University of Bristol). 1960.
Type-Token Mathematics: A Textbook of Mathematical Linguistics. Janua
Linguarum, Studia Memoriae Nicolai van Wijk Dedicata (edenda curat
Cornelis H. van Schooneveld, Leiden), Series Maior IV. Mouton & Co.,
's-Gravenhage.

------
Eric Atwell <eric@scs.leeds.ac.uk> =20
Centre for Computer Analysis of Language And Speech
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
TEL: (44)113-2335761 FAX: (44)113-2335468
WWW: http://agora.leeds.ac.uk/scs/public/staff/eric.html

..."we are looking for similar metrics
for a very different application: assessing an unknown signal (e.g. from=20
outer space, of extra-terrestrial origin) to determine whether it=
constitutes
`language', and trying to delimit and recognise equivalents of characters,
words, phrases/sentences. At the simple level of determining whether=20
a binary sequence encodes characters, and how many bits are used for each
`character', it seems useful to look for a Zipfian type-token=20
distribution: we know this holds with English text for characters,=20
graphemes, stem-forms or lemmas, parts-of-speech, and grammatical=
substructures.
I suggest Zipfian type-token distributions should also hold at these levels
for other languages, EXCEPT that for highly-inflected languages the grapheme
curve will be shallower. So, to directly compare complexity of English and
Swedish text, you should compare word-stemmed or lemmatised text samples
instead of `raw' grapheme-sequences.=20
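
A minimal way to check for the Zipfian shape Atwell describes is to fit
the slope of log(frequency) against log(rank); the word-level tokenisation
here is an assumption, and the same routine could equally be run over
characters, stems or lemmas.

import math
import re
from collections import Counter

def zipf_slope(text):
    """Least-squares slope of log(frequency) vs. log(rank) for word forms.

    A slope near -1 is the classic Zipfian picture; a shallower
    (less negative) slope over raw inflected forms is the effect
    described above.  Tokenisation is illustrative only.
    """
    freqs = sorted(Counter(re.findall(r"\w+", text.lower())).values(),
                   reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den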

-------

----------------------------------------------------------------------------
Alice Carlberger E-mail: alice@speech.kth.se
KTH (Royal Institute of Technology) Phone: +46 8 790 75 62
TMH (Dept. of Speech, Music and Hearing) Fax: +46 8 790 78 54
Drottning Kristinas väg 31
S-100 44 Stockholm
Sweden
---------------------------------------------------------------------------