Re: Corpora: Bilogarithmic type/token ratio

Bill Fisher (william.fisher@nist.gov)
Fri, 12 Sep 1997 09:26:32 -0400

On Sep 12, 8:56am, Alice Carlberger wrote:
> Subject: Corpora: Bilogarithmic type/token ratio
> Dear Corpora List subscribers:
>
> As part of an effort to standardize the cross-product and cross-linguistic
> testing of word predictors, we are trying to build a multi-lingual test text
> corpus with texts of the same degree of complexity for each language. It
> seems that one possible measure of complexity would be the bilogarithmic
> type/token ratio, described in G. Herdan's "Type-Token Mathematics" and
> Henry Kucera and W. Nelson Francis' "Computational Analysis of Present-Day
> American English". And now I am wondering whether anyone could help us to
> figure out how (if possible) to use this ratio for cross-linguistic
> comparison, in our case especially the comparison between languages of
> different degrees of inflection, e.g., English (little inflection) and
> Swedish (relatively high degree of inflection). Or could anyone suggest
> other measures of complexity, i.e., style, that are more appropriate for
> cross-linguistic use? Any help in this matter would be greatly appreciated.
>
...

May I suggest that you consider the "corpus perplexity"
(aka "test set perplexity")? That's a measure of complexity
that is very popular among researchers in automatic speech
recognition, since it's pretty straightforward to calculate
and correlates strongly with the percentage of errors that
speech recognizers generally make. While it's usually used
to measure how good a statistical language model is at
predicting the word strings in a test set of sentences
(a corpus), if you hold the language model constant, it
can also be used to calibrate the complexity of the corpus.

Roughly speaking, it's the average number of word choices
the language model allows you when recognizing (or building)
the sentences of a corpus, modeling your actions as "first
pick the first word; then, given that, pick the second; then,
given that, pick the third ...". It's always calculated
relative to a given language model, which is typically a
statistical 2-gram or 3-gram Markovian one.
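
To make that concrete: if the model assigns a probability to each of
the N words of the test corpus given its immediate history, the
perplexity is 2 raised to minus the average log2 of those
probabilities. Here is a tiny sketch of the arithmetic in Python; the
bigram table and the "test corpus" are invented purely for
illustration and have nothing to do with any real model or toolkit:

    import math

    # Toy bigram model: P(word | previous word). These numbers are
    # made up; a real model would be estimated from a large training
    # corpus, with smoothing for unseen bigrams.
    bigram_prob = {
        ("<s>", "the"): 0.5, ("<s>", "a"): 0.5,
        ("the", "cat"): 0.25, ("the", "dog"): 0.75,
        ("a", "cat"): 0.5, ("a", "dog"): 0.5,
        ("cat", "sat"): 1.0, ("dog", "sat"): 1.0,
        ("sat", "</s>"): 1.0,
    }

    # A toy "test corpus" of two sentences.
    test_corpus = [
        ["the", "cat", "sat"],
        ["a", "dog", "sat"],
    ]

    def corpus_perplexity(sentences, probs):
        """Perplexity = 2 ** (-(1/N) * sum of log2 P(w_i | w_i-1))."""
        log_prob_sum = 0.0
        word_count = 0
        for sentence in sentences:
            words = ["<s>"] + sentence + ["</s>"]
            for prev, word in zip(words, words[1:]):
                log_prob_sum += math.log2(probs[(prev, word)])
                word_count += 1
        return 2 ** (-log_prob_sum / word_count)

    print(corpus_perplexity(test_corpus, bigram_prob))

The harder the text is to predict, the smaller the per-word
probabilities and the higher the perplexity comes out; that is the
sense in which it measures the "average number of word choices".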

There's a discussion of it in the recent book "Corpus-Based
Methods in Language and Speech Processing", ed. Steve Young
and Gerrit Bloothooft, Kluwer, 1997, ISBN 0-7923-4463-4,
p. 178 ff. And a handy toolkit to calculate it (and the
statistical LM that it needs) is available from CMU and
Cambridge; see http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html.

One disadvantage of it is that it requires a large corpus
of sentences to train up the language model that is used
in its calculation. But on the other hand, many such
corpora have recently been made available by the LDC.
For doing cross-language work, you would have to try to
derive statistical language models that represent the
languages about equally well. You might think that another
disadvantage is the fact that the brain-dead Markovian
language modeling can't get at the real essence of the
language, but you'd be surprised how much of the syntactic,
semantic, and pragmatic constraint is captured in a few
words of immediate context.
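
To illustrate the calibration workflow (this is only a sketch; the
little training and test "corpora" below are invented, and a crude
add-one-smoothed bigram model stands in for what you would really
estimate with the toolkit mentioned above), you would train one model
per language on comparable training text and then compare the
perplexities the matching models assign to candidate test texts:

    import math
    from collections import Counter

    def train_bigram(sentences):
        """Add-one-smoothed bigram model from tokenized sentences (toy)."""
        bigrams, unigrams, vocab = Counter(), Counter(), set()
        for s in sentences:
            words = ["<s>"] + s + ["</s>"]
            vocab.update(words)
            for prev, word in zip(words, words[1:]):
                bigrams[(prev, word)] += 1
                unigrams[prev] += 1
        V = len(vocab)
        def prob(prev, word):
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
        return prob

    def perplexity(sentences, prob):
        log_sum, n = 0.0, 0
        for s in sentences:
            words = ["<s>"] + s + ["</s>"]
            for prev, word in zip(words, words[1:]):
                log_sum += math.log2(prob(prev, word))
                n += 1
        return 2 ** (-log_sum / n)

    # Invented mini-corpora standing in for English and Swedish
    # training and candidate test texts; in practice these would be
    # large corpora of comparable material in each language.
    train_en = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    train_sv = [["katten", "satt"], ["hunden", "satt"]]
    test_en = [["a", "cat", "sat"]]
    test_sv = [["katten", "satt"]]

    en_model, sv_model = train_bigram(train_en), train_bigram(train_sv)
    print("EN test perplexity:", perplexity(test_en, en_model))
    print("SV test perplexity:", perplexity(test_sv, sv_model))

Candidate test texts whose perplexities come out in roughly the same
range under comparably trained models would then be reasonable picks
for a matched multi-lingual test set.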

- Bill Fisher