Re: word frequency lists?

t-markl (t-markl@microsoft.com)
Tue, 28 Nov 95 16:34:17 PST

The recent discussions on word frequency lists have raised some
interesting points, although IMHO most of the disagreement seems
to stem from the different purposes supposed for word frequency lists.

Since the discussion has touched upon the use of 'balanced' corpora,
I have appended a relevant fragment from the draft of
my soon-to-be-completed dissertation. The thesis concerns statistical
language models, in particular models for compound noun disambiguation.

The fragment is sufficiently long to warrant a two-sentence summary:
For the purpose of training statistical language models, register variation
in the training data (like that found in balanced corpora) is generally
detrimental to performance, unless it can be captured by the model.
Since for most statistical models, we do not have access to sufficient
data to allow the model to condition on register, the best practice in
the short term is to train and test on text from a single register.

Naturally, any comments are appreciated (either by email
to t-markl@microsoft.com or to the list if they are of general
interest).

Best wishes to all,
Mark Lauer
Sydney, Australia

_______________________________________________________________

%
% This is a DRAFT only
%

\subsection{A Note on Register}
\label{sec:md_register}

The register of a text is a categorisation according to the
medium and situation in which it appears, and the importance of register
to linguistic phenomena did not escape the earliest corpus builders.
When Francis and Ku\v{c}era~(1982) built the Brown corpus in the
early sixties, they were very careful to compose it from a wide range of
sources. Since then a very large range of corpora of different kinds have
become available, some \scare{balanced} (for example, Brown and
\acronym{lob}), others pure (including AP Newswire and Grolier's encyclopedia)
and still others collected without particular regard to balancing sources
(the Penn Treebank is an instance, see Marcus~\etal,~1993).

Biber~(1993) has made an extensive study of cross-register variation in
language. For example, he has shown that two subsets of the
\acronym{lob} corpus, one collected from fictional works and the other from
expositions, differ markedly in the distribution of
grammatical categories of many words. The word \lingform{given} is
used passively in~71\% of its uses within exposition, but only in~19\% of
its uses within fiction.
As Biber argues, this has \shortquote{important implications for
probabilistic taggers and parsing techniques, which depend on accurate
estimates of the relative likelihood of grammatical categories in
particular contexts}~(Biber,~1993:222). He also argues that
\shortquote{analyses must be based on a diversified corpus representing
a wide range of registers in order to be appropriately generalized to the
language as a whole, as in $\ldots$ a general purpose tagging
program}~(Biber,~1993:220). For linguistic and lexicographic purposes
it seems clear that so-called balanced corpora are required for
completeness; however, for the purposes of statistical language learning,
the moral is not so clear.

It certainly appears that cross-register variation is significant enough to
influence the performance of probabilistic taggers, so it is crucial that we
pay attention to it. However, it is not sufficient merely to use a balanced
corpus. Consider, for example, training a unigram tagger on a corpus
consisting of two parts, one fictional and the other from expositions. The
probability estimate for the passive tagging of the word \lingform{given}
will be based on training examples in both registers. If the word is
equally frequent in each half, then the expected estimate of this
probability is~0.45 (using the percentages above). This estimate is
wildly incorrect for both halves of the training corpus, and will likely
result in poor performance on this word. While an accuracy rate
of up to~76\% (on tagging the word \lingform{given}) is possible
if the unigram tagger is trained and then applied to each half
of the corpus separately, the best possible performance when
trained on the whole corpus is only~55\%.\footnote{These figures
are only upper bounds because they (falsely) assume that \lingform{given}
has only two possible tags.}
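
% Worked arithmetic for the figures above. Illustrative addition: the symbols
% p_E, p_F and \hat{p} are notation assumed here, not part of the original draft.
The figures above can be reconstructed as follows, under the assumptions
already stated: \lingform{given} has only two tags (passive and non-passive),
it is equally frequent in each half, and a unigram tagger always assigns a
word its most probable tag. Write $p_E = 0.71$ and $p_F = 0.19$ for the
proportion of passive uses in exposition and in fiction respectively (this
notation is introduced here only for illustration). The pooled estimate is
\[
  \hat{p} = \frac{p_E + p_F}{2} = \frac{0.71 + 0.19}{2} = 0.45 ,
\]
so a tagger trained on the whole corpus always chooses the non-passive tag,
and is correct in at most
\[
  1 - \hat{p} = \frac{(1 - 0.71) + (1 - 0.19)}{2} = 0.55
\]
of cases. Trained on each half separately, the tagger chooses the passive tag
in exposition and the non-passive tag in fiction, giving at most
\[
  \frac{p_E + (1 - p_F)}{2} = \frac{0.71 + 0.81}{2} = 0.76 .
\]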

This example demonstrates that the notion of an average English may
well be a chimera and those who attempt to acquire it should
beware.\footnote{Even linguists should take some care. Clear~(1992)
argues a similar point for corpus-based studies in a lucid paper on the
abstract notion of the composition of language.} An examination of the
research literature supports this view, with most of the highly successful
statistical language learning results having been achieved
using register-pure corpora. We must recognise that the parameters
of probabilistic models are generally dependent on the type of text
being modelled. If statistical models are to yield an accurate
picture of language, then separate distributions must be maintained
for different registers. In practice, however, this is going to
be difficult. We are already struggling to find sufficient data to train
probabilistic models; dividing the corpus further will only exacerbate the
problem.

In the short term, there is a better way to proceed: choose one particular
register, train using data only from that register, and accept that the
resulting systems are only (reliably) applicable to the same register. For
this, we need large register-pure corpora, which, luckily, are currently
available, at least for research. In the current work, encyclopedia entries
are used for both training and testing. In my view the use of a corpus of
uniform register is the correct method for applying probabilistic
modelling. Naturally, it is not possible to guarantee that the results will
be equally useful for processing a different register.

\bibitem{} Biber, Douglas
\newblock 1993.
\newblock Using Register-Diversified Corpora for General Language Studies.
\newblock {\em Computational Linguistics {\bf Vol. 19(2)}}, pp219-241.

\bibitem{} Clear, Jeremy
\newblock 1992.
\newblock Corpus Sampling.
\newblock In Leitner, Gerhard (ed.), {\em New Directions in English
Language Corpora}.
\newblock Mouton de Gruyter, Berlin.

\bibitem{} Francis, W.N. and Ku\v{c}era, H.
\newblock 1982.
\newblock {\em Frequency Analysis of English Usage: Lexicon and Grammar}.
\newblock Houghton-Mifflin, Boston.

\bibitem{} Marcus, M., Marcinkiewicz, M. A. and Santorini, B.
\newblock 1993.
\newblock Building a Large Annotated Corpus of English: The Penn Treebank.
\newblock {\em Computational Linguistics {\bf Vol. 19(2)}}, pp313-330.