Corpora: Sensible sizes for specialist corpora

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Wed, 21 Jul 1999 10:24:26 +0100

Mick Short writes:

> thus giving a total of 240,000 words...
>
> First questions: Is that in general terms adequate? Could it be any smaller ...
>
> The second general issue is that it presumably takes different sizes of corpus
> to establish different sorts of claims...

Very interesting and much under-researched questions. Most of the
items on your list are 'grammatical' rather than 'lexical', so that helps.
The critical consideration is, for each
of the things you want to count, how many instances there are.
Consider eg

> abstract entities as subjects to dynamic verbs

If you have fewer than, say, 50 instances of this, it would be rash to
draw any conclusions. Also, for the conclusion to be of interest, you
would need the instances NOT to all come from the same sample, as then
it would only tell you about that sample, not the genre.
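
A minimal sketch of that check, assuming you have already tagged each
instance of a feature with the sample it came from. The function name
and the toy data are hypothetical, and the threshold of 50 is only the
"say, 50" rule of thumb above, not a statistical result.

```python
from collections import Counter

MIN_INSTANCES = 50  # the "say, 50" rule of thumb; adjust to taste


def enough_evidence(sample_ids, min_instances=MIN_INSTANCES):
    """Return True if a feature has enough instances to be worth analysing.

    sample_ids holds one entry per instance found, naming the sample
    (eg the novel extract) that the instance came from.
    """
    per_sample = Counter(sample_ids)
    total = sum(per_sample.values())
    # Enough instances overall, and not all from a single sample.
    return total >= min_instances and len(per_sample) > 1


# Toy data: 55 instances spread over two samples, then 60 from one sample.
spread = ["novel_A"] * 30 + ["novel_B"] * 25
concentrated = ["novel_A"] * 60

print(enough_evidence(spread))
print(enough_evidence(concentrated))
```

The second call fails the check despite having more instances, because
they all come from one sample. A stricter version could also require
that no single sample supplies most of the instances.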

Also, the required sample size depends critically on the homogeneity
of the corpus. The more varied, the more data you need. Yours will
be all fiction, so far more homogeneous
than (say) the BNC, and that helps, but fiction isn't per se very
homogeneous and you have a very complex sampling frame, with

- author gender (2)
- decade of writing (5)
- beginning/middle/end (3)
- serious/popular (2)

as parameters. You only have a sample size of 1 for each
subdivision of your sampling space. This looks problematic to me. An
awful lot will hang on which book you choose.
While both 2000 wds/sample and 240,000 words total sound fine, it
looks to me that you might want to think about the number of
parameters, and drop some, making the sample space simpler. If, say,
you only looked at the 50s and 90s, and only at female authors, you
would have 5 novels for each cell in the sampling space, which rings
some bells for me as the minimum acceptable from a statistics
perspective. You would
get a much smaller set of statistical hypotheses to test (eg "50s vs
90s", not "50s vs 60s, 50s vs 70s, ...") but, for
each of the ones you could test, you would have a much higher chance
of having enough evidence to say something valid. The student should
have a long think about which hypotheses she is most interested in.
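
The arithmetic behind this can be sketched as follows. Two assumptions
are mine: that beginning/middle/end is sampled within each novel rather
than being a parameter for choosing novels, and that there are 20
novels in total (one per gender/decade/serious-popular cell, which is
what makes "5 novels for each cell" come out). The variable and
function names are illustrative only.

```python
from math import comb, prod

# Parameters used to choose novels, with their number of levels.
# Assumption: beginning/middle/end is sampled within a novel, so it
# is left out of the between-novel design.
full_design = {"author gender": 2, "decade": 5, "serious/popular": 2}

# Reduced design: female authors only, 50s and 90s only.
reduced_design = {"author gender": 1, "decade": 2, "serious/popular": 2}


def n_cells(design):
    """Number of cells in the sampling space: product of the levels."""
    return prod(design.values())


# Assumption: one novel per cell of the full design.
n_novels = n_cells(full_design)

for name, design in [("full", full_design), ("reduced", reduced_design)]:
    cells = n_cells(design)
    # Pairwise decade comparisons available as hypotheses to test.
    decade_pairs = comb(design["decade"], 2)
    print(name, cells, n_novels // cells, decade_pairs)
```

Under those assumptions the full design has 20 cells with 1 novel each
and 10 pairwise decade comparisons, while the reduced design has 4
cells with 5 novels each and a single decade comparison.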

I've written some stuff on comparing corpora, which you could use to
test, eg, whether the 1950s-serious stuff was more like the
1950s-popular or the 1990s-serious (and how homogeneous each was).
See my home page (below) or email me if you're interested.

Adam Kilgarriff

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow tel: (44) 1273 642919
Information Technology Research Institute (44) 1273 642900
University of Brighton fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ email: Adam.Kilgarriff@itri.bton.ac.uk
UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%