Re: Corpora: Sensible sizes for specialist corpora

eric@scs.leeds.ac.uk
Wed, 21 Jul 1999 08:04:49 +0100

Hello Mick,
this sounds like an interesting project. A couple of "first thing in the
morning" observations, which may not be of any value...
i) Geoff Leech, as you probably know, ran a semantic-tagging project a while
back; this semantic word-sense tagger may be relevant, though it won't
do everything you want
ii) similarly, your student might consider whether POS-tagging is likely to
have any relevance at all; I can't see this helping much, but if you
*can* see a use for POS-tagging then try our email tagging service:
mail plain text to amalgam-tagger@scs.leeds.ac.uk or see
http://www.scs.leeds.ac.uk/amalgam/amalgam/amalgsoft.html
iii) (I can't count) to judge whether 240K (or any size) is "adequate",
why not try analysis of a much smaller subset, then progressively add
more text to your corpus and chart cumulative results to see if there is
a point where adding more text to your corpus does not change findings
significantly? For example, most of the POS-taggers offered by
amalgam-tagger were trained on c. 50,000-word subsets of the corpora, because
we noticed empirically that tagger accuracy didn't seem to improve much
with larger training sets. Of course this is less easy to measure in your
case - your description of the project seems to be "first collect the data,
then decide how to analyse, then analyse, then draw conclusions from results";
to do this on progressively larger subsets of the corpora, you have to
decide on the analysis early on in the project and stick to it through
progressive cycles.
iv) sources - Lancaster presumably already has LOB (60s) and FLOB (modernish
equivalent of LOB), pity you can't reuse 2000-word samples from these
(they aren't beginning, middle and end of a novel).
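
To make point iii) concrete, here is a rough sketch of the "add text
progressively and watch for a plateau" idea. It is only an illustration:
the measure used (relative word frequencies) is my assumption, standing in
for whatever analysis the project settles on, and the function names, step
size and threshold are all hypothetical.

```python
from collections import Counter


def cumulative_profile(tokens, step):
    """Relative word frequencies for growing prefixes of the token list.

    Returns a list of (subset_size, {word: relative_frequency}) pairs,
    one per prefix, each prefix being `step` tokens longer than the last.
    """
    counts = Counter()
    profile = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + step]
        counts.update(chunk)
        total = start + len(chunk)
        profile.append((total, {w: c / total for w, c in counts.items()}))
    return profile


def change(prev, curr):
    """Total absolute difference between two relative-frequency tables."""
    words = set(prev) | set(curr)
    return sum(abs(prev.get(w, 0.0) - curr.get(w, 0.0)) for w in words)


def plateau_size(tokens, step, threshold):
    """Smallest subset size after which one more step of text changes the
    measure by less than `threshold`; None if that never happens."""
    profile = cumulative_profile(tokens, step)
    for (size_a, freq_a), (_, freq_b) in zip(profile, profile[1:]):
        if change(freq_a, freq_b) < threshold:
            return size_a
    return None
```

In practice one would chart change() against subset size and look at the
whole curve rather than trust the first dip below a threshold, and the
analysis has to be the same one on every cycle for the comparison to mean
anything.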

Regards,
Eric

Eric Atwell, Distributed Multimedia Systems MSc Tutor, SOCRATES Coordinator,
and Director, Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335430 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/scs/public/staff/eric.html