Corpora: Sensible sizes for specialist corpora

Short, Mick (m.short@lancaster.ac.uk)
Mon, 19 Jul 1999 14:17:10 +0100

I have a PhD student who wants to establish for her thesis a small corpus of
writings from serious fiction and popular fiction in order to investigate
whether the claims made by critics about the linguistic differences between the
two genres have any basis in reality. Our current intention is (1) to establish
a corpus of two serious and two popular fiction novels (with an equalised
division between male and female authors) for each of the five decades from the
1950s to the 1990s (a total of 20 novels) and (2) to take three 2,000-word
samples from each (from the beginning, middle and end of each), thus giving a
total of 240,000 words. This matches roughly the size of my own speech, thought
and writing presentation corpus and Geoffrey Sampson's Susanne corpus.
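
To make the sampling scheme concrete, here is a minimal sketch in Python of how
the three samples might be cut from one novel. It assumes the novel is already
available as a list of word tokens (how words are counted is left open), and
the function name three_samples and the choice of a centred middle sample are
illustrative assumptions, not a settled procedure.

```python
def three_samples(tokens, sample_size=2000):
    """Cut beginning, middle and end samples of sample_size tokens each.

    tokens is assumed to be the whole novel as a list of word tokens.
    The middle sample is centred on the midpoint of the text (an assumption).
    """
    n = len(tokens)
    if n < 3 * sample_size:
        # Shorter texts cannot yield three non-overlapping samples.
        raise ValueError("text too short for three non-overlapping samples")
    mid_start = (n - sample_size) // 2
    return {
        "beginning": tokens[:sample_size],
        "middle": tokens[mid_start:mid_start + sample_size],
        "end": tokens[n - sample_size:],
    }
```

Taking whole sentences or paragraphs rather than exact token counts would need
a different cut-off rule; the sketch deliberately ignores that.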

First questions: Is that in general terms adequate? Could it be any smaller (my
student is very worried that she won't cope analytically as most, if not all,
of the corpus will have to be analysed by hand)?

The second general issue is that it presumably takes different sizes of corpus
to establish different sorts of claims. At present, from what we have read, it
looks as if we will need to establish the presence or absence of statistically
significant contrasts for:

sentence/clausal length and complexity
word complexity
particular sorts of clausal constructions/syntactico-semantic configurations
  (e.g. the ratio of passives, and of active clauses with parts of a
  protagonist or abstract entities as subjects to dynamic verbs)
the 'upgraded' use of action and speech verbs (e.g. using 'shouted' rather than
  'said')
the prevalence of descriptions of characters' outward appearance, clothing etc.
incidence of various sorts of speech and thought presentation

Do you have any views on what size of sample would be needed to make safe
judgements about these various factors?
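
As an illustration of the kind of test we have in mind, the sketch below
computes a log-likelihood (G2) statistic for the frequency of one feature in
two subcorpora. This is only one possible test, chosen here as an assumption;
the function name and the figures in the usage note are hypothetical and are
not results from any corpus.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 statistic for one feature's frequency in two subcorpora.

    count_a, count_b: occurrences of the feature in each subcorpus.
    size_a, size_b: total number of words in each subcorpus.
    """
    total_count = count_a + count_b
    total_size = size_a + size_b
    # Expected counts if the feature were equally frequent in both subcorpora.
    expected_a = size_a * total_count / total_size
    expected_b = size_b * total_count / total_size
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2
```

With made-up figures of 100 occurrences in one 1,000-word subcorpus against 50
in another of the same size, the statistic is well above 3.84, the usual
critical value for p < 0.05 with one degree of freedom. The smaller the
samples, the larger the difference in raw counts needed to reach that value.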

Would it be better to reduce the size of the samples taken from each novel and
increase the number of novels from which passages are extracted?

Do you know of any software which might be used to analyse texts automatically
in these respects?

Do you know of any electronic versions of relevant novels which we might be
able to access?

Any other comments or suggestions would be greatly appreciated.

Mick Short
Professor of English Language and Literature,
Department of Linguistics and Modern English Language,
Lancaster University,
Lancaster LA1 4YT,
UK.
Telephone: ((0)1524) 593035
Fax: ((0)1524) 843085
email: m.short@lancaster.ac.uk
World Wide Web site: http://www.ling.lancs.ac.uk/