Re: Corpora: Sensible sizes for specialist corpora

Tony Berber Sardinha (tony4@uol.com.br)
Thu, 22 Jul 1999 10:27:10 -0300

Dear Mick

You might want to have a look at the sample sizes proposed by Doug Biber
for various linguistic features in:

Biber, D. (1990). Methodological issues regarding corpus-based analyses of
linguistic variation. *Literary and Linguistic Computing*, *5*, 257-269.

Biber, D. (1993). Representativeness in corpus design. *Literary and
Linguistic Computing*, *8*, 243-257.

Since these articles don't include listings for many of the major
morphosyntactic and syntactic units, I used Biber's methodology to estimate
the size of corpora needed to represent the main POS categories in general
and specialist corpora and came up with the following figures:

POS | General corpus | Specialist corpus
Verb | 67,187 | 13,848
Noun | 74,551 | 8,555
Adj | 149,694 | 21,234
Adv | 205,206 | 68,953
Pron | 913,256 | 40,945
Num | 1,180,815 | 91,161
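
For readers who want to try this kind of estimate on their own data, here is a minimal sketch. It uses the standard sample-size formula for estimating a mean, n = (t * s / e)^2, where s is the standard deviation of a feature's count across equal-sized text samples and e is the tolerable error (here 5% of the mean). Whether this matches the exact computation behind the figures above is an assumption; the function name, the 1,000-word sample length, and the counts are all hypothetical.

```python
import math
from statistics import mean, stdev

def required_samples(counts, tolerance=0.05, t=1.96):
    """Estimate how many equal-sized text samples are needed so that the
    mean count of a feature is within +/- tolerance * mean, at the
    confidence level implied by t (1.96 is roughly 95%)."""
    m = mean(counts)
    s = stdev(counts)                 # sample standard deviation
    tolerable_error = tolerance * m   # e.g. 5% of the mean count
    return math.ceil((t * s / tolerable_error) ** 2)

# Hypothetical noun counts per 1,000-word sample, invented for illustration.
noun_counts = [180, 200, 220, 190, 210]
SAMPLE_LENGTH = 1000                  # assumed words per text sample

n_samples = required_samples(noun_counts)
words_needed = n_samples * SAMPLE_LENGTH
print(n_samples, words_needed)
```

Features whose counts vary little from text to text relative to their mean give a small n, while rare or unevenly distributed features give a large one.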

The claim would be that corpora as large as or larger than those above
would be representative samples of the POS categories. The most frequent
and evenly distributed categories require smaller samples. Also, the
recommended sample sizes for the specialist corpus are always much smaller,
suggesting a high degree of closure (McEnery & Wilson 1996). The general
corpus figures were based on the Brown, LOB, and the written component of
the BNC sampler. The specialist corpus was an 11,761-word collection of
British job application letters.
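
As a rough illustration of how the figures could be used, the sketch below checks a planned corpus size against the table. It assumes the figures are minimum sizes in words; the dictionary layout and the function name are made up for this example, and 240,000 is the total mentioned in the quoted question.

```python
# Minimum corpus sizes from the table above, assumed to be word counts.
MIN_SIZE = {
    "Verb": {"general": 67_187, "specialist": 13_848},
    "Noun": {"general": 74_551, "specialist": 8_555},
    "Adj":  {"general": 149_694, "specialist": 21_234},
    "Adv":  {"general": 205_206, "specialist": 68_953},
    "Pron": {"general": 913_256, "specialist": 40_945},
    "Num":  {"general": 1_180_815, "specialist": 91_161},
}

def covered_categories(corpus_words, kind):
    """Return the POS categories whose minimum size is met by a corpus
    of corpus_words words; kind is 'general' or 'specialist'."""
    return [pos for pos, sizes in MIN_SIZE.items()
            if corpus_words >= sizes[kind]]

planned = 240_000  # total size mentioned in the quoted question
print(covered_categories(planned, "general"))
print(covered_categories(planned, "specialist"))
```

On these figures, a 240,000-word corpus would meet the general-corpus thresholds for verbs, nouns, adjectives and adverbs only, but all six specialist-corpus thresholds. Which column applies to a fiction corpus is a judgement call.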

cheers
tony
-------------------------------
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
tony4@uol.com.br
http://sites.uol.com.br/tony4/homepage.html
http://homepages.infoseek.com/~corpuslinguistics/homepage.html
-------------------------------

----------
> From: Short, Mick <m.short@lancaster.ac.uk>
> To: 'corpora@hd.uib.no'
> Subject: Corpora: Sensible sizes for specialist corpora
> Date: 19 July 1999 10:17
>
> I have a PhD student who wants to establish for her thesis a small corpus
> of writings from serious fiction and popular fiction in order to
> investigate whether the claims made by critics about the linguistic
> differences in the two genres have any basis in reality. Our current
> intention is (1) to establish a corpus of two serious and two popular
> fiction novels (with an equalised division between male and female
> authors) for each of the five decades from the 1950s to the 1990s (a total
> of 20 novels) and (2) to take three 2,000-word samples from each (from the
> beginning, middle and end of each), thus giving a total of 240,000 words.
> This matches roughly the size of my own speech, thought and writing
> presentation corpus and Geoffrey Sampson's Susanne corpus.
>
> First questions: Is that in general terms adequate? Could it be any
> smaller (my student is very worried that she won't cope analytically as
> most, if not all, of the corpus will have to be analysed by hand)?
>
> The second general issue is that it presumably takes different sizes of
> corpus to establish different sorts of claims. At present, from what we
> have read, it looks as if we will need to establish the presence or
> absence of statistically significant contrasts for:
>
> sentence/clausal length and complexity
> word complexity
> particular sorts of clausal constructions/syntactico-semantic
>   configurations (e.g. the ratio of passives, and of active clauses with
>   parts of a protagonist or abstract entities as subjects to dynamic verbs)
> the 'upgraded' use of action and speech verbs (e.g. using 'shouted'
>   rather than 'said')
> the prevalence of descriptions of characters' outward appearance,
>   clothing etc
> incidence of various sorts of speech and thought presentation
>
> Do you have any views on what size of sample would be needed to make safe
> judgements about these various factors?
>
> Would it be better to reduce the size of the samples taken from each
> novel and increase the number of novels passages are extracted from?
>
> Do you know of any software which might be used to analyse texts
> automatically in these respects?
>
> Do you know of any electronic versions of relevant novels which we might
> be able to access?
>
> Any other comments or suggestions would be greatly appreciated.
>
> Mick Short
>
> Mick Short
> Professor of English Language and Literature,
> Department of Linguistics and Modern English Language,
> Lancaster University,
> Lancaster LA1 4YT,
> UK.
> Telephone: ((0)1524) 593035
> Fax: ((0)1524) 843085
> email: m.short@lancaster.ac.uk
> World Wide Web site: http://www.ling.lancs.ac.uk/
>
>
>