Corpora: Adequate size for a specialist corpus -- again

Gordon and Pam Cain (gpcain@rivernet.com.au)
Tue, 07 Sep 1999 19:56:06 +1000

I am doing research with a very small specialist corpus, divided into
two sub-corpora. I have collected in electronic format 14 essays of
around 1,000 words each, written by business students. The question they
answered was much more highly constrained than is usual in academic
work, so there is little variation in subject matter.

I aim to compare the lexical features of the more successful papers
against the less successful papers (as defined by the mark received).
Fortunately this results in two subcorpora of 7 essays each, of the
following sizes:

high: ~7700 tokens over 7 essays
low: ~8500 tokens over 7 essays
Total: 16182 tokens; 101 kilobytes
Each essay is by a separate author.
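
Because the two sub-corpora differ in size, raw counts are not directly
comparable. A minimal sketch of one common way to compare a single word
across them is given here: a rate per 1,000 tokens plus the two-cell
log-likelihood (G2) statistic often used for keyword comparison. The
function names and the word counts (20 and 5) are hypothetical; only the
sub-corpus sizes come from the figures above. The statistic treats tokens
as independent, which is optimistic with only 7 texts per sub-corpus, so
any threshold should be read as indicative at best.

```python
import math

def per_thousand(count, corpus_size):
    """Normalised frequency: occurrences per 1,000 tokens."""
    return 1000.0 * count / corpus_size

def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-cell log-likelihood (G2) for one word in two corpora."""
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # 0 * ln(0) is taken as 0
            ll += observed * math.log(observed / expected)
    return 2.0 * ll

# Hypothetical counts for one word: 20 in "high", 5 in "low".
HIGH_SIZE, LOW_SIZE = 7700, 8500  # approximate sizes from the post
print(per_thousand(20, HIGH_SIZE), per_thousand(5, LOW_SIZE))
print(log_likelihood(20, HIGH_SIZE, 5, LOW_SIZE))
```

A value above 3.84 corresponds to p < 0.05 on one degree of freedom, under
the independence assumption noted above.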

One of my aims is to compare these sub-corpora against each other. I
realise that a lot more data would be desirable; however, this is all
that was forthcoming. (Corpus size was a thread in July, and Tony
Berber-Sardinha posted some stats on the minimum size needed to represent
the main POS categories in specialist corpora. However, I think that my
problem is somewhat different from representing POS categories, if I
understand that aright.)

My questions are:
1. Is this sample size adequate for anything? Would it be valid even to
state that the results are suggestive or indicative? (Already, playing
with this data using WordSmith has turned up some interesting and
consistent lexical differences between the sub-corpora that hold across
most of the papers.)

2. If I focus on high-frequency lexis that is found across most of the
texts (or most of the texts in one of the sub-corpora), can I then get
valid results from a smaller corpus than I would need for less frequent
items? After
all, I am trying to characterise only some aspects of the lexis of more
and less successful writing within a genre, not conduct exhaustive
analysis.

3. At what point should I stop thinking that there may be some validity
in my results?

4. As a last resort, should I collect more data from a different but
similar assignment, and run parallel analyses on that data too? After
all, if tests on two corpora of less-than-ideal size produce similar
results, wouldn't that tend to confirm the validity of my findings?

5. Anyone know any brilliant articles on this topic for a self-educated
corpus fan like me?
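
On question 2, the idea of restricting attention to lexis "found across
most of the texts" can be made concrete by counting, for each word, how
many texts it occurs in (its range) and keeping only words that reach a
chosen minimum. The sketch below assumes each essay has already been
tokenised into a list of lower-cased words; the function names, the toy
texts, and the threshold are all hypothetical.

```python
from collections import Counter

def text_range(texts):
    """Map each word to the number of texts it occurs in."""
    counts = Counter()
    for tokens in texts:
        # set() so a word counts once per text, however often it repeats
        counts.update(set(tokens))
    return counts

def widely_used(texts, min_texts):
    """Words occurring in at least min_texts of the texts, sorted."""
    ranges = text_range(texts)
    return sorted(word for word, n in ranges.items() if n >= min_texts)

# Toy stand-ins for tokenised essays (not real data).
toy_texts = [
    ["the", "market", "grows", "the"],
    ["the", "market"],
    ["the", "firm"],
]
print(widely_used(toy_texts, 2))
```

With 7 essays per sub-corpus, a threshold such as 5 of 7 would be one way
to read "most of the texts"; the choice is a judgment call rather than a
standard.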

Thanks in advance for all help.
Gordon

-- 
Gordon Cain, Teacher of ESOL
TAFE International Education Centre, Liverpool
Sydney, Australia
gpcain@rivernet.com.au