summary on student corpus

colber@mbm1.scu.edu.tw
Thu, 26 Jun 1997 06:54:15 +0800

Dear List subscribers working with student corpora,

About a month ago, I posted a message asking for information/guidance on
English student corpora. This is the summary of the responses.

Valuable information has been sent by

Jochen Leidner <leidner@linguistik.uni-erlangen.de>
Marianne Hundt <hundt@ruf.uni-freiburg.de>
Pieter de Haan <P.deHaan@let.kun.nl>
Pascual Cantos <pcantos@fcu.um.es>
Tao, Hongyin <chstaohy@leonis.nus.sg>
Antoinette Renouf <ant@rdues.liv.ac.uk>
G. Nelson <uclegen@ucl.ac.uk>
George C. Demetriou <george@scs.leeds.ac.uk>
Su-hsun Tsai <teemsht@ioe.ac.uk>
Gui Shichun <itscgui@scut.edu.cn>
Kojiro Asao <kojiasao@keyaki.cc.u-tokai.ac.jp>
Oliver Strunk <strunk@lingua.fil.ub.es>
John Milton <lcjohn@uxmail.ust.hk>
Joel Walters <waltej@ashur.cc.biu.ac.il>
Craig McL. Wallace <craigwal@iconz.co.nz>
Jeff Williamson <NVWILLJ@NV.CC.VA.US>
David Keith MURPHY <aircai@worldnet.fr>
Suzanne E Kemmer <kemmer@ruf.rice.edu>
Bas Aarts <b.aarts@ucl.ac.uk>
E.J. Adolphson <ejaz8d1@cat.com>
Ronald W. Long <rwlong@iland.net>
David M Lewis <lewis@iaehv.nl> and
Vance Stevens <102005.65@COMPUSERVE.COM>

from which I personally learned much and for which I am grateful.

(1) Regarding the (electronic) FORMAT in which the corpus is typed,
collected (pasted) and saved, many respondents pointed out that the only
important requirement is that it should be a "text only" ASCII file, or
group of files in a single directory. Whether there are line and paragraph
breaks, extra headings with file/code numbers added, etc. does not affect
the results of linguistic research. Concordancers and other text analyzing
applications can all handle such files.

(2) The corpus itself, for the same reason, can be either in the format
of ONE FILE or a (long) SERIES OF FILES. Research can be done in either
format, though each format has its advantages.
There were some suggestions (in particular one by Eric Adolphson) that it
is better to keep each piece of writing as an individual file (that is, not
to merge them), to place these files in one directory, and then to use a
text indexer to index the files and a fast text search engine to search them
(for example, Adolphson uses Open Text). This way, various global searches
can be done, and at the same time the search engine will also show where, in
which file, every instance of an error or item studied is located. Viewing
any of the original pieces separately, and in its entirety, is also easier
this way than when all the original pieces are merged into one long file.
(MY REMARK: In the WordSmith concordancer one can view one screenful of the
context above and below a selected word.) -- On the other hand, word counts,
for example, are easier to do when the whole corpus is one file.
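
The per-file arrangement described in (2) can be sketched in a few lines of
Python. This is only a minimal illustration, not Open Text or WordSmith: the
function names (load_corpus, concordance, word_count), the .txt extension
and the 30-character context width are all assumptions made for the example.

```python
import re
from pathlib import Path


def load_corpus(directory):
    """Read every .txt file in a directory into a {file name: text} mapping."""
    return {
        path.name: path.read_text(encoding="ascii", errors="replace")
        for path in sorted(Path(directory).glob("*.txt"))
    }


def concordance(corpus, word, width=30):
    """Return (file name, context) pairs for each whole-word match of `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for name, text in corpus.items():
        for match in pattern.finditer(text):
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(text))
            # Collapse line breaks so that each hit fits on one line.
            context = " ".join(text[start:end].split())
            hits.append((name, context))
    return hits


def word_count(corpus):
    """Count whitespace-separated tokens over all files in the corpus."""
    return sum(len(text.split()) for text in corpus.values())
```

With the essays held as a mapping from file name to text, every hit reports
the file it came from, and a whole-corpus word count can still be obtained by
summing over the individual files.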

(3) Several respondents pointed out that the original text should be
reproduced in the corpus exactly as it was written or typed by the student,
including all SPELLING, PUNCTUATION and CAPITALIZATION mistakes or errors.
This is because some researchers might want to analyze precisely these
aspects of the text. So, at the least, a note should be added to each corpus
stating whether these features were retained when the corpus was compiled.
(MY REMARK: I am compiling the corpus here, initially and basically,
just to examine LEXICAL, COLLOCATIONAL and GENERAL WORD USAGE errors. By
simple concordancing methods I expect to be able to show the ratio of
correct to incorrect uses of particular words by students here (for example,
how many times they use "except" for "besides" or "realize" for "understand,"
and scores of similar errors); this ratio indicates the seriousness of the
error. I keep in mind the distinction, made by S.
Pit Corder and subsequently by many others in the error analysis field,
between simple mistakes or performance slips on the one hand and recurring,
characteristic errors on the other. So, when I am typing, I correct what I
judge to be just random SPELLING or typing mistakes -- obvious slips like
typing "hte" for "the" or writing "fell" for "feel" -- in the students'
works, as these are "not interesting" from any linguistic point of view.
Recurrent, often seen spelling errors like "benifit," "vedio," "writting,"
or "morden" (for "modern"), however, I type as they are in the original and
tag them with the correct spelling. As to most PUNCTUATION and
CAPITALIZATION errors, however, I retain them because there are significant
and characteristic differences in this respect between Chinese and English
texts and the students are often influenced by their native Chinese usage.)
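
The ratio mentioned in the remark above can be computed once each concordance
line has been judged by hand. A minimal sketch, assuming the judgments are
recorded as (word, is_correct) pairs; that record format and the name
error_ratios are hypothetical, not something any respondent described.

```python
from collections import Counter


def error_ratios(judgments):
    """Return {word: share of incorrect uses} from (word, is_correct) pairs.

    Each pair stands for one concordance line that a human reader has
    judged to be a correct or an incorrect use of the word.
    """
    totals = Counter()
    errors = Counter()
    for word, is_correct in judgments:
        totals[word] += 1
        if not is_correct:
            errors[word] += 1
    return {word: errors[word] / totals[word] for word in totals}
```

A word that is misused in two of its three occurrences gets a ratio of 2/3;
the higher the ratio, the more characteristic the error.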

(4) Many have stressed (Jeff Williamson in particular) that INFORMATION
ABOUT THE CONDITIONS in which each piece was written should be recorded, the
more information the better. This information can be the topic, the time
limits of writing, the student's level of English, age, gender, etc.
(MY REMARK: In the corpus here, coded headings indicate the topic or
genre, whether the student is an English major or not, and whether he/she
is a regular (day school) student or an evening school student. All the same,
separate analyses of these groups would not be significant: day school
students obviously write better and their writings contain fewer errors than
those of evening school students. So all the students here whose writings
are included make up practically just one single group of subjects: Chinese
native speaker college juniors and seniors with 6-7 years of local English
instruction behind them, using a considerably fossilized interlanguage of
English.)

(5) Several respondents described EFL student CORPORA already compiled
or "under construction" by themselves or associates.

- Kojiro Asao, at Tokai University, Japan, and six collaborators from
other institutions, are working on a corpus of English by Japanese
learners. From their web site, http://www.lb.u-tokai.ac.jp/lcorpus/
a sample section of the corpus can be downloaded.

- Gui Shichun, Department of English, Guangdong University of Foreign
Studies, Guangzhou, P.R.China, and associates, work on Corpus-Based
Analysis of Chinese Learner English (CBACLE), aiming at compiling
a one million word corpus representing secondary school to postgraduate
level learners of English, tagging the corpus both for grammatical
and pragmatic errors.

- John Milton, The Hong Kong University of Science & Technology, is
working on a corpus of Hong Kong (Cantonese speaking) students.
Students at the university have to submit an electronic copy of all
of their English assignments, which are then logged centrally by a
server. The corpus collected this way now contains over 10,000,000
words. The corpus is then tagged by software for parts of speech,
and tagged manually for error.

- Joel Walters, Department of English, Bar-Ilan University, Ramat Gan,
Israel, also has an EFL corpus (mostly of native Hebrew and Russian
speakers), already compiled, which he is in the process of analyzing
with various software.

(6) Most of these student corpora developed by individual, or small
group, researchers are tagged, or are in the process of being tagged, for
errors. However, it was noted by many that ERROR TAGGING in EFL corpora is
still far from being globally standardized; many have difficulty
establishing a satisfactory system of error tagging.

(7) References have been given by respondents to two institutions which
compile and work on LARGER EFL STUDENT CORPORA.

- the International Corpus of Learner English (ICLE)
a project at the University of Louvain, Belgium, in progress since 1990,
done in cooperation with the International Corpus of English (ICE) at
University College London; the ICLE project is headed by

Professor Sylviane Granger
Centre for English Corpus Linguistics
Universite Catholique de Louvain
College Erasme
Place Blaise Pascal 1
B-1348 Louvain-la-Neuve
Belgium
e-mail: granger@etan.ucl.ac.be
http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html or
http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/introduction.html

- Longman publishers, in the UK, has also been reported to have a
large, over 5 million word, student corpus collected from 70 countries; it
is called the Longman Learner's Corpus. They use it when compiling
dictionaries for learners. I haven't, however, received any more detailed
reference, such as an e-mail address or web site, for inquiries about this
corpus.

(MY REMARK: Very little information is available about how corpora, small
or large, have been or are actually being compiled. John Milton was the only
one in this round of exchanges who gave the very useful information that a
school can centrally collect electronic copies of students' writings. Are there
any other practical ways of doing this? I don't know how other respondents
collected their corpora. And how, in detail, do researchers at ICLE or at
Longman compile their corpora? -- Newcomers or less experienced researchers
in this field could greatly benefit from knowing these details. -- As for
me, as briefly said in my first posting, I usually take out a few, say 4-5
out of 30, assignments, and make xerox copies of them, before I return them
to the students. These are certainly not the best ones in the bunch;
usually they are just those whose content is for some reason or other
interesting, even though the grammar in them is poor. At the end of the
term or the year, then, I type them up.)

(8) Jeff Williamson brought up the question whether it would be
necessary to get written PERMISSION from the subjects (students) to use
their writings in the corpus. He says that for short, phrase- or
sentence-length examples this would not be necessary, but for full-length
essays it would be. (MY REMARK: What do you on the List think about this?
Williamson, I believe, speaks about the situation in the US. Would this be
legally required in other countries, worldwide? I have never thought about
asking my students' permission. Actually on many of the xerox copies I made
years ago the student's name is often left out. Maybe now, at the beginning
of the next school year, during the first class, I'm going to circulate a
sheet with the relevant statement, which they then could sign giving me this
permission.)

(9) Suggestions have been made (in particular by Oliver Strunk) to
COMPARE the EFL corpus with a corpus of writings in the same genre by native
English speakers (e.g. respective corpora of college or high school essays
written by native and non-native students).
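
A comparison of the kind suggested in (9) could start from normalized word
frequencies. This is a rough sketch only: the tokenization (lower-cased runs
of letters and apostrophes), the per-1,000-word normalization and the
function names are assumptions for illustration, and a serious comparison
would also need a significance test.

```python
import re
from collections import Counter


def relative_frequencies(text, per=1000):
    """Return {word: occurrences per `per` running words} for one corpus."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: count * per / len(tokens) for word, count in counts.items()}


def compare(learner_text, native_text, words, per=1000):
    """Return {word: (learner rate, native rate)} for the words of interest."""
    learner = relative_frequencies(learner_text, per)
    native = relative_frequencies(native_text, per)
    return {w: (learner.get(w, 0.0), native.get(w, 0.0)) for w in words}
```

Normalizing by corpus size matters here because the learner and native
corpora will rarely be the same length, so raw counts cannot be compared
directly.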

(10) Vance Stevens and Suzanne Kemmer also recommended web sites where
FURTHER INFORMATION, references and bibliography can be found about corpora
and concordancing:

Catherine Ball's Tutorial: Concordances and Corpora at
http://www.georgetown.edu/cball/corpora/tutorial.html

Michael Barlow's Corpus Linguistics page at
http://www.ruf.rice.edu/~barlow/corpus.html

Tim Johns' Data-driven Learning Page (info on concordancing) at
http://web.bham.ac.uk/johnstf/timconc.htm

- - - - -

As I reported in my first posting, an interim (not final) set (about 120,000
words, one third of the planned whole) of the EFL corpus I'm working on now
will shortly be ready for downloading from the University's FTP server here;
only a few technical points have to be decided. I'll let everyone who
requested notice know when the file is available.

Best wishes to all,

Colman Bernath

-------------------
Colman Bernath
c/o Department of English
Soochow University, Taipei, TAIWAN
colber@mbm1.scu.edu.tw