student corpus - guidance sought

colber@mbm1.scu.edu.tw
Tue, 27 May 1997 08:18:24 +0800

Has anyone on the List compiled or worked with STUDENT corpora?

I am putting together a corpus of unedited English writing by Chinese
college students. The aim is to analyze this corpus with a concordancer and
other programs, and to obtain quantitative information on how frequent
certain characteristic errors and other non-native word usages are in their
writing. Such information could be very valuable in setting syllabuses and
directions for secondary school English instruction.

The corpus is planned to contain about 300,000 words, made up of 800-1000
written assignments of 150-400 words each, typed and saved as text files.
About one third of these assignments have already been typed (entered).

I have not so far used any other STUDENT corpus, from any country. So my
question is: are there any STANDARDS, that is, generally used or accepted
electronic formats, in which such corpora are compiled, saved, and prepared
for use by others?

Below I briefly describe how the corpus is being compiled. I would be very
grateful for suggestions or comments on whether this procedure is OK, or
whether anything should be changed to comply with accepted formats.

-- Each piece is typed in Word 6.0 (under Windows 3.1) in a fixed-width
font, with lines about 70 characters long. The text is typed unedited and
uncorrected (only obvious spelling and punctuation mistakes made by the
students are corrected).

-- An 8-12 character code (number) is typed on the first line. Then one
line is skipped, and the heading (headline) of the piece, as written by the
student, is typed.

-- Paragraphing follows the original, with blank lines between the paragraphs.

-- Before saving the text, possible spelling and other errors made in the
typing process are checked and corrected using Word's spell checker.

-- Then each piece is saved as a "text only with line breaks" file and given
a file name (number).

-- All these files are placed in one directory and backed up to prevent
accidental erasure.

-- The files are then merged into one long file with a simple merge utility.
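
For illustration, the merge step could be sketched in Python roughly as
follows. This is only a sketch under stated assumptions, not the utility
actually used: the `*.txt` extension, the `merge_pieces` name, and the
latin-1 encoding are all hypothetical choices. It relies on each piece
starting with its code line, so that the code line marks the start of a new
piece in the merged file.

```python
from pathlib import Path


def merge_pieces(src_dir, out_path, encoding="latin-1"):
    """Concatenate every *.txt piece in src_dir into one file.

    Assumptions (not from the original procedure): pieces carry a .txt
    extension and are stored in a single-byte encoding. latin-1 is used
    here only because it round-trips every byte value unchanged.
    Returns the number of pieces merged.
    """
    src = Path(src_dir)
    out = Path(out_path).resolve()
    count = 0
    with open(out, "w", encoding=encoding, newline="\n") as merged:
        # Sorting by file name keeps the merge order reproducible.
        for piece in sorted(src.glob("*.txt")):
            if piece.resolve() == out:
                continue  # never merge the output file into itself
            text = piece.read_bytes().decode(encoding)
            # Normalise DOS/Mac line endings to "\n" and trim stray
            # blank lines at either end of the piece.
            text = text.replace("\r\n", "\n").replace("\r", "\n").strip("\n")
            # One blank line separates consecutive pieces.
            merged.write(text + "\n\n")
            count += 1
    return count
```

Calling `merge_pieces("pieces", "corpus_merged.out")` (both names are
placeholders) would write the merged file and return the number of pieces
it contains, which can be compared against the expected count.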

So far I have tried loading one long consolidated file, comprising about 350
pieces of writing (about 120,000 words), into a concordancer (WordSmith
Tools), and there seem to be no problems. Would files compiled this way
ALSO be USABLE in other concordancers or text processing/analysis programs?
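
One rough way to probe portability before trying other programs is to check
that each file really is plain text. The Python sketch below is an
illustration only: the function name, the 80-character limit, and the choice
of checks are my assumptions, and passing them does not guarantee that any
particular concordancer will accept the file. It flags bytes outside 7-bit
ASCII, stray control characters, and unexpectedly long lines.

```python
def check_plain_text(data, max_line_len=80):
    """Return a list of (line_number, problem) pairs for a file's bytes.

    The checks are heuristic assumptions about what tends to trip up
    plain-text tools; they are not a published corpus standard.
    """
    problems = []
    # bytes.splitlines() breaks on \n, \r and \r\n.
    for lineno, line in enumerate(data.splitlines(), start=1):
        if any(b > 0x7F for b in line):
            problems.append((lineno, "non-ASCII byte"))
        if any(b < 0x20 and b != 0x09 for b in line):
            # Tabs are tolerated; other control characters are flagged.
            problems.append((lineno, "control character"))
        if len(line) > max_line_len:
            problems.append((lineno, "long line"))
    return problems
```

An empty result means none of these particular problems were found; each
reported pair gives the line to inspect in the original piece.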

Please send your comments either to the List or to me. I could certainly
summarize the replies sent to me and post the summary to the List.

I would also be very happy to eventually make this corpus available to
anyone interested in using it, or to exchange it for similar learner corpora
on file, based on the writing of other Chinese or Japanese students, or of
English-learning college students in any country.

Best to all,

Colman Bernath
-------------------
Colman Bernath
c/o Department of English
Soochow University, Taipei, TAIWAN
colber@mbm1.scu.edu.tw