Untraditional spellings in corpora

Su-hsun Tsai (teemsht@ioe.ac.uk)
Wed, 2 Jul 1997 13:38:13 +0100

A while back Colman Bernath mentioned that, while compiling a student
corpus, he corrected students' misspellings such as "hte" to "the" and "fell"
to "feel," but left "benifit," "vedio," and "writting" as they were, because he
saw these errors as characteristic of the students' writing. I have a similar
question and hope to get some leads from the experts on this list.

I am doing research on the linguistic variation found in the on-line
meetings of a small group of EFL teachers. I found that many words
were shortened or misspelled in various ways, like "eaves (waves),"
"shushes (hushes)," "hafta (have to)," "diedn't (didn't)," "ppl (people)," "yrs
(years)," "rl (real life)," "environs (environments)," "claustrephobic
(claustrophobic)," "ho (how)," "w/ (with)," "it (for sentence-initial It)," "i
(for the 1st person pronoun I)," "y'all (you all)," and many more.

If I correct the above "errors," I would turn the corpus from an authentic one
into my idealized one, which I don't think would be appropriate. If I leave
them as they are, they will definitely affect the quantitative findings I get
from running a concordancer: for example, a search for negation won't include
"diedn't"; a search for personal pronouns won't find "i" or "y('all)"; a search
for subordinators will miss "ho"; and so on.
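
To make the trade-off concrete: one option I can imagine is to leave the
text untouched and record a standardized form next to each token, so that
searches can run on either layer. The Python below is only a minimal sketch
of that idea, not a tested method. The variant table uses a few of the forms
listed above, while the names VARIANTS, annotate and count_form and the
sample sentence are made up for illustration.

```python
# Hypothetical variant table: attested spelling -> standardized form.
# A real table would have to be built by hand from the corpus itself.
VARIANTS = {
    "diedn't": "didn't",
    "hafta": "have to",
    "ppl": "people",
    "yrs": "years",
    "ho": "how",
    "w/": "with",
    "i": "I",
    "y'all": "you all",
}

PUNCTUATION = '.,;:!?"()'


def annotate(text):
    """Return (original, normalized) pairs; the original token is never altered."""
    pairs = []
    for raw in text.split():
        token = raw.strip(PUNCTUATION)
        if not token:
            continue
        # Look the token up in lower case; fall back to the token itself.
        normalized = VARIANTS.get(token.lower(), token)
        pairs.append((token, normalized))
    return pairs


def count_form(pairs, target, layer):
    """Count exact matches of target on the 'original' or 'normalized' layer."""
    index = {"original": 0, "normalized": 1}[layer]
    return sum(1 for pair in pairs if pair[index] == target)


# Invented sample sentence, not taken from the meeting logs.
sample = "i diedn't know ho ppl do it, but I didn't ask."
pairs = annotate(sample)

print(count_form(pairs, "didn't", "original"))    # 1
print(count_form(pairs, "didn't", "normalized"))  # 2
print(count_form(pairs, "how", "original"))       # 0
print(count_form(pairs, "how", "normalized"))     # 1
```

Counting on the normalized layer recovers "diedn't" and "ho," while the
original layer still shows exactly what the participants typed. Forms that
are also ordinary words, such as "it" for "It" or "eaves" for "waves," could
not be handled by a simple table like this and would need checking by hand.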

How would you deal with them if you were doing similar research now? I
would very much appreciate any suggestions.

Su-hsun, research student
IOE, U. of London
teemsht@ioe.ac.uk