Re: Untraditional spellings in corpora

Bill Fisher (william.fisher@nist.gov)
Thu, 3 Jul 1997 11:08:06 -0400

On Jul 2, 2:41pm, Su-hsun Tsai wrote:
...
> I am doing a research on the linguistic variation characterized in the on-line
> meetings among a small group of EFL teachers. I found that many words
> were shortened or misspelled in different manners, like "eaves (waves),"
> "shushes (hushes)," "hafta (have to)," "diedn't (didn't)," "ppl (people),"
> "yrs (years)," "rl (real life)," "environs (environments)," "claustrephobic
> (claustrophobic)," "ho (how)," "w/ (with)," "it (It at sentence initial)," "i (I
> for 1st person pronoun)," "y'all (you all)," and many more.
>
> If I correct the above "errors," I would change it from an authentic to my
> ideal corpus that I don't think it would be appropriate. If I leave them as they
> are, it would definitely influence my quantitative finding resulted from
> operating a concordancer, such as negation won't include "diedn't"; personal
> pronoun won't find "i," "y('all)"; subordinator would exclude "ho"; and many
> others.
>
> How would you deal with theme if you are doing a similar research now? I
> would appreciate very much for any suggestions.
>
...

One way to handle this is to transcribe both the odd spelling and
the standard spelling, one of them specially marked off somehow. For
instance, you could use square brackets to transcribe the standard
form, so that some of your above examples would be:

... eaves [waves] ...
... hafta [have to] ...

(If the boundaries of the item not enclosed in brackets are
unclear, you'll have to go to explicit boundaries on both
terms of the pair, such as "{hafta / have to}".)
The marks, which are really meta-text annotations, should not be
something that would occur in normal text transcription. And of
course you can get more sophisticated in at least two ways:
1) use SGML-type markers and SGML parsing tools; 2) add additional
special symbols to indicate a category of deviation, such as
spelling error vs. informal pronunciation vs. standard abbreviation.
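As a rough illustration of combining both refinements, here is a minimal
Python sketch. The <var> element and its "std" and "type" attributes are
hypothetical names invented for this example, as are the category labels
"spell", "pron" and "abbr"; they do not come from any existing markup
standard. Python's html.parser is used only as a convenient tolerant tag
reader, not as a real SGML parser.

```python
from collections import Counter
from html.parser import HTMLParser


class VariantCollector(HTMLParser):
    """Collect (original, standard, category) triples from <var> elements.

    The <var std="..." type="..."> markup is a made-up convention for
    this sketch, not an established annotation scheme.
    """

    def __init__(self):
        super().__init__()
        self.variants = []
        self._open = None  # attributes of the <var> currently being read
        self._text = []    # text pieces seen inside that <var>

    def handle_starttag(self, tag, attrs):
        if tag == "var":
            self._open = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "var" and self._open is not None:
            original = "".join(self._text)
            self.variants.append(
                (original, self._open.get("std"), self._open.get("type"))
            )
            self._open = None


sample = ('we <var std="have to" type="pron">hafta</var> meet '
          '<var std="people" type="abbr">ppl</var> who are '
          '<var std="claustrophobic" type="spell">claustrephobic</var>')

collector = VariantCollector()
collector.feed(sample)
collector.close()

# Tally how often each category of deviation occurs.
by_category = Counter(category for _, _, category in collector.variants)
```

With the categories recorded this way, a count of spelling errors versus
informal pronunciations falls out of the annotation itself.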

You can then write simple text-processing code that will produce
either version of transcription, depending on what you want to
do with it.
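
To make that concrete, here is a small Python sketch that reads the
bracket and brace notations described above and emits either version.
The function name render and the two regular expressions are my own
assumptions for illustration. The sketch also assumes that the bracket
form always follows a single whitespace-delimited token and that "/"
never occurs inside either member of a brace pair (so "w/" would have
to use the bracket form).

```python
import re

# Assumed notation, following the two conventions described above:
#   hafta [have to]     one odd token followed by its standard form
#   {rl / real life}    explicit boundaries on both members of the pair
BRACE_PAIR = re.compile(r"\{\s*([^{}/]+?)\s*/\s*([^{}/]+?)\s*\}")
BRACKET_PAIR = re.compile(r"(\S+)\s+\[([^\[\]]+)\]")


def render(text, version="standard"):
    """Return the text with only the original or only the standard forms."""
    if version not in ("original", "standard"):
        raise ValueError("version must be 'original' or 'standard'")
    # Group 1 holds the form as typed, group 2 the standard spelling.
    group = 1 if version == "original" else 2
    text = BRACE_PAIR.sub(lambda m: m.group(group), text)
    text = BRACKET_PAIR.sub(lambda m: m.group(group), text)
    return text


sample = "i [I] hafta [have to] meet ppl [people] in {rl / real life}"

as_typed = render(sample, "original")
normalized = render(sample, "standard")
```

The normalized output can go to the concordancer, while the as-typed
output preserves the authentic text for studying the variation itself.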

- Bill Fisher
