Re: Untraditional spellings in corpora

Oliver Strunk (strunk@lingua.fil.ub.es)
Fri, 04 Jul 1997 09:12:05 +0200

I did the same thing some time ago, when I started a similar project. The
only thing I would recommend adding is a marker that also lets you identify
the error, e.g.:

... [/error:eaves] [/corrected:waves] ...

Or, even better for processing:

... <error>eaves</enderror><corrected>waves</endcorrected>
<errorcode>orthographic</enderrorcode> ...

It's a little more complicated, but this way you can be sure that you will
later be able to identify every part, especially when the error concerns not
just a single word but a whole sentence (verb position and so on).
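
To give an idea of the processing this makes possible, here is a minimal
Python sketch. It assumes the tag names used in the example above, that an
<errorcode> element (if present) directly follows its <corrected> element,
and that annotations are not nested. The function names find_annotations and
render are only illustrative.

```python
import re

# Assumed notation: <error>..</enderror><corrected>..</endcorrected>,
# optionally followed by <errorcode>..</enderrorcode>.
ANNOTATION = re.compile(
    r"<error>(?P<error>.*?)</enderror>\s*"
    r"<corrected>(?P<corrected>.*?)</endcorrected>"
    r"(?:\s*<errorcode>(?P<code>.*?)</enderrorcode>)?",
    re.DOTALL,  # an annotation may be wrapped across lines
)


def find_annotations(text):
    """Return an (error, corrected, code) tuple for every annotated span."""
    return [
        (
            m.group("error").strip(),
            m.group("corrected").strip(),
            (m.group("code") or "").strip(),  # empty string if no code given
        )
        for m in ANNOTATION.finditer(text)
    ]


def render(text, version="corrected"):
    """Replace each annotation by its 'error' or its 'corrected' form."""
    if version not in ("error", "corrected"):
        raise ValueError("version must be 'error' or 'corrected'")
    return ANNOTATION.sub(lambda m: m.group(version).strip(), text)


sample = (
    "the <error>eaves</enderror><corrected>waves</endcorrected>"
    "<errorcode>orthographic</enderrorcode> broke"
)
print(find_annotations(sample))
print(render(sample, "error"))
print(render(sample, "corrected"))
```

Because the error, the correction and the code each have explicit start and
end marks, the same pattern works unchanged when the annotated span is a
whole clause rather than a single word.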

I'm doing this for German.

Oliver Strunk
strunk@lingua.fil.ub.es

>
> One way to handle this is to transcribe both the odd spelling and
>the standard spelling, one of them specially marked off somehow. For
>instance, you could use square brackets to transcribe the standard
>form, so that some of your above examples would be:
>
> ... eaves [waves] ...
> ... hafta [have to] ...
>
> (If the boundaries of the item not enclosed in brackets are
>unclear, you'll have to go to explicit boundaries on both
>terms of the pair, such as "{hafta / have to}".)
> The marks, which are really meta-text annotations, should not be
>something that would occur in normal text transcription. And of
>course you can get more sophisticated in at least two ways:
>1) use SGML-type markers and SGML parsing tools; 2) add additional
>special symbols to indicate a category of deviation, such as
>spelling error vs. informal pronunciation vs. standard abbreviation.
>
> You can then write simple text-processing code that will produce
>either version of transcription, depending on what you want to
>do with it.
>
> - Bill Fisher
>
>
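
The "simple text-processing code" mentioned in the quoted message could look
roughly like the following Python sketch. It assumes the two notations quoted
above: "odd [standard]", where the odd form is taken to be the single token
right before the bracket, and "{odd / standard}" for explicit boundaries. It
also assumes that neither form contains a slash, brace or bracket itself. The
function name transcription is only illustrative.

```python
import re

# {odd / standard}: explicit boundaries on both members of the pair.
BRACED = re.compile(
    r"\{\s*(?P<odd>[^{}/]+?)\s*/\s*(?P<std>[^{}/]+?)\s*\}"
)
# odd [standard]: the odd form is assumed to be one whitespace-delimited token.
BRACKETED = re.compile(
    r"(?P<odd>[^\s\[\]]+)\s*\[(?P<std>[^\[\]]+)\]"
)


def transcription(text, version="standard"):
    """Return the text with only the 'standard' or only the 'odd' forms."""
    group = {"standard": "std", "odd": "odd"}[version]
    for pattern in (BRACED, BRACKETED):
        text = pattern.sub(lambda m: m.group(group).strip(), text)
    return text


line = "the eaves [waves] {hafta / have to} stop"
print(transcription(line, "standard"))
print(transcription(line, "odd"))
```

The braced form is handled first so that its contents are never mistaken for
a bracketed pair.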