AW: Corpora: List of abbreviations

Sabathy, Hellfried (Hellfried.Sabathy@bifab.de)
Wed, 6 May 1998 08:46:33 +0200

> "Manuel J. Maña López" wrote:
> >
> > Hello,
> >
> > I am looking for a list of abbreviations of common use in English
> > (such as Ltd., Mr., Inc., ...). I have found some of them in Internet
> > but they include a lot of acronyms I am not interested in.
> >
> > Does anybody know if there is any available? Thanks.
>
> Pete Whitelock wrote:
>Why not just build your own? Presumably you are interested only
>in those which end in full stop. Go through a corpus and make a list
>of all strings followed by full stop.
[snip...]
>You have to be slightly careful cos some corpora don't use full stop
>on any abbreviations.

That is right: I looked for abbreviations in an encyclopedia, and
one third of all full stops were at sentence ends. Recognition of
abbreviations was easiest with a combination of rules:
- A lowercase letter after the full stop almost always means an
  abbreviation. This found 70% of all abbreviations! The only
  exceptions are words like "pnp-transistor" at the beginning of the
  next sentence.
- Looking at the last 4 characters (in German) or the last 3
  characters (in English; there was a paper on this by Brustkern(?)
  in the eighties) can distinguish between "words" and "garbage".
  The garbage is the rest of the abbreviations.
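
The first rule can be sketched in a few lines of Python. This is only
an illustration under my own assumptions, not the program I used: the
function name abbreviation_candidates, the whitespace tokenisation and
the sample text are all made up for the example.

```python
from collections import Counter

# Made-up sample text (an assumption for illustration, not corpus data).
SAMPLE = (
    "He works for Acme Ltd. in London. The firm makes e.g. valves "
    "and pumps. pnp-transistor circuits are cheap."
)


def abbreviation_candidates(text):
    """Count tokens that end in a full stop and are followed by a
    token starting with a lowercase letter.

    Hypothetical helper: tokens are simply split on whitespace, so
    punctuation such as brackets stays attached to the token.
    """
    tokens = text.split()
    counts = Counter()
    # Pair every token with the token that follows it.
    for token, following in zip(tokens, tokens[1:]):
        if token.endswith(".") and following[0].islower():
            counts[token] += 1
    return counts


if __name__ == "__main__":
    for token, n in sorted(abbreviation_candidates(SAMPLE).items()):
        print(token, n)
```

On the sample text this reports "Ltd." and "e.g.", but also "pumps.",
because the next sentence starts with "pnp-transistor". That is exactly
the kind of exception mentioned in the first rule, so the output is a
list of candidates to check, not a finished abbreviation list.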

Good luck

Hellfried Sabathy