Re: Corpora: corpora: query

John D. Burger (john@mitre.org)
Fri, 29 May 1998 10:30:00 -0400

Alexander Caskey wrote:

> I'm looking for a list of high-frequency foreign words found in
> English text, e.g. words like "cafe" (with an acute accent over the final e),
> "resume" (acute accent over both e's) and facade (where c == c-cedilla), etc.

Perhaps you could give a more specific description of what you mean by
"foreign words". I'm not sure I'd classify any of your examples as foreign -
"facade" in particular has been used in English for about 400 years, the other
two for nearly 200 years.

If you mean words that occur frequently in English and are sometimes spelled
with diacritics or letters other than A to Z, I'd suggest harvesting a
machine-readable dictionary for such words. For example, Webster's has the
following entry:

fa-cade also fa-c,ade

indicating that facade can be spelled with or without the cedilla in English.
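
As a rough sketch of the harvesting step, assuming the dictionary's headwords
have already been extracted into a list of Unicode strings (the function names
and the input format here are hypothetical, not any particular dictionary's
layout), something like this would collect the spellings that use letters
outside A to Z and group them under their plain form:

```python
import unicodedata

def has_non_ascii_letter(word):
    """True if any alphabetic character falls outside plain A-Z / a-z."""
    return any(ch.isalpha() and not ch.isascii() for ch in word)

def strip_diacritics(word):
    """Map a spelling such as 'façade' to 'facade'.

    Only handles letters that decompose into base letter + combining mark;
    ligatures and letters like 'ø' are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def harvest(headwords):
    """Return {plain spelling: set of diacritic spellings} for a headword list."""
    variants = {}
    for word in headwords:
        if has_non_ascii_letter(word):
            variants.setdefault(strip_diacritics(word), set()).add(word)
    return variants

# Hypothetical headword list, for illustration only.
print(harvest(["facade", "façade", "café", "résumé", "table"]))
```

A real machine-readable dictionary will need its own parsing to get from
entries like the one above to clean headword strings; that part is not shown.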

You could then get some frequency statistics from a corpus, and cut the list
at a reasonable threshold.
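
One way the counting and cutoff might look, as a minimal sketch: it assumes
the corpus is available as a single string and that `variants` maps each plain
spelling to its spellings with diacritics. Both the names and the threshold
value are illustrative assumptions, and the tokenizer is deliberately crude.

```python
import re
import unicodedata
from collections import Counter

def variant_frequencies(text, variants):
    """Count corpus occurrences of each word, summing plain and accented forms."""
    # Compose accents so that 'é' is one character, then split on non-letters.
    normalized = unicodedata.normalize("NFC", text).lower()
    counts = Counter(re.findall(r"[^\W\d_]+", normalized))
    return {
        plain: counts[plain] + sum(counts[w] for w in accented)
        for plain, accented in variants.items()
    }

def cut_at_threshold(totals, min_count):
    """Keep words seen at least min_count times, most frequent first."""
    kept = [w for w, n in totals.items() if n >= min_count]
    return sorted(kept, key=lambda w: (-totals[w], w))

# Hypothetical inputs, for illustration only.
variants = {"cafe": {"café"}, "facade": {"façade"}, "resume": {"résumé"}}
text = "The café has a facade. A façade! Cafe or café, not a resume."
totals = variant_frequencies(text, variants)
print(cut_at_threshold(totals, min_count=2))
```

Note that counting the plain spelling also picks up unrelated homographs (the
verb "resume", for instance), so the totals are an upper bound unless the
corpus is tagged or the accented forms are counted on their own.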

- John Burger
The MITRE Corporation