RE: Corpora: query

Keith J. Miller (keith@mitre.org)
Thu, 28 May 1998 18:47:18 -0400

I'm not aware of any such list, but I'm sure it would be corpus/domain-specific
in any case. A quick way to generate candidates for your own corpus would be to
throw together a Perl script that keeps a count of any words containing a
character above (decimal) 128 (assuming your text is in ISO-8859-1 [Latin-1] or
some similar encoding), and then to weed the junk out of that list based on
your idea of what "high-frequency" means. Of course, foreign words involve more
than just the accented characters above 128, but that should give you a pretty
good start without much effort.
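
For what it's worth, here is a rough sketch of the sort of script I mean,
assuming Latin-1 text on stdin; the whitespace tokenization is deliberately
naive, and I'd pick the frequency cutoff by eye from the sorted output rather
than hard-code it:

#!/usr/bin/perl
# Rough sketch: count words containing any byte outside 7-bit ASCII,
# assuming the input is ISO-8859-1 (Latin-1) text.
use strict;
use warnings;

my %count;
while (my $line = <>) {
    # Naive tokenization on whitespace; trim leading/trailing punctuation.
    for my $word (split /\s+/, $line) {
        $word =~ s/^[[:punct:]]+//;
        $word =~ s/[[:punct:]]+$//;
        next if $word eq '';
        # Keep only words with at least one character above decimal 127.
        next unless $word =~ /[\x80-\xFF]/;
        $count{$word}++;
    }
}

# Print candidates, most frequent first; weed out junk by hand.
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$word}\t$word\n";
}

Run it over your corpus files (the script name and output file are up to you),
e.g. "perl count_accents.pl corpus.txt > candidates.txt", and then decide from
the top of the list where your high-frequency threshold lies.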

----- Keith J. Miller
millerk@gusun.georgetown.edu
keith@mitre.org

>-----Original Message-----
>From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
>Behalf Of afc@nwnexus.com
>Sent: Thursday, May 28, 1998 5:14 PM
>To: corpora@hd.uib.no
>Subject: Corpora: query
>
>
>I'm looking for a list of high-frequency foreign words found in English
>text, e.g. words like "cafe" (with an acute accent on the final e),
>"resume" (acute accents on both e's), and "facade" (with a cedilla on
>the c).
>
>Does anyone know of such a list? Or pointers to a listing from which
>the list I'm looking for could be extracted?
>
>Many thanks,
>
>Alexander