RE: Re[2]: Corpora: query

Keith J. Miller (keith@mitre.org)
Fri, 29 May 1998 12:51:04 -0400

I may have been mistaken, but I understood from the post that he was looking
specifically for foreign words _with_ accents. I probably got this idea
because he spelled out the words in lower ASCII, and then specified that he
meant the version with the accent:

-------
I'm looking for a list of high-frequency foreign words found in
English text, e.g. words like "cafe" (with an acute accent over the
final e), "resume" (acute accent over both e's) and facade (where
c == c-cedilla), etc.
-------

So maybe I just made an unwarranted assumption. (But actually, from a more
recent post, it looks like this is what he probably meant.) Another problem,
though, if that is not what he means and he instead wants a list including
words like "resume" (unaccented), is deciding what constitutes a "foreign"
word. I
personally would not consider "kindergarten" to be a foreign word, despite
its obvious German origin, and despite the fact that it's not made up of
analyzable English morphemes. But "festschrift", "verboten", "faux pas" I
would. "resume", no, but "résumé" (with accents) yes. It goes without
saying that at some level most of English is composed of "foreign" words; I
guess it all depends on his intended purpose.
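
For anyone who wants to try both readings of the request, here is a
minimal sketch in Python (rather than the Perl mentioned in the quoted
post below). It assumes the text has already been decoded (for example
from ISO-8859-1) into a composed (NFC) Unicode string, so "a character
over 128" becomes "any non-ASCII character". The function names, the
tokenizing regex, and the sample sentence are all made up for
illustration; multiword items such as "faux pas" are not handled.

```python
import re
import unicodedata
from collections import Counter

# Runs of letters, allowing internal apostrophes or hyphens. This is an
# assumed tokenization, not something specified anywhere in the thread.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def count_accented_words(text):
    """Count lower-cased words containing at least one non-ASCII character."""
    counts = Counter()
    for word in WORD_RE.findall(text):
        if any(ord(ch) > 127 for ch in word):
            counts[word.lower()] += 1
    return counts


def strip_accents(word):
    """Fold e.g. 'résumé' to 'resume' by dropping combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_candidates(text, candidates):
    """Count candidate words whether they appear accented or de-accented."""
    wanted = {strip_accents(c.lower()): c for c in candidates}
    counts = Counter()
    for word in WORD_RE.findall(text):
        key = strip_accents(word.lower())
        if key in wanted:
            counts[wanted[key]] += 1
    return counts


if __name__ == "__main__":
    sample = ("The café near the façade kept my résumé; "
              "the Café was closed. Send a resume.")
    print(count_accented_words(sample))
    print(count_candidates(sample, ["café", "résumé"]))
```

The first function only sees accented spellings. The second takes a
candidate list (from a dictionary, say) and also counts de-accented
occurrences, which means the English verb "resume" gets counted along
with the noun, so the output still needs weeding by hand.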

----- Keith

Keith J. Miller
millerk@gusun.georgetown.edu
keith@mitre.org

>-----Original Message-----
>From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
>Behalf Of Max Schulze
>Sent: Friday, May 29, 1998 11:11 AM
>To: afc@nwnexus.com; corpora@hd.uib.no; bschulze
>Subject: Re[2]: Corpora: query
>
>
> I completely disagree. You would throw out so many foreign words
> consisting of only ASCII characters, such as 'kindergarten',
> 'festschrift', 'verboten', 'faux pas' etc. Additionally, many words
> with diacritical characters occur very often 'de-accented', e.g.
> resume.
>
> A two-step approach could be better: Use on-line dictionaries to
> generate a list of possible candidates (should give a list of foreign
> words -- without frequency), and then use corpora to determine the
> frequencies. Biggest problem will be, of course, the availability of
> suitable on-line dictionaries ...
>
> Max
> ---
> Bruno Maximilian Schulze
> Pagis Indexing Sr. SW Engineer
> ScanSoft, Inc. -- A Xerox Company
> Peabody MA, USA
>
>
>______________________________ Reply Separator _________________________________
>Subject: RE: Corpora: query
>Author: keith@mitre.org (Keith J. Miller) at intergate
>Date: 5/28/98 3:47 PM
>
>
>I'm not aware of any such list, but I'm sure that it would be
>corpus/domain specific in any case. A quick way to generate candidates
>for a list for your own corpus would be to throw together a perl script
>that kept track/count of any words containing any character over
>(decimal) 128 (assuming your text is in ISO-1 [ISO-8859-1] or some
>similar encoding), and then to weed junk out of that list based on your
>idea of what high-frequency means. Of course, there are other things
>besides the accented characters above 128, but that should give you a
>pretty good start without much effort.
>
> ----- Keith J. Miller
> millerk@gusun.georgetown.edu
> keith@mitre.org
>
>>-----Original Message-----
>>From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
>>Behalf Of afc@nwnexus.com
>>Sent: Thursday, May 28, 1998 5:14 PM
>>To: corpora@hd.uib.no
>>Subject: Corpora: query
>>
>>
>>I'm looking for a list of high-frequency foreign words found in English
>>text, e.g. words like "cafe" (with an acute accent over the final e),
>>"resume" (acute accent over both e's) and facade (where c ==
>>c-cedilla), etc.
>>
>>Does anyone know of such a list? Or pointers to a listing from which
>>the list I'm looking for could be extracted?
>>
>>Many thanks,
>>
>>Alexander
>>