Re: Corpora: homophones in English

w.peters@dcs.shef.ac.uk
Fri, 27 Nov 1998 14:19:25 GMT

> From owner-corpora@lists.uib.no Fri Nov 27 11:30:57 1998
> Date: Fri, 27 Nov 1998 14:43:42 +0700
> To: corpora@hd.uib.no
> From: Doug Cooper <doug@th.net>
> Subject: Corpora: homophones in English
>
> A question has come up on the Southeast Asian languages list
> (sealang-l) regarding processes that form homophones. The
> argument has been made that because SEA words tend to lose
> syllables, there will be an 'unusual' number of one-syllable
> overlaps.
>
> In quick and dirty terms, some 10-12% of Thai dictionary headword
> forms have two or more entries (these are presumed to reflect
> distinct etymology), and about 12-14% of headword sounds have
> two or more entries. Restricted to a universe of one-syllable
> words, the figures are about 13% (duplicated orthography) and
> 16% (duplicated sounds).
>
> Does anybody have a sense of what the equivalents are for
> English lemmas? For my purposes, all polysemous derivations,
> regardless of POS, are a single entry, while divergent
> etymologies, even if suspect, are probably acceptable as
> multiple entries.
>
> Yes, I know there is lots of slop involved in making
> such estimates. I'm willing to assume that the lexicographic
> methods of the 60's - 80's on both the Thai and English sides
> are more or less equivalent.
>
> Thanks,
> Doug Cooper

I did a quick and dirty count on LDOCE and CELEX, both of which have lemma
numbers that indicate homophones or homonyms.

Regardless of POS, LDOCE has around 34200 lemmas, of which 6600 have more than
one lemma number (19%).
For CELEX, the figures are 46000 and 5700 respectively (12%).
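
For anyone who wants to repeat this kind of count, here is a minimal sketch of
the idea in Python. It assumes the dictionary has already been exported as
(headword, lemma number) pairs, and that the totals above refer to distinct
headword forms. The function name and the toy data are invented for
illustration; this is not the LDOCE or CELEX file format.

```python
from collections import Counter

def homograph_share(entries):
    """Return (distinct headwords, headwords with >1 entry, share).

    `entries` is an iterable of (headword, lemma_number) pairs; this layout
    is an assumption for the sketch, not a real dictionary format.
    """
    counts = Counter(headword for headword, _lemma_number in entries)
    total = len(counts)
    multiple = sum(1 for n in counts.values() if n > 1)
    share = multiple / total if total else 0.0
    return total, multiple, share

# Toy data, invented for illustration only (not taken from LDOCE or CELEX).
toy_entries = [
    ("bank", 1), ("bank", 2),
    ("bear", 1), ("bear", 2),
    ("cat", 1), ("dog", 1), ("tree", 1),
]

total, multiple, share = homograph_share(toy_entries)
print(f"{multiple} of {total} headwords have more than one entry ({share:.0%})")

# The same ratio applied to the rounded totals quoted above.
print(f"LDOCE: {6600 / 34200:.1%}")
print(f"CELEX: {5700 / 46000:.1%}")
```

Since the totals are rounded, the resulting percentages are only approximate.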

CELEX does not represent sense distinctions, so its lemma numbers only
distinguish entries across part-of-speech boundaries.
LDOCE's lemma numbers, in addition to marking POS distinctions, also represent
homonymy within a single POS.

This probably accounts for the rather large difference in percentage.
Hope this helps a bit.

Best wishes,

Wim Peters

==================================================================
NLP group
Institute for Language, Speech and Hearing
Department of Computer Science
University of Sheffield
Regent Court               tel:   00-44-114-2221902
211 Portobello Street      fax:   00-44-114-2221810
Sheffield S1 4DP           email: W.Peters@dcs.shef.ac.uk
==================================================================