Corpora: seeking semantic distance tool

Doug Cooper (doug@th.net)
Thu, 23 Sep 1999 12:38:34 +0700

Can anybody point to a black box that takes as input two small
sets of strings (e.g., two sentences), and returns a number that
gives a relative likelihood that they deal with the same subject?

The problem arises in text segmentation and grouping. Thai is
normally written without spaces between words, so a string can
often be split in more than one way (as in top-end / to-pend). The
sentences are the English definitions of the words in each
alternative partition. Correct partitions are more likely to be
related to each other (or to a neighboring word) than incorrect ones.
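
To make the request concrete, here is a minimal sketch (in Python
rather than Perl) of the kind of black box meant above. It is only an
illustration under loud assumptions: plain word overlap (Jaccard)
stands in for a real WordNet- or thesaurus-based measure, and the
function names, the tiny stop list, and the sample glosses are all
made up for the example.

```python
import re

# Tiny, hypothetical stop list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "or", "and", "for", "on", "with", "is"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def relatedness(gloss_a, gloss_b):
    """Jaccard overlap of content words: 0.0 (disjoint) to 1.0 (same set).

    This is a stand-in for a proper semantic distance measure.
    """
    a, b = content_words(gloss_a), content_words(gloss_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_partition(candidates, neighbor_gloss):
    """Return the label of the partition whose glosses are most related
    to each other and to the gloss of a neighboring word.

    `candidates` maps a partition label to the list of English
    definitions of its parts.
    """
    def score(glosses):
        # All pairs within the partition, plus each gloss against the neighbor.
        pairs = [(g, h) for i, g in enumerate(glosses) for h in glosses[i + 1:]]
        pairs += [(g, neighbor_gloss) for g in glosses]
        if not pairs:
            return 0.0
        return sum(relatedness(g, h) for g, h in pairs) / len(pairs)

    return max(candidates, key=lambda label: score(candidates[label]))


# Made-up glosses, for illustration only.
candidates = {
    "top-end": ["the highest part of something", "the final part of something"],
    "to-pend": ["in the direction of", "to remain undecided"],
}
print(best_partition(candidates, "the upper part of a range"))
```

Word overlap will miss related glosses that share no vocabulary, which
is exactly where a WordNet-based replacement for relatedness() would
be expected to help.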

Yes, I know that any number that pops out won't be
particularly meaningful, but it's better than nothing. Note also
that we don't have nice, neat, accurate, one-word English
glosses for the Thai original, so looking for co-occurrence
stats is not an easy alternative.

Perl code working from WordNet data, or some publicly available
thesaurus, would be ideal.

Thanks in advance,
Doug Cooper
__________________________________________________
1425 VP Tower, 21/45 Soi Chawakun
Rangnam Road, Rajthevi, Bangkok, 10400
doug@th.net (662) 246-8946 fax (662) 246-8789

Southeast Asian Software Research Center, Bangkok
http://seasrc.th.net --> SEASRC Web site
http://seasrc.th.net/sealang --> SEALANG Web site