Re: Corpora: seeking semantic distance tool

Rob Freeman (rjfreeman@email.com)
Sun, 26 Sep 1999 23:45:17 +0800

Doug Cooper wrote:

> Can anybody point to a black box...
>
> The problem arises in text segmentation and grouping --
> ...alternative partitions of
> Thai words (which are normally not segmented, as in top-end /
> to-pend)...
>
> Perl code working from WordNet data, or some publicly available
> thesaurus, would be ideal.

Hi Doug,

Not the solution to your question, but perhaps a solution to your
problem.

I have a "black box" which associates compound words (or compounds of
words) according to groups which tend to be interchangeable in context.
It might be useful if I understand your task correctly, namely as a
search for the correct association of Thai "syllables" in a given
context. I'm not sure whether you could modify it to give useful results
when the actual partition into syllables is ambiguous.
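
To make the ambiguity concrete, here is a minimal sketch of the "top-end /
to-pend" case from Doug's message. It is not part of the black box and is
written in Python rather than Perl; the function name `segmentations` and
the four-entry toy lexicon are assumptions made up for illustration. It
only lists the alternative partitions of an unsegmented string; choosing
between them is the part that needs context.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon entries."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in lexicon:
            # Keep this prefix and split whatever is left over.
            for rest in segmentations(text[i:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon, invented for this example only.
toy_lexicon = {"to", "top", "end", "pend"}

for split in segmentations("topend", toy_lexicon):
    print("-".join(split))
```

With this toy lexicon both "to-pend" and "top-end" come out as candidates,
which is exactly the situation in which a context-based score is needed.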

Anyway, you are welcome to try it for non-commercial apps.

Code is in Perl. Requires no data other than text. Associations should
be completely general, i.e. generalize to combinations of words which do
not occur in your original data. It does require a lot of RAM and grunt
to actually calculate the data tables, though, and is currently very
slow to run.

Here are some examples of what it can do:

make some products
Parsed: (make (some products)), score: 1.329954
Parsed: ((make some) products), score: 0.023665

make some money
Parsed: (make (some money)), score: 1.555408
Parsed: ((make some) money), score: 0.042059

make a car
Parsed: (make (a car)), score: 5.689303
Parsed: ((make a) car), score: 2.120204

make another car
Parsed: (make (another car)), score: 1.642482
Parsed: ((make another) car), score: 0.189554

make another try
Parsed: ((make another) try), score: 0.051537
Parsed: (make (another try)), score: 0.039471

go with the president
Parsed: ((go with) (the president)), score: 7.983729
Parsed: (go (with (the president))), score: 4.620297
Parsed: (go ((with the) president)), score: 0.771305
Parsed: (((go with) the) president), score: 0.318181
Parsed: ((go (with the)) president), score: 0.065606

I try to go
Parsed: (i ((try to) go)), score: 4.343059
Parsed: (((i try) to) go), score: 1.297454
Parsed: ((i (try to)) go), score: 1.174891
Parsed: (i (try (to go))), score: 0.553270
Parsed: ((i try) (to go)), score: 0.474397

the election results
Parsed: (the (election results)), score: 89.247596
Parsed: ((the election) results), score: 15.212562

go with the other team
Parsed: ((go with) (the (other team))), score: 4.108766
Parsed: (go (with (the (other team)))), score: 1.710860
Parsed: ((go with) ((the other) team)), score: 0.630559
Parsed: (((go with) the) (other team)), score: 0.216543
Parsed: ((go (with (the other))) team), score: 0.125036
Parsed: (go ((with the) (other team))), score: 0.097901
Parsed: (go (with ((the other) team))), score: 0.092966
Parsed: (((go with) (the other)) team), score: 0.086138
Parsed: ((go (with the)) (other team)), score: 0.059295
Parsed: (go ((with (the other)) team)), score: 0.010613
Parsed: ((((go with) the) other) team), score: 0.002281
Parsed: (((go (with the)) other) team), score: 0.002225
Parsed: ((go ((with the) other)) team), score: 0.000261
Parsed: (go (((with the) other) team)), score: 0.000119

they held an election
Parsed: (they (held (an election))), score: 0.000238
Parsed: ((they held) (an election)), score: 0.000007
Parsed: (((they held) an) election), score: 0.000000

go with her
Parsed: ((go with) her), score: 9.073902
Parsed: (go (with her)), score: 0.107435
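
The listings above read as ranked binary bracketings of each input: three
words give 2 lines, four words 5, and five words 14, which are the counts
of possible binary trees over that many words (the "they held an election"
listing shows only three of its five). As a hedged sketch of that
enumeration step only, here it is in Python rather than Perl. The names
`bracketings`, `show`, `ranked_parses` and `toy_score` are hypothetical,
and `toy_score` is a stand-in: how the real scores are computed is not
described in this message.

```python
def bracketings(words):
    """Return every binary bracketing of `words` as nested 2-tuples."""
    if len(words) == 1:
        return [words[0]]
    trees = []
    for split in range(1, len(words)):
        # Combine each bracketing of the left part with each of the right.
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                trees.append((left, right))
    return trees


def show(tree):
    """Render a tree in the '(a (b c))' style used in the listings."""
    if isinstance(tree, str):
        return tree
    return "(" + show(tree[0]) + " " + show(tree[1]) + ")"


def toy_score(tree):
    # Stand-in only, NOT the scoring used by the tool in this post.
    # It simply prefers bracketings with a small left branch.
    if isinstance(tree, str):
        return 1.0
    return 1.0 / (1 + show(tree[0]).count(" "))


def ranked_parses(sentence, score):
    """Score every bracketing of `sentence` and sort best-first."""
    words = sentence.lower().split()
    scored = [(score(tree), show(tree)) for tree in bracketings(words)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


for value, parse in ranked_parses("make some products", toy_score):
    print("Parsed: %s, score: %f" % (parse, value))
```

Any function from tree to number can be passed as `score`, so a
context-based measure could be dropped in without touching the
enumeration.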