RE: Corpora: approximations (bounds) for edit distance

From: John Goldsmith (ja-goldsmith@uchicago.edu)
Date: Mon Dec 03 2001 - 05:13:35 MET

Next message: Gerald Nelson: "Corpora: Web Concordance of Asian Newspaper English"

Previous message: COMP staff: "Re: Corpora: approximations (bounds) for edit distance"
In reply to: COMP staff: "Re: Corpora: approximations (bounds) for edit distance"
Next in thread: Patrick Ruch: "Re: Corpora: approximations (bounds) for edit distance"
Reply: Patrick Ruch: "Re: Corpora: approximations (bounds) for edit distance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Two comments:
1. An excellent reference from the bioinformatics people is:

Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology by Dan Gusfield

2. Derrick Higgins and I have used a bit-map approach to speed up
morpheme comparison: in practice, you use a 32, 48, or 64 bit field, and
set the first bit to 1 iff there is an a in the word, the second bit to
1 if there is a b in the word, etc. (Size of alphabet depends on your
language.) You then just perform a bit-wise comparison of two words (two
computer words, not linguistic words!). Bit-wise comparison is extremely
fast of course, and you can determine that there is insufficient overlap
between two morphemes very quickly in the vast majority of cases. Then,
if you do find a pair of morphemes whose letters overlap enough to merit
checking them in the usual linear-programming fashion, you can do so --
but the vast majority of the hopeless pairs have been eliminated in a
much more rapid manner.

John A. Goldsmith
Department of Linguistics, The University of Chicago
ja-goldsmith@uchicago.edu
http://humanities.uchicago.edu/faculty/goldsmith

>My other guess is that people working on DNA sequence matching might
have
>worked on this problem. That's all I could contribute - sorry.

>Best,

>Robert Luk

Next message: Gerald Nelson: "Corpora: Web Concordance of Asian Newspaper English"
Previous message: COMP staff: "Re: Corpora: approximations (bounds) for edit distance"
In reply to: COMP staff: "Re: Corpora: approximations (bounds) for edit distance"
Next in thread: Patrick Ruch: "Re: Corpora: approximations (bounds) for edit distance"
Reply: Patrick Ruch: "Re: Corpora: approximations (bounds) for edit distance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Mon Dec 03 2001 - 05:16:40 MET