RE: Corpora: approximations (bounds) for edit distance

From: John Goldsmith (ja-goldsmith@uchicago.edu)
Date: Mon Dec 03 2001 - 05:13:35 MET

  • Next message: Gerald Nelson: "Corpora: Web Concordance of Asian Newspaper English"

    Two comments:
    1. An excellent reference from the bioinformatics people is:

    Algorithms on Strings, Trees, and Sequences: Computer Science and
    Computational Biology by Dan Gusfield

    2. Derrick Higgins and I have used a bit-map approach to speed up
    morpheme comparison: in practice, you use a 32, 48, or 64 bit field, and
    set the first bit to 1 iff there is an a in the word, the second bit to
    1 if there is a b in the word, etc. (Size of alphabet depends on your
    language.) You then just perform a bit-wise comparison of two words (two
    computer words, not linguistic words!). Bit-wise comparison is extremely
    fast of course, and you can determine that there is insufficient overlap
    between two morphemes very quickly in the vast majority of cases. Then,
    if you do find a pair of morphemes whose letters overlap enough to merit
    checking them in the usual linear-programming fashion, you can do so --
    but the vast majority of the hopeless pairs have been eliminated in a
    much more rapid manner.

    John A. Goldsmith
    Department of Linguistics, The University of Chicago
    ja-goldsmith@uchicago.edu
    http://humanities.uchicago.edu/faculty/goldsmith

    >My other guess is that people working on DNA sequence matching might
    have
    >worked on this problem. That's all I could contribute - sorry.

    >Best,

    >Robert Luk
     



    This archive was generated by hypermail 2b29 : Mon Dec 03 2001 - 05:16:40 MET