Corpora: Summary: Fast edit distance algorithms

From: Mark Lewellen (mlewellen@apptek.com)
Date: Thu Jan 13 2000 - 23:09:04 MET

    Hello:

      Following is a summary of responses to my query about edit
    distance algorithms. I sincerely apologize for the delayed
    response. The original query was:

    > Can anyone point me to relatively recent work (roughly
    > the last five years) on fast methods of computing edit distance?
    > I'm interested in algorithms that compare pairs of individual words
    > (not algorithms that search long text strings for approximate matches
    > of words). Opinions on fastest methods welcome also. I'll post a
    > summary of responses to the list.

    Many thanks to those who responded:

    Kemal Oflazer
    Bruce Lambert
    Martin Kay
    Gerardo Sierra
    Moshe Lewenstein
    Paul Bowden
    Gertjan van Noord
    George Demetriou

    Overall summary: A variety of very useful advice on string
    matching was sent. Interestingly, though, none of the responses
    pointed to recent research focused specifically on speeding up
    edit-distance algorithms. Related research includes K. Oflazer's
    work on finite-state methods and Amir et al.'s work on swaps. Two
    books in the area were written in 1994, by Stephen and by Aoe.
    Some people wrote that edit-distance algorithms are simply too
    slow and suggested alternative algorithms instead. It makes one
    wonder whether there is any ongoing research out there on speeding
    up edit-distance algorithms! Since edit-distance algorithms use
    dynamic programming, one suggestion was to use a fast
    dynamic-programming method (depth-first search with caching).
    Finally, a code snippet was generously provided.

    Mark

    Details:

    ----------------------------------------------------
    References that were suggested are:

    Articles

    1) Oflazer, Kemal, 1996. "Error-tolerant Finite-state Recognition with
    Applications to Morphological Analysis and Spelling Correction".
    Computational Linguistics 22(1):73-89.
    Available at http://www.cs.bilkent.edu.tr/~ko/ko.html.

    This paper describes a very fast method for finding the strings of
    a regular language (described by a finite-state machine) that are
    close to a given input string.
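
    As a rough illustration of the threshold idea that error-tolerant
    methods like this build on (this is not the finite-state traversal
    itself, and the function name and MAXW limit are just placeholders
    for this sketch): if every cell in a row of the dynamic-programming
    table already exceeds a bound t, the final distance must exceed t
    too, so the computation can be abandoned early.

    #include <string.h>

    #define MAXW 64   /* assumed maximum word length for this sketch */

    /* Returns the edit distance of a and b if it is at most t, and t+1
       otherwise.  The table is filled row by row; once every cell in
       the current row exceeds t, the computation is abandoned. */
    int bounded_edit_distance(const char *a, const char *b, int t) {
      int prev[MAXW + 1], cur[MAXW + 1];
      int la = strlen(a), lb = strlen(b), i, j, best;

      for (i = 0; i <= la; i++)
        prev[i] = i;

      for (j = 1; j <= lb; j++) {
        cur[0] = j;
        best = cur[0];
        for (i = 1; i <= la; i++) {
          int del = prev[i] + 1;
          int ins = cur[i - 1] + 1;
          int sub = prev[i - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
          cur[i] = del < ins ? (del < sub ? del : sub)
                             : (ins < sub ? ins : sub);
          if (cur[i] < best)
            best = cur[i];
        }
        if (best > t)
          return t + 1;             /* every completion exceeds t */
        memcpy(prev, cur, (la + 1) * sizeof(int));
      }
      return prev[la] <= t ? prev[la] : t + 1;
    }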

    2) Amir, Amihood, Yonatan Aumann, Gad M. Landau, Moshe Lewenstein
    and Noa Lewenstein, 1997. "Pattern Matching with Swaps". Proceedings,
    38th Annual IEEE Conference on Foundations of Computer Science (FOCS),
    Miami Beach, Florida, pp. 148-153.

    The algorithms I have worked on considered swaps as an isolated
    error. In other words, I considered the problem of string matching
    with swaps: given a pattern P, find all locations where P appears
    in a text T, even if swaps have occurred in T.
    In theoretical terms, the best known algorithms (that I know of)
    for edit distance with swaps still run in O(nm) time, where n is
    the text size and m the pattern size. There is an algorithm of
    Landau and Vishkin that runs in O(nk) time, where k bounds the
    number of errors.
    Our original algorithm for string matching with swaps (not edit
    distance) runs in O(nm^{1/3} \log m) time, and you are welcome to
    take it off my homepage at http://www.cs.biu.ac.il/~moshe
    I assume that this is not the problem you are really interested in.
    However, if you are, we will soon have a new paper improving on
    the previous results.
    [The website contains other interesting articles about pattern
    matching in text, as well.]
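
    For illustration only, here is the naive check for that problem in
    C (the obvious O(nm) approach, not the algorithm from the paper;
    the names and the assumption that swaps are disjoint are mine):

    #include <stdio.h>
    #include <string.h>

    /* Does pattern p match text t at position pos if disjoint adjacent
       characters of the text may have been swapped?  Naive check. */
    static int matches_with_swaps(const char *t, int pos, const char *p) {
      int m = strlen(p), i = 0;
      while (i < m) {
        if (t[pos + i] == p[i]) {
          i += 1;                     /* direct match */
        } else if (i + 1 < m && t[pos + i] == p[i + 1]
                             && t[pos + i + 1] == p[i]) {
          i += 2;                     /* this pair is swapped in the text */
        } else {
          return 0;
        }
      }
      return 1;
    }

    int main(void) {
      const char *t = "abacbab", *p = "abcb";   /* made-up example data */
      int n = strlen(t), m = strlen(p), pos;
      for (pos = 0; pos + m <= n; pos++)
        if (matches_with_swaps(t, pos, p))
          printf("match at position %d\n", pos);   /* reports position 1 */
      return 0;
    }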

    Books

    1) Stephen, Graham A., 1994. String Searching Algorithms. World Scientific
    Publishing, Singapore.
    See http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=9810237030

    2) Aoe, Jun-Ichi, 1994. Computer Algorithms: String Pattern-Matching
    Strategies. IEEE Computer Society.
    See
    http://www.amazon.com/exec/obidos/ASIN/0818654627/o/qid=947796598/sr=8-3/102-8070569-5159269

    Projects

    UMIST has just developed an algorithm to identify semantic clusters
    through the alignment of definitions using edit distance. It
    compares two definitions and matches pairs of words that can be
    substituted for each other in a particular context without changing
    the meaning. The clusters are used to expand queries in a
    dictionary we also designed, which lets the user search for a
    suitable term when they know the concept but cannot remember how it
    is correctly designated.
    http://www.mcsierra.freeserve.co.uk/penpict.html [may be an old link]

    ---------------------------------
    Other approaches/references:

    (1) Edit distance is inherently (algorithmically) complex, so it
    will never be really FAST. I've optimized my edit-distance code (in
    Lisp) as much as I can, and it's still not fast. If you really need
    speed, the best thing to do is NOT to use edit distance but to use
    n-gram methods instead; they can be made really fast with special
    indexing.
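
    One such n-gram method, sketched minimally in C below (character
    bigrams scored with the Dice coefficient); the special indexing
    that makes it fast over a whole lexicon, typically an inverted
    index from bigrams to words, is not shown:

    #include <stdio.h>
    #include <string.h>

    #define MAXW 64   /* words assumed shorter than MAXW for this sketch */

    /* Similarity of two words as the Dice coefficient over character
       bigrams: 2 * shared / (bigrams in a + bigrams in b). */
    double bigram_dice(const char *a, const char *b) {
      int la = strlen(a), lb = strlen(b);
      int na = la - 1, nb = lb - 1;   /* bigram counts */
      int used[MAXW] = {0};           /* bigrams of b already matched */
      int shared = 0, i, j;

      if (na < 1 || nb < 1)
        return strcmp(a, b) == 0 ? 1.0 : 0.0;

      for (i = 0; i < na; i++) {
        for (j = 0; j < nb; j++) {
          if (!used[j] && a[i] == b[j] && a[i + 1] == b[j + 1]) {
            used[j] = 1;              /* count each bigram of b once */
            shared++;
            break;
          }
        }
      }
      return 2.0 * shared / (na + nb);
    }

    int main(void) {
      printf("%.2f\n", bigram_dice("night", "nacht"));   /* prints 0.25 */
      return 0;
    }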

    (2) I use something called the Ratcliff-Obershelp algorithm, which returns a
    percentage similarity between any two words.

    Unfortunately, I can't give you a reference to it, as I learned of it many
    years ago in "Computing", a British trade newspaper. But I coded it up in
    'C' and it works a treat! (Perhaps someone could give ME a reference to
    this algorithm?) [from Paul Bowden, pmr@doc.ntu.ac.uk]
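
    For what it's worth, the algorithm as usually described finds the
    longest common substring of the two words, recurses on the
    unmatched pieces to either side, and reports twice the number of
    matched characters divided by the combined length. A minimal C
    sketch along those lines (my reconstruction from that description,
    not Paul's code):

    #include <stdio.h>
    #include <string.h>

    /* Characters matched by the Ratcliff/Obershelp scheme: find the
       longest common substring, then recurse on the pieces to its left
       and to its right in both strings. */
    static int ro_matches(const char *a, int la, const char *b, int lb) {
      int best = 0, bi = 0, bj = 0, i, j, k;

      /* longest common substring, by brute force (fine for single words) */
      for (i = 0; i < la; i++)
        for (j = 0; j < lb; j++) {
          for (k = 0; i + k < la && j + k < lb && a[i + k] == b[j + k]; k++)
            ;
          if (k > best) { best = k; bi = i; bj = j; }
        }

      if (best == 0)
        return 0;
      return best
           + ro_matches(a, bi, b, bj)
           + ro_matches(a + bi + best, la - bi - best,
                        b + bj + best, lb - bj - best);
    }

    /* Similarity as a fraction: 2 * matched / (combined length). */
    double ratcliff_obershelp(const char *a, const char *b) {
      int la = strlen(a), lb = strlen(b);
      return la + lb == 0 ? 1.0 : 2.0 * ro_matches(a, la, b, lb) / (la + lb);
    }

    int main(void) {
      /* the example usually quoted for this algorithm, about 0.67 */
      printf("%.2f\n", ratcliff_obershelp("pennsylvania", "pencilvaneya"));
      return 0;
    }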

    (3) You may want to have a look at agrep's algorithm for approximate
    pattern matching. It is a Boyer-Moore type of algorithm with some
    changes that, according to its developers, speed up the search
    considerably (e-mail sw@cs.arizona.edu or udi@cs.arizona.edu for
    more details).

    You may also find it useful to look at DP-like algorithms in the
    speech recognition area (alignment of phonetic sequences with
    phonetic dictionary entries) as well as the bioinformatics area
    (especially if you need efficient searching for common
    subsequences).
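
    For the common-subsequence part, the standard dynamic program has
    the same shape as edit distance; a minimal C sketch of
    longest-common-subsequence length, purely for illustration:

    #include <stdio.h>
    #include <string.h>

    #define MAXW 64   /* words assumed shorter than MAXW for this sketch */

    /* Length of the longest common subsequence of a and b: extend on a
       match, otherwise drop a character from either word. */
    int lcs_length(const char *a, const char *b) {
      int la = strlen(a), lb = strlen(b), i, j;
      int d[MAXW + 1][MAXW + 1];

      for (i = 0; i <= la; i++) d[i][0] = 0;
      for (j = 0; j <= lb; j++) d[0][j] = 0;

      for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++)
          if (a[i - 1] == b[j - 1])
            d[i][j] = d[i - 1][j - 1] + 1;
          else
            d[i][j] = d[i - 1][j] > d[i][j - 1] ? d[i - 1][j] : d[i][j - 1];

      return d[la][lb];
    }

    int main(void) {
      printf("%d\n", lcs_length("phoneme", "morpheme"));  /* "pheme" -> 5 */
      return 0;
    }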

    --------------------------------------------------
    Dynamic Programming

    Surely this is as near as you can get to a classic case of dynamic
    programming, and it would therefore be quite surprising if you could
    do better than Floyd's algorithm. Depth-first search is pessimal, but
    caching makes it near to optimal!
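
    One way to read that last remark: plain depth-first recursion on
    the edit-distance recurrence is exponential, but caching
    (memoising) the (i, j) subproblems brings it back to the usual
    O(nm) table. A minimal C sketch of that idea, assuming words
    shorter than MAXW:

    #include <stdio.h>
    #include <string.h>

    #define MAXW 64   /* words assumed shorter than MAXW for this sketch */

    static const char *A, *B;
    static int memo[MAXW + 1][MAXW + 1];   /* -1 means "not computed yet" */

    /* Edit distance of the prefixes A[0..i) and B[0..j): depth-first
       recursion over the usual recurrence, with caching of subproblems. */
    static int dist(int i, int j) {
      int del, ins, sub, best;
      if (i == 0) return j;
      if (j == 0) return i;
      if (memo[i][j] >= 0) return memo[i][j];

      del = dist(i - 1, j) + 1;
      ins = dist(i, j - 1) + 1;
      sub = dist(i - 1, j - 1) + (A[i - 1] == B[j - 1] ? 0 : 1);
      best = del < ins ? del : ins;
      if (sub < best) best = sub;
      return memo[i][j] = best;
    }

    int edit_distance(const char *a, const char *b) {
      A = a;
      B = b;
      memset(memo, -1, sizeof memo);
      return dist(strlen(a), strlen(b));
    }

    int main(void) {
      printf("%d\n", edit_distance("kitten", "sitting"));   /* prints 3 */
      return 0;
    }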

    ---------------------------------------------------
    Code

    And finally, some helpful code from Gertjan van Noord:

    Below is a C function that calculates the Levenshtein distance
    between two given strings. I don't know whether this is fast enough;
    it is about ten times faster than the equivalent Perl version.

    #include <string.h>

    #define MAXLINE 100

    /* Levenshtein distance between strings a and b (each assumed shorter
       than MAXLINE), computed with the standard dynamic-programming table. */
    int lev(char a[], char b[]) {
      int arr[MAXLINE][MAXLINE];
      int i, j, l, m, n, add;
      int la = strlen(a), lb = strlen(b);

      /* first row and column: distance from the empty prefix */
      for (i = 0; i <= la; i++) {
        arr[0][i] = i;
      }
      for (j = 0; j <= lb; j++) {
        arr[j][0] = j;
      }

      /* each cell is the cheapest of insertion, deletion, and
         substitution (a free copy when the characters match) */
      for (j = 1; j <= lb; j++) {
        for (i = 1; i <= la; i++) {
          add = (a[i-1] == b[j-1]) ? 0 : 1;
          m = 1 + arr[j-1][i];        /* insert b[j-1]      */
          l = 1 + arr[j][i-1];        /* delete a[i-1]      */
          n = add + arr[j-1][i-1];    /* substitute or copy */
          arr[j][i] = (m < l ? (m < n ? m : n) : (l < n ? l : n));
        }
      }

      return arr[lb][la];
    }
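
    A small driver showing how the function might be exercised (my
    addition; the expected distance for this pair is 2, one
    substitution plus one deletion):

    #include <stdio.h>

    int lev(char a[], char b[]);   /* the function above */

    int main(void) {
      char a[] = "levenshtein", b[] = "lewenstein";
      printf("%d\n", lev(a, b));   /* prints 2 */
      return 0;
    }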


