Re: Corpora: approximations (bounds) for edit distance

From: Bruce L. Lambert (lambertb@uic.edu)
Date: Sat Dec 01 2001 - 05:01:08 MET

  • Next message: COMP staff: "Re: Corpora: approximations (bounds) for edit distance"

    > > Maybe I'm missing something, but the upper bound on edit distance between
    > > two strings is always the length of the longer string, and the lower bound
    > > is always zero (when the strings are identical).
    >
    >I guess it depends on the edit cost of the operations.
    >For most applications, we do expect all edit costs to
    >be larger than or equals to zero and the lb edit
    >distance between any 2 strings is 0. However, the ub
    >of any 2 strings is the length of the longer string
    >is true (I guess) if insert, delete and substitution
    >edit costs are 1 and 0 for correct substitution.
    >If I remember correctly, that's the Levenstein
    >metric but one needs to check the details for
    >correctness. In the applications that I have worked,
    >it is not uncommon to use edit costs are than
    >the Levenstein.

    Hi Robert,

    I agree. I was describing the simplest possible edit distance algorithm,
    with mismatch costs equal to 1 and correct substitutions equal to 0. Edit
    distance gets interesting and powerful when one uses a matrix of
    substitution costs, one cost for each possible
    substitution/insertion/deletion in an alphabet, and when the costs reflect
    some inverse function of similarity or confusability "in the real world".

    The real problem with edit distance is that is slow. What approaches people
    are using to speed up edit distance calculations? One heuristic approach
    that has been tried is to first use a fast approach like ngram similarity,
    then only compute edit distance on pairs whose ngram similarities exceed
    some threshold. Are there other methods?

    -bruce



    This archive was generated by hypermail 2b29 : Sat Dec 01 2001 - 05:13:57 MET