Re: Corpora: approximations (bounds) for edit distance

From: COMP staff (csrluk@comp.polyu.edu.hk)
Date: Tue Dec 04 2001 - 04:20:55 MET

  • Next message: Rosja Mastop: "Corpora: Second CfP: AC2001 - The Thirteenth Amsterdam Colloquium"

    Hi CR,

    > I'd like to give one more clarification:
    >
    > The preprocessing time is not important for me. Preprocess time can be very
    > high, but the amount of information I gather from preprocess should be
    > limited (space). In other words, I want to preprocess the data and keep some
    > limited information (such as frequencies), and use this information (with
    > less space) for edit distance approximations (bounds).

    > In the example I gave before (keeping the frequencies), instead of keeping
    > the whole strings, we just keep the frequencies and compute a lower bound to
    > the edit distance using the frequency information. For example, in a DNA
    > sequence, there is only four letters and the frequency vector will be only 4
    > dimensions. Instead of the whole strings (which could be large), we just
    > keep 4 integers. What I'm looking for is to do more preprocessing and find a
    > better bound.

    My guess is that it might be possible precompile the (var or fixed) context windows
    if you are using
    the same string for approx. matching. The single-char case is a special
    case with a window of 1 character. This sounds like an
    interesting research problem if there are no body who worked on it. Is
    there any interest to collaborate? Are you working for some company or
    are you a faculty/research student of some University?

    Best Regards,

    Robert Luk
    PS: Viterbi is 1 DP algo. There are other DP algos for more general cases.

    > >and sum them up as some kind of bounds. However, this requires O(|x| |y|)
    > >cost in general.
    > >Another less expensive one is to use a fixed size window (say within 4
    > >characters)
    > >to look for these edit distance bounds and the time complexity is O(|x|),
    > >which is
    > >the same if you compute the character occurrences. You can also increase
    > >the
    > >window size to get better and better estimates of the bounds.
    >
    >
    > _________________________________________________________________
    > Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp
    >
    >



    This archive was generated by hypermail 2b29 : Tue Dec 04 2001 - 04:34:32 MET