Re: Corpora: edit distance and spell checking

From: Patrick Ruch (patrick.ruch@dim.hcuge.ch)
Date: Mon Dec 03 2001 - 11:57:21 MET


    Hi,

    We are studying an improved batch spell checker that uses the linguistic context; some results have been published. Given the documents we are working on, a preliminary named-entity recognizer was necessary, but we have not reached a conclusion yet.

    -Patrick
      ----- Original Message -----
      From: Kristina Kjellson
      To: CORPORA@HD.UIB.NO
      Sent: Monday, December 03, 2001 11:05 AM
      Subject: SV: Corpora: edit distance and spell checking

      Has anyone successfully used the Perl module String::Approx to spell-check a corpus? Or does anyone have another suggestion? Our aim is to generate a lexicon from the corpus, but because of the topic it contains many frequent spelling mistakes.

      /Kristina Kjellson
      Language engineer
      Nordisk språkteknologi, Norway

      -----Original message-----
      From: Bruce L. Lambert, Ph.D. [mailto:lambertb@uic.edu]
      Sent: 30 November 2001 19:43
      To: CORPORA@HD.UIB.NO
      Subject: Re: Corpora: approximations (bounds) for edit distance

      Maybe I'm missing something, but the upper bound on edit distance between
      two strings is always the length of the longer string, and the lower bound
      is always zero (when the strings are identical).

      -bruce
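
The trivial bounds mentioned above can be checked against a standard dynamic-programming edit distance. The following is a minimal Python sketch, assuming unit costs for insertion, deletion and substitution (Levenshtein distance); the function names `levenshtein` and `trivial_bounds` are illustrative and do not come from the thread. It also uses the difference in string lengths as the lower bound, which is never worse than zero.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def trivial_bounds(a, b):
    """Return (lower, upper) bounds that need no comparison of characters."""
    lower = abs(len(a) - len(b))   # at least this many insertions or deletions
    upper = max(len(a), len(b))    # substitute the overlap, insert or delete the rest
    return lower, upper


if __name__ == "__main__":
    lo, hi = trivial_bounds("kitten", "sitting")
    print(lo, levenshtein("kitten", "sitting"), hi)
```

Both bounds depend only on the string lengths, so they are cheap but usually loose.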

      At 06:43 PM 11/29/01 +0000, Computer Researcher wrote:
    >Hi,
    >
    >Does anyone know good approximations (lower and/or upper bounds) to edit
    >distance? (by using some statistical numbers that can be found by
    >preprocessing of the strings)
    >
    >At preprocessing time we can transform the strings into a set of numbers
    >(e.g., multi-dimensional vectors), and then use these vectors to
    >approximate the edit distance between strings.
    >
    >I found a paper by Hadlock, F. (1988) proposing a "lower bound" based on
    >the frequencies of the letters in the string. Assuming that the alphabet is
    >the same for all strings, all frequency vectors will have the same number of
    >dimensions. He defines a distance metric over these vectors such that
    >this distance (in the vector space) is a lower bound on the actual edit
    >distance.
    >
    >Do you know any other method that can achieve a similar goal?
    >
    >Thanks for your attention,
    >
    >CR
    >
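
The letter-frequency idea in the quoted message can be sketched as follows. This is a minimal Python illustration, assuming unit-cost insertions, deletions and substitutions; it is not claimed to be Hadlock's exact definition, and the name `frequency_lower_bound` is hypothetical. The reasoning is that a substitution can remove at most one surplus letter on each side, while an insertion or deletion can remove at most one surplus letter on one side, so the larger of the two surpluses cannot exceed the edit distance.

```python
from collections import Counter


def frequency_lower_bound(a, b):
    """Lower bound on unit-cost edit distance from letter counts alone."""
    fa, fb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive differences, i.e. the letters
    # that one string has in excess of the other.
    surplus_a = sum((fa - fb).values())
    surplus_b = sum((fb - fa).values())
    return max(surplus_a, surplus_b)


if __name__ == "__main__":
    # The counts can be precomputed once per string and reused for every pair.
    print(frequency_lower_bound("kitten", "sitting"))
    # Anagrams have identical counts, so the bound gives no information here.
    print(frequency_lower_bound("abc", "cba"))
```

Because the bound ignores letter order, it is best suited to filtering: pairs whose bound already exceeds a distance threshold can be discarded without running the full dynamic-programming computation.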




    This archive was generated by hypermail 2b29 : Tue Dec 04 2001 - 12:31:12 MET