Re: [Corpora-List] Frequency list of transformations

From: radev@umich.edu
Date: Fri Jan 21 2005 - 19:03:21 MET


    This is a bit tricky. There is no straightforward way to tell, by
    looking at a single pair like "occurence" and "occurrence", that the
    "rr" in the latter corresponds to the single "r" in the former.
    You should probably have a prior model of the types of
    substitutions that are likely (e.g., doubling letters in this case).
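
    As a rough illustration of such a prior model, here is a minimal
    sketch in Python (standard library only) that checks for just one
    kind of error: a letter that was doubled or undoubled. The function
    names runs and doubling_changes are made up for this example, and
    any pair that differs in some other way is simply reported as None.

```python
from itertools import groupby


def runs(word):
    """Run-length encode a word: 'occurrence' -> [('o', 1), ('c', 2), ...]."""
    return [(ch, len(list(group))) for ch, group in groupby(word)]


def doubling_changes(wrong, correct):
    """Return [(wrong_run, correct_run), ...] if the two words differ only
    in how often a letter is repeated; return None otherwise."""
    rw, rc = runs(wrong), runs(correct)
    if [ch for ch, _ in rw] != [ch for ch, _ in rc]:
        return None  # the words differ by more than letter doubling
    return [(ch * n, ch * m) for (ch, n), (_, m) in zip(rw, rc) if n != m]


print(doubling_changes("occurence", "occurrence"))  # [('r', 'rr')]
print(doubling_changes("commputer", "computer"))    # [('mm', 'm')]
print(doubling_changes("live", "life"))             # None
```

    Pairs for which this returns None would still need a more general
    method.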

    A quick solution may involve using the standard diff algorithm.

    Here is what I was able to put together in 10 minutes for you. This
    code is in Perl and it uses a module (Algorithm::Diff) that you can
    download from CPAN.

    ---------------------- mydiff.pl ------------------------
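    #!/usr/local/bin/perl
    # Print the character-level edits that turn the first argument
    # into the second, one hunk per line.

    use strict;
    use warnings;
    use Algorithm::Diff qw(diff);

    my @i1 = split //, shift;
    my @i2 = split //, shift;

    my $diffs = diff(\@i1, \@i2);
    foreach my $hunk (@$diffs) {
        foreach my $change (@$hunk) {
            # Each change is [sign, position, character].
            my ($sign, $pos, $char) = @$change;
            # print, not printf: a "%" in the data is not a format
            print "$sign$char ";
        }
        print "\n";
    }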
    ---------------------------------------------------------
    ./mydiff.pl "heavie" "heavy"
    -i +y -e
    ---------------------------------------------------------

    Then you can run it on every line of the file (word pairs only,
    without the header lines) with a shell one-liner:

    ----------------------------------------------------------
    perl -ne 'print "./mydiff.pl $_"' file | sh > output
    ----------------------------------------------------------

    Here is the output:

    +r
    -o +a
    -m
    -v +f
    -i +y -e
    +r
    -v +f

    You can further pipe it to
    ----------------------------------------------------
    sort output | uniq -c | sort -nr | more
    ----------------------------------------------------

    This will give you all substitutions in decreasing frequency:

    ----------------------------------------------------
          2 -v +f
          2 +r
          1 -o +a
          1 -m
          1 -i +y -e
    ----------------------------------------------------
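
    If you would rather have the counts in the "wrong -> correct" form
    from your example, here is a rough sketch in Python (not the Perl
    above) that uses the standard difflib module in place of
    Algorithm::Diff. The function name transformations and the inline
    pairs list are only for illustration. Note that it has no notion of
    doubling, so a missing letter comes out as "() -> r" rather than
    "r -> rr".

```python
from collections import Counter
from difflib import SequenceMatcher


def transformations(wrong, correct):
    """Yield (wrong_fragment, correct_fragment) for each differing span."""
    matcher = SequenceMatcher(None, wrong, correct, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield wrong[i1:i2], correct[j1:j2]


# The pairs from the original question (header lines left out).
pairs = [
    ("occurence", "occurrence"),
    ("occosion", "occasion"),
    ("commputer", "computer"),
    ("live", "life"),
    ("heavie", "heavy"),
    ("geat", "great"),
    ("save", "safe"),
]

counts = Counter(t for w, c in pairs for t in transformations(w, c))

# Most frequent first; an empty side is shown as "()".
for (w, c), n in counts.most_common():
    print(f"{n} {w or '()'} -> {c or '()'}")
```

    Unlike the Perl script, this groups adjacent changes into a single
    replacement, so "heavie"/"heavy" is counted as "ie -> y" rather
    than "-i +y -e".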

    Drago

    Marijke Koster wrote:
    >
    > Dear corpora list members,
    >
    > Does anyone have a suggestion for a simple method / a script to extract
    > a frequency list of transformations from a list of spelling errors and
    > corrections?
    >
    > For example here's this tab separated list:
    >
    > wrong correct
    > ----- -------
    > occurence occurrence
    > occosion occasion
    > commputer computer
    > live life
    > heavie heavy
    > geat great
    > save safe
    >
    > After applying the method it should result in something like this
    > 1 rr -> r
    > 1 a -> o
    > 1 m -> mm
    > 2 f -> v
    > 1 y -> ie
    > 1 r -> ()
    >
    > Thanks in advance,


