Re: Corpora: ngram frequencies with intervening words?

From: Bruce Lambert (lambertb@uic.edu)
Date: Tue Apr 24 2001 - 17:41:19 MET DST


    Thanks to Lee Gilliam. Thanks also to Philip Resnik and Ted Pedersen, both
    of whom pointed to Ted's bigram software:

    http://www.d.umn.edu/~tpederse/code.html

    Jens Enlund was kind enough to write his own Perl script to do the job. Not
    exactly what I need, but darn close.

    -bruce

    --------------------

    #!/usr/bin/perl -w

    use strict;

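    # Get the two words and the maximum number of intervening words.
    # Test with defined() rather than ||, so that N = 0 (adjacent words
    # only) is accepted instead of being reported as a missing argument.
    #
    my $w1 = shift @ARGV; defined $w1 or die "Missing argument: WORD1\n";
    my $w2 = shift @ARGV; defined $w2 or die "Missing argument: WORD2\n";
    my $n  = shift @ARGV; defined $n  or die "Missing argument: N\n";
    $n =~ /^\d+$/ or die "N must be a non-negative integer\n";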

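    # Globals: count per intervening-word string, and total matches
    # (the total starts at 0 so that -w stays quiet when nothing matches)
    #
    my %res;
    my $tot = 0;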

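    # Read the input line by line (any files named after N, else STDIN)
    #
    while (<>) {
      # Match WORD1, then at most N intervening words (as few as possible),
      # then WORD2. Both words are interpolated as regular expressions.
      # Each match is deleted from the line so that the loop moves on to
      # the next occurrence; matches therefore never overlap.
      while (s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
         # Drop the trailing blank from the intervening words
         my $tmp = $2;
         chop $tmp;
         # Count up intervening words (if any) and total
         $res{$tmp}++ if $tmp ne "";
         $tot++;
      }
    }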

    # Print results on one line: the word pair, the total, then each
    # intervening-word string with its count, least frequent first.
    # join() avoids the extra blank before the newline.
    #
    print join(" ", "$w1 $w2 ($tot)",
               map { "($res{$_} \"$_\")" }
               sort { $res{$a} <=> $res{$b} } keys %res), "\n";

    exit;
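
    To see what the pattern in the script above actually counts, here is a
    minimal sketch that wraps the same substitution loop in a subroutine and
    runs it on one made-up sentence. The subroutine name count_gaps and the
    sample sentence are assumptions for illustration only; they are not part
    of the posted script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the posted script): count how often $w1 is
# followed by $w2 with at most $n intervening words, and tally the
# intervening-word strings.
sub count_gaps {
    my ($w1, $w2, $n, @lines) = @_;    # @lines is a copy, safe to modify
    my %gaps;
    my $total = 0;
    for my $line (@lines) {
        # Same pattern as the posted script; each match is deleted so the
        # loop advances to the next occurrence.
        while ($line =~ s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
            my $gap = $2;
            chop $gap;                 # drop the trailing blank
            $gaps{$gap}++ if $gap ne "";
            $total++;
        }
    }
    return ($total, \%gaps);
}

# Made-up sample sentence
my $sample = "the big dog and the dog saw the very big old dog";

for my $n (2, 3) {
    my ($total, $gaps) = count_gaps("the", "dog", $n, $sample);
    print "n=$n total=$total";
    print " ($gaps->{$_} \"$_\")" for sort keys %$gaps;
    print "\n";
}
# prints:
#   n=2 total=2 (1 "big")
#   n=3 total=3 (1 "big") (1 "very big old")
```

    With N = 2 the last "the ... dog" is not counted because three words
    intervene; raising N to 3 picks it up. The adjacent pair "the dog"
    adds to the total but, having no intervening words, adds nothing to
    the per-string counts.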


