[Corpora-List] Perl efficiency (Re: N-gram string extraction)

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Thu Aug 29 2002 - 11:18:57 MET DST

    At first I thought this would be off-topic and too technical for the
    list, especially now that Andrius has solved his problem. However,
    since a lot of the respondents seem to be using Perl for large data
    manipulation I'd like to share my experiences.

    David Graff wrote:

       On the contrary, using Perl on a large data set can be reasonably
       economical in terms of memory usage if the Perl code is written
       reasonably well, which is likely true in the case of Ted Pedersen's
       package. (Sure, it might take up more active RAM than the equivalent
       program written in C in most cases, and it certainly is possible to
       write Perl code badly, such that it would run out of memory on any
       machine -- the same thing can happen in C, of course...)

    I'm not sure what you understand by "reasonably well", but for the
    case of counting N-grams I would take the best solution to be

    $freq{join("\t", @n_gram)}++;

    where the array @n_gram holds the components of an N-gram token. (Less
    experienced Perl programmers might be tempted to use nested hashes, as
    in $freq{$c1}{$c2}{$c3}{$c4}++; but that isn't easily generalised to
    arbitrary N-grams anyway.) Although Perl's hashes are known to be fast,
    their memory footprint is much larger than that of a comparable C
    program (in my experience, at least the 3:1 ratio one can observe for
    Java against C/C++, e.g. with different implementations of Xalan).
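
    For concreteness, here is a minimal sketch of how this flat-hash idiom
    generalises to arbitrary N with a sliding window of tokens. The
    whitespace tokenisation and the fixed $N are only placeholders for the
    example:

    use strict;

    my $N = 3;                              # N-gram size (placeholder value)
    my (%freq, @window);
    while (my $line = <STDIN>) {
      for my $token (split ' ', $line) {    # naive whitespace tokenisation
        push @window, $token;
        shift @window if @window > $N;      # keep at most the last N tokens
        $freq{ join("\t", @window) }++ if @window == $N;
      }
    }
    print join("\t", $_, $freq{$_}), "\n" for keys %freq;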

    Memory usage is usually the biggest culprit in performance problems
    with large amounts of data. Once the machine runs out of physical RAM
    and has to use the swap partition, performance degrades dramatically,
    sometimes so badly that the program appears to be stuck entirely (we
    have actually had such a case with a hash implementation in C).
    Therefore, reducing memory load even by a constant factor of three or
    so can make an enormous difference.

    Gregor Erbach suggested:

       off the top of my head, I would think that the memory problems arise
       from the large number of n-grams for which counts are held in memory.
       so i don't quite see how c/c++ could improve the situation.

    That's quite right. The number of different N-grams grows more or less
    exponentially with N. For instance, I've found almost 250,000
    different character 6-grams among 1 million characters of German text.

    The main difference between C/C++ and Perl (or Java, for that matter)
    is the amount of overhead. For every N-gram type (i.e. every different
    sequence of N characters) in the corpus, an optimised C implementation
    would typically require N+1 bytes for the N-gram string, 4 bytes for
    the N-gram frequency (as a 32-bit integer) and another 8-20 bytes for
    pointers and indices. Storing the same information in a Perl hash
    requires, for each N-gram, a Perl string as key (which can grow
    dynamically and probably carries some other management information as
    well) and a full-fledged Perl scalar (which is quite a big data
    structure) for the frequency value.

    Even worse, Perl will allocate the necessary memory in small chunks as
    it is needed, whereas the C implementation can store its fixed-size
    data in large tables allocated during startup.
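
    If you want to get a feeling for these numbers on your own system, the
    CPAN module Devel::Size (assuming it is installed) reports the
    approximate memory consumption of Perl data structures. A quick sketch:

    use strict;
    use Devel::Size qw(total_size);

    my %freq;
    my $n_keys = 100_000;
    # fill the hash with short dummy keys and integer values
    $freq{ sprintf("ngram%06d", $_) }++ for 1 .. $n_keys;

    my $bytes = total_size(\%freq);
    printf "%d keys, %d bytes, ca. %.0f bytes per entry\n",
      $n_keys, $bytes, $bytes / $n_keys;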

    Apart from the memory problems, there is no denying that Perl is
    inherently a slow language, especially when you write "C in Perl" (as
    far as I know, simple computations can be as much as 300 times faster
    in C).
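
    To check such figures on your own machine, the core Benchmark module
    gives quick ballpark timings for the Perl side; the corresponding C
    program would have to be timed separately, e.g. with time(1). A trivial
    sketch:

    use strict;
    use Benchmark qw(timethis);

    # run a simple arithmetic loop for about 3 CPU seconds and report the rate
    timethis(-3, sub { my $x = 0; $x += $_ * $_ for 1 .. 100 });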

    Although Perl's regular expression engine has a very good reputation,
    a fairly general tokenizer written in Perl -- which has to try various
    regular expressions for every token identified -- will invariably be
    slow, even if it uses regexes as simple as /\w/ or /\w+/.
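
    As an illustration (the token classes and patterns below are made up,
    not taken from any particular tokenizer), such a tokenizer typically
    runs every candidate token through a cascade of regexes, so even cheap
    patterns add up over millions of tokens:

    use strict;
    while (my $line = <STDIN>) {
      for my $token ($line =~ /(\S+)/g) {
        if    ($token =~ /^\d+(?:[.,]\d+)*$/) { print "NUM\t$token\n";   }
        elsif ($token =~ /^\w+$/)             { print "WORD\t$token\n";  }
        elsif ($token =~ /^[[:punct:]]+$/)    { print "PUNCT\t$token\n"; }
        else                                  { print "OTHER\t$token\n"; }
      }
    }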

    Christer Johansson's guess was:

       Somewhere a sort operation is needed.

    I don't think a sort is really necessary in this case (after all, it's
    just about counting N-gram frequencies), but there is a fair chance
    that the output is sorted, either alphabetically or by frequency.

       I guess that sort operation is
       implemented in a "simple for the programmer" way. Which means that it is
       likely somewhere between n*n and n*n*n in time. Unix sort uses more efficient
       algorithms that are more likely n*log n. One million keys would take
       between 10^12 and 10^18 operations in the slow versions, in the fast sort
       version it is 10^6*log(2?) of 10^6; is it somewhere near 20*10^6? This
       is most likely where your problem is.

    Perl uses the quicksort algorithm (if I'm not mistaken, it actually
    calls qsort() from the standard C library), so it should have an
    average-case complexity of O(n*log n). However, at least when a custom sort order
    is used (this includes sorting by reverse frequency), a callback to
    some Perl code has to be made for each comparison. In my experience,
    this can cause a substantial slowdown for large sorts, although it's
    never been a critical problem. It is much more efficient to save the
    frequency data in whatever order they come out of the Perl hash, and
    then apply an external sort program.
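
    One simple way to do that from within a Perl script is to print the
    unsorted counts into a pipe to the external sort(1) utility; the sort
    options below assume a POSIX sort, tab-separated output and the
    frequency in the second column, and the hash holds toy data:

    use strict;
    my %freq = ("ab" => 3, "bc" => 7, "cd" => 1);   # toy data
    # the list form of open avoids shell quoting of the tab separator
    open(my $out, "|-", "sort", "-t", "\t", "-k2,2nr")
      or die "cannot start sort: $!";
    while (my ($ngram, $f) = each %freq) {
      print $out "$ngram\t$f\n";
    }
    close($out) or die "sort failed: $!";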

    Below, you'll find the best Perl solution that I can think of at the
    moment. Andrius, if you can spare the time it would be interesting to
    see how it compares to Ted's general tool. Usage is

    ngrams.perl 5 file1.txt file2.txt ... > ngrams.list

    etc. I tested it with 1 million characters of German text. On my
    machine, it took about 3 minutes to extract 6-grams from the data, but
    it did eat up quite a lot of memory. I don't know how it performs on
    files with missing newlines, though ...

    Cheers,
    Stefan.

    ==================== ngrams.perl ====================

    #!/usr/bin/perl -w
    # $| = 1;
    use strict;

    my $N = shift @ARGV;
    die "Usage: ngrams.perl <N> file1.txt file2.txt ...\n"
      unless defined $N and $N =~ /^[1-9][0-9]*$/;

    my %joint_f;    # joint frequency of each N-gram (key: N consecutive characters)
    my @marginal_f; # $marginal_f[$j]{$c} = frequency of character $c at position $j

    my ($j, $f, $idx, $max_idx, $ngram);

    my $tokens = 0;
    my $data = "";
    while (<>) {
      # show progress on STDERR every 8192 input lines
      print STDERR "processing line #$.\r" if ($. & 0x1fff) == 0;
      chomp;
      $data .= " ".$_; # assumes line breaks correspond to blanks
      $max_idx = length($data) - $N;
      for ($idx = 0; $idx <= $max_idx; $idx++) {
        $ngram = substr($data,$idx,$N);
        $joint_f{$ngram}++;              # joint N-gram frequency
        for $j (0 .. $N-1) {
          # marginal frequency of the character at position $j of the N-gram
          $marginal_f[$j]{substr($ngram,$j,1)}++;
        }
        $tokens++;
      }
      # keep only the last N-1 characters, so that the final N-gram of this
      # chunk is not counted a second time after the next line is appended
      if ($N > 1) {
        $data = substr($data, -($N - 1)) if length($data) >= $N;
      }
      else {
        $data = "";  # unigrams never span line boundaries
      }
    }

    print STDERR "\nWriting $N-grams to stdout ... ";

    my $types = 0;
    # header row: the N-gram, its joint frequency f, and the marginal
    # frequencies f1..fN of its component characters
    print "\tf\t", join("\t", map {"f$_"} (1..$N)), "\n";
    while (($ngram, $f) = each %joint_f) {
      $types++;
      print "$ngram\t$f";
      for $j (0 .. $N-1) {
        print "\t",$marginal_f[$j]{substr($ngram,$j,1)};
      }
      print "\n";
    }

    print STDERR "Done.\n [$tokens tokens, $types types]\n";


