Re: [Corpora-List] N-gram string extraction

From: David Graff (graff@unagi.cis.upenn.edu)
Date: Tue Aug 27 2002 - 17:51:40 MET DST


    evert@IMS.Uni-Stuttgart.DE said:
    > I am currently working on extraction of statistically significant
    > n-gram (1<n<6) strings of alpha-numerical characters from a 100 mln
    > character corpus, and I intend to apply different significance tests
    > (MI, t-score, log-likelihood etc.) on these strings. I'm testing Ted
    > Pedersen's N-gram Statistics Package, which seems to be able to
    > perform these tasks; however, it hasn't produced any results after
    > one week of running.
    >
    > That's very probably because it's written in Perl and eating up lots
    > of memory. I don't think there's a way around C/C++ for problems of
    > that size (at the moment, at least).

    On the contrary, Perl can be reasonably economical with memory on a
    large data set, provided the code is written with care, which is
    likely true of Ted Pedersen's package. (Sure, in most cases it will
    take up more active RAM than the equivalent program written in C,
    and it is certainly possible to write Perl code so badly that it
    would run out of memory on any machine -- but the same thing can
    happen in C, of course...)

    In this case, it's more likely that the user is missing something
    simple about the basic usage of the package's utility programs -- e.g.
    if a Perl program (let's call it "util.perl") is written in this manner:

      #!/usr/bin/perl

      # <> reads the files named on the command line, or STDIN if none
      # are given
      while (<>) {
         # do stuff...
      }

    and the user simply runs the program at the command line like this:

      util.perl

    that is, with no file name, and no pipeline or redirection to provide
    data on STDIN for the program, it will "run" indefinitely, until the
    user kills it somehow -- it's just waiting for input data to work on.
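
    One way to make this failure visible instead of silent is to check,
    before starting, whether there is anything to read. What follows is
    only a minimal sketch in shell, not part of the package: the names
    input_source and run_util are made up for illustration, and
    "util.perl" is the same placeholder as above. (Inside a Perl script,
    the -t STDIN file test could serve the same purpose.)

```shell
#!/bin/sh
# Hypothetical helper: report where a `while (<>)` style filter would
# get its data from. Prints "file-args", "stdin", or "none".
input_source() {
    if [ "$#" -gt 0 ]; then
        echo "file-args"   # file names given: <> would read those files
    elif [ -t 0 ]; then
        echo "none"        # stdin is a terminal: the filter would just wait
    else
        echo "stdin"       # stdin is a pipe or a redirected file
    fi
}

# Hypothetical wrapper: refuse to start rather than wait on the keyboard.
# util.perl is a placeholder name, not a real program from the package.
run_util() {
    if [ "$(input_source "$@")" = "none" ]; then
        echo "usage: util.perl data.file  (or supply data on stdin)" >&2
        return 2
    fi
    util.perl "$@"
}
```

    With a guard like this, running the wrapper with no file name and
    no pipe or redirection prints a usage line and returns at once,
    rather than sitting there until it is killed.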

    Check the documentation for the utility program(s) in question; it may
    just be a matter of making sure that you are using one of the following
    kinds of command line:

       cat data.file | util.perl
    or
       util.perl < data.file
    or
       util.perl data.file

    Or it may be something more subtle in the usage of the package
    programs -- but it's bound to be just a matter of getting the usage
    right.
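
    To see that the three kinds of command line above hand a filter the
    same data, here is a small sketch in shell. It assumes a stand-in
    filter, count_lines (a made-up name), that follows the same
    convention as "while (<>)": read the named files if any are given,
    otherwise read standard input. The temporary file name is likewise
    only for illustration.

```shell
#!/bin/sh
# Stand-in for a `while (<>)` filter: cat reads the named files when
# arguments are given and standard input otherwise.
count_lines() {
    cat "$@" | wc -l | tr -d ' '
}

# Small throwaway data file for the demonstration.
tmp=$(mktemp "${TMPDIR:-/tmp}/ngram-demo.XXXXXX")
printf 'one\ntwo\nthree\n' > "$tmp"

a=$(cat "$tmp" | count_lines)   # pipeline
b=$(count_lines < "$tmp")       # redirection
c=$(count_lines "$tmp")         # file name as argument

echo "$a $b $c"                 # prints "3 3 3"
rm -f "$tmp"
```

    All three forms give the same count; what differs is only how the
    data reaches the program, so any of them avoids the "waiting for
    input" situation described earlier.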

    -----------
    David Graff                 Linguistic Data Consortium
    graff@ldc.upenn.edu         3615 Market St., Suite 200
    voice: (215) 898-0887       University of Pennsylvania
    fax: (215) 573-2175         Philadelphia, PA 19104
                                http://www.ldc.upenn.edu


