[Corpora-List] [osander@gmx.de: How to extract N-grams]

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Mon Sep 02 2002 - 19:50:42 MET DST

    Oliver Sander had some further comments on methods for the extraction
    of N-grams from corpora, but wasn't sure whether people on the list
    would be interested in reading them. Well, I think we are.

    I'd like to use this occasion for a short comment on the Computational
    Linguistics article by Yamamoto and Church that was mentioned in this
    context. Very interesting work indeed, but I can't help feeling that
    if only we had a MIPS 10000 with 16 GBytes of RAM, the naive approach
    would do quite well, too, at least for N-grams of any reasonable size.

    Cheers,
    Stefan

    ------- Start of forwarded message -------
    From: Oliver Sander <osander@gmx.de>
    Subject: How to extract N-grams
    Date: Sun, 1 Sep 2002 13:42:57 +0200

    -snip-------------------------------------------------------------------------
    Hello.

    [David Graff]
    > On the contrary, using Perl on a large data set can be reasonably
    > economical in terms of memory usage if the Perl code is written
    > reasonably well, which is likely true in the case of Ted Pedersen's
    > package. (Sure, it might take up more active RAM than the equivalent
    > program written in C in most cases, and it certainly is possible to
    > write Perl code badly, such that it would run out of memory on any
    > machine -- the same thing can happen in C, of course...)

    I basically have the same opinion. Perl does have a memory overhead compared
    to C/C++, but this overhead scales quite gracefully with problem size. So if a
    computation is completely infeasible in Perl, chances are slim that a C
    implementation will get it right either. What's more, the simplicity of Perl
    makes it easier to experiment with different approaches. If an approach turns
    out to be interesting, it might be a good idea to reimplement it in C, as a
    kind of "production version".

    [Stefan Evert]
    > Memory usage is usually the biggest culprit in performance problems
    > with large amounts of data. Once the machine runs out of physical RAM
    > and has to use the swap partition, performance degrades dramatically.
    > This can be so bad as to give the impression that the program got
    > stuck entirely (we've had such a case with a hash implementation in C,
    > actually). Therefore, reducing memory load even by a constant factor
    > of three or so can make an enormous difference.

    I agree that saving storage space so that the problem fits into main memory
    is sometimes a good solution. But if you have to handle problems that do not
    fit into RAM anyway (say, 100 million tokens), you have to take care of memory
    management yourself. And that may be easier in Perl, Python, or another
    scripting language.
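
    To make this concrete, one simple way of letting the file system do the
    memory management is to stream the N-grams to standard output and leave
    the grouping to an external sort, which merges sorted runs on disk by
    itself. This is only a sketch under assumptions of my own (tokens are
    whitespace-separated on STDIN, and the N-gram size is hard-coded):

        #!/usr/bin/perl -w
        use strict;

        # Print one N-gram per line; memory use stays constant no matter
        # how large the corpus is.
        my $n = 3;                       # example N-gram size
        my @window;                      # the last $n tokens seen
        while (<STDIN>) {
            foreach my $token (split) {
                push @window, $token;
                next if @window < $n;
                print join("\t", @window), "\n";
                shift @window;           # slide the window by one token
            }
        }

    Piped through "sort | uniq -c", identical N-grams are brought together
    and counted, and it is the external sort, not the script, that worries
    about fitting the data into RAM.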

    [Stefan Evert]
    > I'm not sure what you understand by "reasonably well", but for the
    > case of counting N-grams I would take the best solution to be
    >
    > $freq{join("\t", @n_gram)}++;

    In terms of computation time (as long as no swapping is needed), this is a
    good approach. But the memory consumption is quite high: you have to store
    *all* N-grams that occur, even those that occur only once. When you encounter
    n consecutive words, you don't yet know whether they will occur anywhere else,
    so you have to store them.
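
    Spelled out, the complete counting loop around that one-liner looks
    roughly like this (again just a sketch, with tokens read from STDIN and
    the N-gram size hard-coded):

        #!/usr/bin/perl -w
        use strict;

        my $n = 3;               # example N-gram size
        my %freq;                # one hash entry per *distinct* N-gram
        my @window;              # the last $n tokens seen

        while (<STDIN>) {
            foreach my $token (split) {
                push @window, $token;
                next if @window < $n;
                $freq{join("\t", @window)}++;    # the line quoted above
                shift @window;                   # slide the window by one
            }
        }

        foreach my $ngram (keys %freq) {
            print "$freq{$ngram}\t$ngram\n";
        }

    The %freq hash is exactly where the memory goes: one key for every
    distinct N-gram, including all those that occur only once.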

    [Christer Johansson]
    > Somewhere a sort operation is needed.

    I think that is a good direction.
    Nagao ("A new method of N-gram statistics for large number of n and
            automatic extraction of words and phrases from large text data of
            Japanese")
    suggests a sorted suffix array for extracting N-grams. The sorting takes
    O(N log N) in the number of tokens, but the huge memory overhead of storing
    every occurring N-gram explicitly is avoided. Moreover, the sorting can be
    implemented as a merge of sorted subsets, which makes it rather easy to keep
    most of the data on disk while only parts of it are held in main memory.
    I think this sorted suffix-array technique is the algorithm that scales best
    with problem size.
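
    Not Nagao's algorithm itself, but a toy in-memory illustration of why the
    sorting helps: sort token *positions* (one integer per token) instead of
    materialized N-gram strings, and identical N-grams end up adjacent, so
    counting becomes a single scan. I'm assuming here that the token array
    fits in memory and that N is hard-coded; Nagao's merge-based sorting is
    what would move most of this work to disk.

        #!/usr/bin/perl -w
        use strict;

        my $n = 3;                                        # example N-gram size
        my @tokens = split ' ', do { local $/; <STDIN> }; # corpus as token array

        # One integer per starting position instead of one key per N-gram.
        my @pos = (0 .. $#tokens - $n + 1);

        # Sort positions by the $n tokens starting there; in effect a suffix
        # array truncated at depth $n, which is all N-gram counting needs.
        my @sorted = sort {
            join("\t", @tokens[$a .. $a + $n - 1])
                cmp join("\t", @tokens[$b .. $b + $n - 1])
        } @pos;

        # Identical N-grams are now adjacent: count the runs.
        my ($prev, $count) = (undef, 0);
        foreach my $p (@sorted) {
            my $ngram = join("\t", @tokens[$p .. $p + $n - 1]);
            if (defined $prev and $ngram eq $prev) { $count++ }
            else {
                print "$count\t$prev\n" if defined $prev;
                ($prev, $count) = ($ngram, 1);
            }
        }
        print "$count\t$prev\n" if defined $prev;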

    [An implementation of this algorithm seems to be available from the
    author's homepage. --Stefan]

    Hope this helped.
    Cheers,
    Oliver Sander
    ------- End of forwarded message -------


