Re: [Corpora-List] [osander@gmx.de: How to extract N-grams]

From: Rykov V.V. (rykov@narod.ru)
Date: Tue Sep 03 2002 - 07:54:35 MET DST


      Stefan - I'm game (as my Irish postgrad friend used to tell me).
      I think every word in your letter is true.

      Vladimir Rykov

    >
    >Oliver Sander had some further comments on methods for the extraction
    >of N-grams from corpora, but wasn't sure whether the people on the
    >list would be interested to read them. Well, I think we are.
    >
    >I'd like to use this occasion for a short comment on the Computational
    >Linguistics article by Yamamoto and Church that was mentioned in this
    >context. Very interesting work, indeed, but I can't help feeling
    >that if only we had a MIPS R10000 with 16 GBytes of RAM, the naive
    >approach would do quite well, too. At least for N-grams of any
    >reasonable size.
    >
    >Cheers,
    >Stefan
    >
    >------- Start of forwarded message -------
    >From: Oliver Sander <osander@gmx.de>
    >Subject: How to extract N-grams
    >Date: Sun, 1 Sep 2002 13:42:57 +0200
    >
    >-snip-------------------------------------------------------------------------
    >Hello.
    >
    >[David Graff]
    >> On the contrary, using Perl on a large data set can be reasonably
    >> economical in terms of memory usage if the Perl code is written
    >> reasonably well, which is likely true in the case of Ted Pedersen's
    >> package. (Sure, it might take up more active RAM than the equivalent
    >> program written in C in most cases, and it certainly is possible to
    >> write Perl code badly, such that it would run out of memory on any
    >> machine -- the same thing can happen in C, of course...)
    >
    >I basically share this opinion. Perl does have a memory overhead compared
    >to C/C++, but this overhead scales quite gracefully with problem size. So if a
    >computation is completely infeasible in Perl, the chances are slim that a C
    >implementation will fare much better. Moreover, the simplicity of Perl makes it
    >easier to play with different approaches. If an approach turns out to be
    >interesting, it might be a good idea to reimplement it in C, as a kind of
    >"production version".
    >
    >[Stefan Evert]
    >> Memory usage is usually the biggest culprit in performance problems
    >> with large amounts of data. Once the machine runs out of physical RAM
    >> and has to use the swap partition, performance decreases extremely.
    >> Which can be so bad as to give the impression that the program got
    >> stuck entirely (we've had such a case with a hash implementation in C,
    >> actually). Therefore, reducing memory load even by a constant factor
    >> of three or so can make an enormous difference.
    >
    >I agree that saving storage space so that the problem fits into main memory is
    >sometimes a good solution. But if you have to handle problems that do not fit
    >into RAM at all (say, 100 million tokens), you have to take care of memory
    >management yourself. And that may be easier in Perl, Python or another
    >scripting language.
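    >
    >For concreteness, a minimal sketch of such hand-rolled memory management
    >(assuming one N-gram per line on STDIN; the size limit and file names are
    >illustrative): spill sorted partial counts to temporary files whenever the
    >hash grows too large, then merge the sorted files afterwards, e.g. with
    >"sort -m chunk*.cnt", summing the counts of equal keys in one final pass.
    >
    >    use strict; use warnings;
    >    my $limit = 1_000_000;  # max distinct N-grams held in RAM (illustrative)
    >    my %freq;
    >    my $chunk = 0;
    >    sub spill {             # write sorted partial counts, then empty the hash
    >        open my $fh, '>', sprintf("chunk%03d.cnt", $chunk++) or die $!;
    >        print $fh "$_\t$freq{$_}\n" for sort keys %freq;
    >        close $fh;
    >        %freq = ();
    >    }
    >    while (my $ngram = <STDIN>) {
    >        chomp $ngram;
    >        $freq{$ngram}++;
    >        spill() if keys(%freq) >= $limit;
    >    }
    >    spill() if %freq;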
    >
    >[Stefan Evert]
    >> I'm not sure what you understand by "reasonably well", but for the
    >> case of counting N-grams I would take the best solution to be
    >>
    >> $freq{join("\t", @n_gram)}++;
    >
    >In terms of computation time (if no memory swapping is needed), this is a
    >good approach. But the memory consumption is quite high: you have to store
    >*all* N-grams that appear, even those that occur only once. When you
    >encounter n consecutive words, you don't know whether they occur anywhere
    >else, so you have to store them.
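    >
    >Spelled out, the hash approach looks roughly like this (a sketch: it assumes
    >one token per line on STDIN, and the window size $n is illustrative):
    >
    >    use strict; use warnings;
    >    my $n = 3;                      # N-gram size (illustrative)
    >    my (@window, %freq);
    >    while (my $token = <STDIN>) {   # one token per line (assumption)
    >        chomp $token;
    >        push @window, $token;
    >        shift @window if @window > $n;
    >        $freq{ join("\t", @window) }++ if @window == $n;
    >    }
    >    # print the counts, most frequent N-gram first
    >    for my $ngram (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    >        print "$freq{$ngram}\t$ngram\n";
    >    }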
    >
    >[Christer Johansson]
    >> Somewhere a sort operation is needed.
    >
    >I think that is a good direction.
    >Nagao ("A new method of N-gram statistics for large number of n and
    > automatic extraction of words and phrases from large text data of
    > Japanese")
    >suggests a sorted suffix-array to extract n-grams. The sorting takes n*lg n,
    >but the huge memory overhead of storing all appearing consecutive tokens is
    >avoided. Moreover, the sorting can be implemented with a merging of sorted
    >subsets, which makes it possible (rather easily) to put parts of the data on
    >disc, whereas only some parts are held in main memory.
    >I think this sorted suffix-array technique is the best scalable (with problem
    >size) algorithm available.
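    >
    >The core idea, held in memory here for brevity (a sketch; one token per line
    >on STDIN and the value of $n are illustrative, and the external merge step
    >is left out):
    >
    >    use strict; use warnings;
    >    my $n = 3;                          # N-gram size (illustrative)
    >    chomp(my @tok = <STDIN>);           # one token per line (assumption)
    >    # "suffix array": start positions, sorted by the N tokens at each one
    >    my @sa = sort {
    >        join("\t", @tok[$a .. $a + $n - 1])
    >            cmp join("\t", @tok[$b .. $b + $n - 1])
    >    } 0 .. $#tok - $n + 1;
    >    # equal N-grams are now adjacent, so count runs instead of hashing
    >    my ($prev, $count);
    >    for my $pos (@sa) {
    >        my $ngram = join("\t", @tok[$pos .. $pos + $n - 1]);
    >        if (defined $prev and $ngram eq $prev) { $count++; next }
    >        print "$count\t$prev\n" if defined $prev;
    >        ($prev, $count) = ($ngram, 1);
    >    }
    >    print "$count\t$prev\n" if defined $prev;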
    >
    >[An implementation of this algorithm seems to be available from the
    >author's homepage. --Stefan]
    >
    >Hope this helped.
    >Cheers,
    >Oliver Sander
    >------- End of forwarded message -------
    >

    -- 
    

    Rykov V.V., MOSCOW

    Vladimir Rykov, PhD in Computational Linguistics, MOSCOW
    http://rykov.narod.ru/
    Engl. http://www.blkbox.com/~gigawatt/rykov.html
    Tel +7-903-749-19-99


