Re: [Corpora-List] [osander@gmx.de: How to extract N-grams]

From: Rykov V.V. (rykov@narod.ru)
Date: Tue Sep 03 2002 - 07:54:35 MET DST


      Stefan - I'm game (as my Irish postgrad friend used to tell me).
      I think every word in your letter is true.

      Vladimir Rykov

    >
    >Oliver Sander had some further comments on methods for the extraction
    >of N-grams from corpora, but wasn't sure whether the people on the
    >list would be interested to read them. Well, I think we are.
    >
    >I'd like to use this occasion for a short comment on the Computational
    >Linguistics article by Yamamoto and Church that was mentioned in this
    >context. Very interesting work, indeed, but I can't help feeling
    >that if only we had a MIPS R10000 with 16 GBytes of RAM, the naive
    >approach would do quite well, too. At least for N-grams of any
    >reasonable size.
    >
    >Cheers,
    >Stefan
    >
    >------- Start of forwarded message -------
    >From: Oliver Sander <osander@gmx.de>
    >Subject: How to extract N-grams
    >Date: Sun, 1 Sep 2002 13:42:57 +0200
    >
    >-snip-------------------------------------------------------------------------
    >Hello.
    >
    >[David Graff]
    >> On the contrary, using Perl on a large data set can be reasonably
    >> economical in terms of memory usage if the Perl code is written
    >> reasonably well, which is likely true in the case of Ted Pedersen's
    >> package. (Sure, it might take up more active RAM than the equivalent
    >> program written in C in most cases, and it certainly is possible to
    >> write Perl code badly, such that it would run out of memory on any
    >> machine -- the same thing can happen in C, of course...)
    >
    >I basically share this opinion. Perl does have a memory overhead compared
    >to C/C++, but this overhead scales quite gracefully with problem size. So if a
    >computation is completely infeasible in Perl, the chances are slim that a C
    >implementation will fare much better. Moreover, the simplicity of Perl makes it
    >easier to play with different approaches. If an approach turns out to be
    >interesting, it might be a good idea to reimplement it in C, as a kind of
    >"production version".
    >
    >[Stefan Evert]
    >> Memory usage is usually the biggest culprit in performance problems
    >> with large amounts of data. Once the machine runs out of physical RAM
    >> and has to use the swap partition, performance decreases extremely.
    >> Which can be so bad as to give the impression that the program got
    >> stuck entirely (we've had such a case with a hash implementation in C,
    >> actually). Therefore, reducing memory load even by a constant factor
    >> of three or so can make an enormous difference.
    >
    >I agree that saving storage space so that the problem fits into main memory is
    >sometimes a good solution. But if you have to handle problems that do not fit
    >into RAM at all (say, 100 million tokens), you have to take care of memory
    >management yourself. And that may be easier in Perl, Python or another
    >scripting language.
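    >
    >For concreteness, a minimal sketch of such hand-rolled memory management
    >(assuming one N-gram per line on STDIN; the size limit and file names are
    >illustrative): spill sorted partial counts to temporary files whenever the
    >hash grows too large, then merge the sorted files afterwards, e.g. with
    >"sort -m chunk*.cnt", summing the counts of equal keys in one final pass.
    >
    >    use strict; use warnings;
    >    my $limit = 1_000_000;  # max distinct N-grams held in RAM (illustrative)
    >    my %freq;
    >    my $chunk = 0;
    >    sub spill {             # write sorted partial counts, then empty the hash
    >        open my $fh, '>', sprintf("chunk%03d.cnt", $chunk++) or die $!;
    >        print $fh "$_\t$freq{$_}\n" for sort keys %freq;
    >        close $fh;
    >        %freq = ();
    >    }
    >    while (my $ngram = <STDIN>) {
    >        chomp $ngram;
    >        $freq{$ngram}++;
    >        spill() if keys(%freq) >= $limit;
    >    }
    >    spill() if %freq;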
    >
    >[Stefan Evert]
    >> I'm not sure what you understand by "reasonably well", but for the
    >> case of counting N-grams I would take the best solution to be
    >>
    >> $freq{join("\t", @n_gram)}++;
    >
    >In terms of computation time (if no memory swapping is needed), this is a
    >good approach. But the memory consumption is quite high: you have to store
    >*all* N-grams that appear, even those that occur only once. When you
    >encounter n consecutive words, you don't know whether they occur anywhere
    >else, so you have to store them.
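    >
    >Spelled out, the hash approach looks roughly like this (a sketch: it assumes
    >one token per line on STDIN, and the window size $n is illustrative):
    >
    >    use strict; use warnings;
    >    my $n = 3;                      # N-gram size (illustrative)
    >    my (@window, %freq);
    >    while (my $token = <STDIN>) {   # one token per line (assumption)
    >        chomp $token;
    >        push @window, $token;
    >        shift @window if @window > $n;
    >        $freq{ join("\t", @window) }++ if @window == $n;
    >    }
    >    # print the counts, most frequent N-gram first
    >    for my $ngram (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    >        print "$freq{$ngram}\t$ngram\n";
    >    }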
    >
    >[Christer Johansson]
    >> Somewhere a sort operation is needed.
    >
    >I think that is a good direction.
    >Nagao ("A new method of N-gram statistics for large number of n and
    > automatic extraction of words and phrases from large text data of
    > Japanese")
    >suggests a sorted suffix-array to extract n-grams. The sorting takes n*lg n,
    >but the huge memory overhead of storing all appearing consecutive tokens is
    >avoided. Moreover, the sorting can be implemented with a merging of sorted
    >subsets, which makes it possible (rather easily) to put parts of the data on
    >disc, whereas only some parts are held in main memory.
    >I think this sorted suffix-array technique is the best scalable (with problem
    >size) algorithm available.
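    >
    >The core idea, held in memory here for brevity (a sketch; one token per line
    >on STDIN and the value of $n are illustrative, and the external merge step
    >is left out):
    >
    >    use strict; use warnings;
    >    my $n = 3;                          # N-gram size (illustrative)
    >    chomp(my @tok = <STDIN>);           # one token per line (assumption)
    >    # "suffix array": start positions, sorted by the N tokens at each one
    >    my @sa = sort {
    >        join("\t", @tok[$a .. $a + $n - 1])
    >            cmp join("\t", @tok[$b .. $b + $n - 1])
    >    } 0 .. $#tok - $n + 1;
    >    # equal N-grams are now adjacent, so count runs instead of hashing
    >    my ($prev, $count);
    >    for my $pos (@sa) {
    >        my $ngram = join("\t", @tok[$pos .. $pos + $n - 1]);
    >        if (defined $prev and $ngram eq $prev) { $count++; next }
    >        print "$count\t$prev\n" if defined $prev;
    >        ($prev, $count) = ($ngram, 1);
    >    }
    >    print "$count\t$prev\n" if defined $prev;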
    >
    >[An implementation of this algorithm seems to be available from the
    >author's homepage. --Stefan]
    >
    >Hope this helped.
    >Cheers,
    >Oliver Sander
    >------- End of forwarded message -------
    >

    -- 
    

    Rykov V.V., MOSCOW

    Vladimir Rykov, PhD in Computational Linguistics, MOSCOW
    http://rykov.narod.ru/
    Engl. http://www.blkbox.com/~gigawatt/rykov.html
    Tel +7-903-749-19-99


