Re: [Corpora-List] N-gram extraction: Found it!

From: Ted Pedersen (ted_pedersen@hotmail.com)
Date: Wed Aug 28 2002 - 17:08:41 MET DST


    Hi Andrius,

    I'm glad to hear you isolated the problem. I was just running
    some of my own experiments with comparably sized data and
    was a little perplexed (happily so perhaps) as to why mine was
    running more quickly. But you're absolutely right about the
    negative impact of long lines on Perl. Your suggestion of a
    "progress meter" for NSP makes good sense, and we'll certainly
    incorporate that. It also seems that an "overly long line
    detector" would be a good safety feature.
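
    For what it's worth, here is a rough sketch of how those two
    safety features might look. It is written in Python rather than
    Perl, and it is not NSP code: the function name count_bigrams and
    the two thresholds are made up purely for illustration. It counts
    word bigrams within each line, warns when a line is suspiciously
    long, and reports progress every so many lines.

```python
import io
import sys
from collections import Counter


def count_bigrams(stream, max_line_chars=100_000, progress_every=10_000,
                  log=sys.stderr):
    """Count word bigrams line by line, with two safety features.

    max_line_chars and progress_every are illustrative defaults
    (assumptions), not values taken from any existing tool.
    Bigrams are counted within a line only; they do not span lines.
    """
    counts = Counter()
    long_lines = []  # line numbers that exceeded max_line_chars
    for lineno, line in enumerate(stream, start=1):
        # "Overly long line detector": warn but keep processing.
        if len(line) > max_line_chars:
            long_lines.append(lineno)
            print(f"warning: line {lineno} has {len(line)} characters",
                  file=log)
        tokens = line.split()
        counts.update(zip(tokens, tokens[1:]))
        # "Progress meter": periodic status report on the log stream.
        if lineno % progress_every == 0:
            print(f"{lineno} lines read, {len(counts)} distinct bigrams",
                  file=log)
    return counts, long_lines


if __name__ == "__main__":
    sample = io.StringIO("the cat sat\nthe cat ran\n")
    counts, long_lines = count_bigrams(sample)
    print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

    Note that a corpus with its line ends stripped would arrive here
    as a single line, so the warning would fire on line 1 and the
    progress report would never appear. That combination is itself a
    useful signal that something is wrong with the input.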

    BTW There are some rather nice tips from Ken Church about
    n-gram counting of very large files to be found in the
    archives of this list. Check out this thread on the
    good/bad of frequency lists...

    http://www.hit.uib.no/corpora/1995-4/0076.html

    I'm sure the papers mentioned are more complete sources of
    info, but it's sometimes rather fun to see the ebb and flow
    of these previous discussions.

    Best of luck,
    Ted

    >Dear list members,
    >
    >Thank you for all your suggestions and useful advice. I've collected
    >quite a lot of useful information about n-gram extraction, and if I
    >have time I will try to summarize it.
    >However, I have to admit that all this noise was due to one crucial
    >mistake, which I had overlooked. Our corpus was special in yet another
    >way: I had removed the end-of-line characters from it, which means the
    >Perl script was dealing with lines of enormous size.
    >People who know even a little Perl will understand why it would take
    >ages to process such a corpus, even with the best-written script.
    >I realized that when I tried Contantin Oras' simple script and could
    >see the rate at which the results were produced.
    >As I mentioned earlier, in such cases it would be useful to see some
    >kind of intermediate results, which I didn't get with Ted Pedersen's
    >script.
    >Sorry about all this confusion. I've greatly benefited from it, though.
    >
    >Sincerely,
    >Andrius Utka
    >Research Assistant
    >Birmingham University

    --
    Ted Pedersen
    http://www.umn.edu/~tpederse
    




    This archive was generated by hypermail 2b29 : Thu Aug 29 2002 - 09:56:34 MET DST