[Corpora-List] N-gram extraction: Found it!

From: andrius@ccl.bham.ac.uk
Date: Wed Aug 28 2002 - 16:21:16 MET DST

    Dear list members,

    Thank you for all your suggestions and useful advice. I've collected quite a
    lot of useful information about n-gram extraction, and if I have time I
    will try to summarize it.
    However, I have to admit that all this noise was due to one crucial
    mistake, which I had overlooked. Our corpus was special in yet another
    way: I had removed the line breaks from it, which means the Perl script was
    dealing with lines of enormous size.
    Anyone who knows even a little Perl will understand why it would take ages
    to process such a corpus, even with the best-written script.
    I realized this when I tried Constantin Oras' simple script and could see
    the rate at which the results were produced.
    As I mentioned earlier, in such cases it is useful to see some kind
    of intermediate results, which I did not have with Ted Pedersen's script.
    Sorry about all this confusion. I've greatly benefited from it though.
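
    To illustrate the two points above (a corpus with no line breaks, and the
    value of intermediate results), here is a minimal sketch. It is written in
    Python rather than Perl and is not either of the scripts mentioned; the
    names iter_tokens and count_ngrams, the chunk size, and the reporting
    interval are all assumptions made for the example. It reads the input in
    fixed-size chunks instead of line by line and prints a progress line to
    stderr at regular intervals.

```python
import io
import sys
from collections import Counter, deque


def iter_tokens(stream, chunk_size=1 << 16):
    """Yield whitespace-separated tokens, reading fixed-size chunks.

    Reading by chunk rather than by line keeps each read bounded even
    when the input contains no line breaks at all.
    """
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = tail + chunk
        parts = buffer.split()
        # If the buffer does not end in whitespace, the last token may be
        # cut in half, so hold it back until the next chunk arrives.
        if parts and not buffer[-1].isspace():
            tail = parts.pop()
        else:
            tail = ""
        yield from parts
    if tail:
        yield tail


def count_ngrams(stream, n=2, report_every=100_000, log=sys.stderr):
    """Count n-grams of tokens, reporting progress every report_every tokens.

    Set report_every to 0 to switch the progress lines off.
    """
    counts = Counter()
    window = deque(maxlen=n)  # sliding window over the last n tokens
    seen = 0
    for token in iter_tokens(stream):
        window.append(token)
        seen += 1
        if len(window) == n:
            counts[tuple(window)] += 1
        if report_every and seen % report_every == 0:
            print(f"{seen} tokens read, {len(counts)} distinct {n}-grams",
                  file=log)
    return counts


if __name__ == "__main__":
    # Hypothetical one-line "corpus" with no line breaks.
    sample = io.StringIO("the cat sat on the mat the cat sat")
    result = count_ngrams(sample, n=2, report_every=3)
    for ngram, freq in result.most_common(3):
        print(freq, " ".join(ngram))
```

    With progress lines like these, an unexpectedly slow run shows up within
    the first few intervals instead of only after the whole job has finished.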

    Sincerely,
    Andrius Utka
    Research Assistant
    Birmingham University


