[Corpora-List] N-gram string extraction

From: andrius@ccl.bham.ac.uk
Date: Tue Aug 27 2002 - 16:16:54 MET DST

Next message: Klas Prutz: "Re: [Corpora-List] N-gram string extraction"

Previous message: Alexander Gelbukh: "[Corpora-List] CICLing-2003 -- Computational Linguistics, Mexico, February, Springer LNCS"
In reply to: andrius@ccl.bham.ac.uk: "[Corpora-List] Registration for the 7th TELRI seminar"
Next in thread: Klas Prutz: "Re: [Corpora-List] N-gram string extraction"
Reply: Klas Prutz: "Re: [Corpora-List] N-gram string extraction"
Reply: Stefan Evert: "Re: [Corpora-List] N-gram string extraction"
Reply: Dirk Ludtke: "[Corpora-List] n-grams (follow-up question)"
Reply: Ted Pedersen: "Re: [Corpora-List] N-gram string extraction"

Dear list members,

I am currently working on extracting statistically significant n-gram
(1<n<6) strings of alphanumeric characters from a 100-million-character
corpus, and I intend to apply different significance tests (MI, t-score,
log-likelihood, etc.) to these strings. I am testing Ted Pedersen's N-gram
Statistics Package, which seems able to perform these tasks; however,
it has not produced any results after one week of running.
I have a couple of queries regarding n-gram extraction:
1. I'd like to ask whether members of the list are aware of similar software
capable of accomplishing the above-mentioned tasks reliably and
efficiently.
2. A statistical question: as I need to compute association scores for
trigrams, tetragrams, and pentagrams as well, I plan to split each of them
into a bigram consisting of a string of words plus one word, [n-1]+[1], and
compute association scores for these bigrams.
Does anyone know whether this is the right thing to do from a statistical
point of view?
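
To make the [n-1]+[1] idea concrete, here is a minimal Python sketch of how
such a split could be scored. It is only an illustration under stated
assumptions, not the method used by the N-gram Statistics Package: the
function names (ngram_counts, split_scores) are hypothetical, the marginals
are taken over n-gram positions, and the t-score uses the observed frequency
as the variance estimate. It shows the mechanics of the split and does not
settle whether the split is statistically sound.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def split_scores(tokens, n):
    """Score each n-gram as a pseudo-bigram: (first n-1 tokens) + (last token).

    Hypothetical helper: returns MI, log-likelihood (G2) and t-score
    computed from a 2x2 contingency table per n-gram.
    """
    full = ngram_counts(tokens, n)
    total = sum(full.values())

    # Marginal frequencies of the [n-1] prefix and the final [1] token,
    # both counted over n-gram positions so the table cells sum to `total`.
    prefix, last = Counter(), Counter()
    for gram, freq in full.items():
        prefix[gram[:-1]] += freq
        last[gram[-1]] += freq

    scores = {}
    for gram, o11 in full.items():
        r1, c1 = prefix[gram[:-1]], last[gram[-1]]
        observed = [o11, r1 - o11, c1 - o11, total - r1 - c1 + o11]
        expected = [
            r1 * c1 / total,
            r1 * (total - c1) / total,
            (total - r1) * c1 / total,
            (total - r1) * (total - c1) / total,
        ]
        scores[gram] = {
            "mi": math.log2(o11 / expected[0]),
            # Cells with zero observed frequency contribute nothing to G2.
            "g2": 2 * sum(o * math.log(o / e)
                          for o, e in zip(observed, expected) if o > 0),
            "t": (o11 - expected[0]) / math.sqrt(o11),
        }
    return scores


if __name__ == "__main__":
    tokens = "a b c a b c a b d".split()
    for gram, s in sorted(split_scores(tokens, 3).items()):
        print(gram, {k: round(v, 3) for k, v in s.items()})
```

Counting in a single pass with in-memory hash tables is simple, but for a
corpus of this size the memory cost of holding all n-gram types may be the
real bottleneck, so the counting step would probably need to be adapted.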

Thank you,
Andrius Utka

Research Assistant
Centre for Corpus Linguistics
University of Birmingham


This archive was generated by hypermail 2b29 : Tue Aug 27 2002 - 16:36:07 MET DST