Re: [Corpora-List] N-gram string extraction

From: Ted Pedersen (ted_pedersen@hotmail.com)
Date: Tue Aug 27 2002 - 20:25:26 MET DST

Next message: LEE: "[Corpora-List] enriching the concepts of category"

Previous message: Chris Brew: "Re: [Corpora-List] N-gram string extraction"
Maybe in reply to: andrius@ccl.bham.ac.uk: "[Corpora-List] N-gram string extraction"
Next in thread: andrius@ccl.bham.ac.uk: "Re: [Corpora-List] N-gram string extraction"
Reply: andrius@ccl.bham.ac.uk: "Re: [Corpora-List] N-gram string extraction"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Andrius,

We are always happy to hear from users of BSP/NSP. In fact, we
nearly beg folks to contact us in our READMEs, etc.
Perhaps you could send me some additional details of what you
are trying to do, and how you have done it thus far?
I'm at: tpederse@umn.edu

One recently added feature of NSP is that it allows the user
to define tokens using regular expressions. So you can say that
tokens are two-word sequences that start with the letter 'a'. Or
they can be two-character sequences, or they can be single
characters, etc. They can be whatever you want them to be, really.
However, a poorly crafted or very complex regular expression
can lead to serious performance problems. So the first thing
I would look at is how you are defining your tokens. I'd
be happy to do this; you just need to contact me.
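
To make the idea of regex-defined tokens concrete, here is a rough
sketch in Python. It is not NSP itself (NSP is written in Perl) and it
does not reproduce NSP's interface; the function name count_ngrams and
the three patterns are made up purely for illustration. It only shows
how the choice of token pattern changes what gets counted.

```python
import re
from collections import Counter

# Hypothetical token definitions, for illustration only.
WORD = r"\w+"        # a token is a run of word characters
CHAR = r"\w"         # a token is a single word character
TWO_CHARS = r"\w\w"  # a token is a two-character sequence

def count_ngrams(text, token_pattern, n):
    """Tokenize text with a regex, then count n-grams of those tokens."""
    tokens = re.findall(token_pattern, text)
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

word_bigrams = count_ngrams("the cat sat on the cat", WORD, 2)
char_bigrams = count_ngrams("ab ab", CHAR, 2)
pair_unigrams = count_ngrams("abcd ef", TWO_CHARS, 1)
```

The same text yields very different counts depending on the pattern,
and a more elaborate pattern means more matching work per token, which
is why the token definition is the first place to look when a run is
slow.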

For anyone on the list who doesn't know where to find NSP or
what it is, here it is:

http://www.d.umn.edu/~tpederse/nsp.html

Cordially,
Ted Pedersen

>Dear list members,
>
>I am currently working on extracting statistically significant n-gram
>(1<n<6) strings of alphanumeric characters from a 100-million-character
>corpus, and I intend to apply different significance tests (MI, t-score,
>log-likelihood etc.) to these strings. I'm testing Ted Pedersen's N-gram
>Statistics Package, which seems able to perform these tasks; however,
>it hasn't produced any results after one week of running.
>I have a couple of queries regarding n-gram extraction:
>1. I'd like to ask if members of the list are aware of similar software
>capable of accomplishing the above mentioned tasks reliably and
>efficiently.
>2. And a statistical question. As I need to compute association scores for
>trigrams, tetragrams, and pentagrams as well, I plan to split them into
>bigrams consisting of a string of words plus one word, [n-1]+[1], and
>compute association scores for them.
>Does anyone know if this is the right thing to do from a statistical point
>of view?
>
>Thank you,
>Andrius Utka
>
>Research Assistant
>Centre for Corpus Linguistics
>University of Birmingham

-- Ted Pedersen http://www.umn.edu/~tpederse



This archive was generated by hypermail 2b29 : Thu Aug 29 2002 - 22:11:43 MET DST