Re: [Corpora-List] N-gram string extraction

From: Ted Pedersen (ted_pedersen@hotmail.com)
Date: Tue Aug 27 2002 - 20:25:26 MET DST


    Hi Andrius,

    We are always happy to hear from users of BSP/NSP. In fact, we
    nearly beg folks to contact us in our READMEs, etc.
    Perhaps you could send me some additional details of what you
    are trying to do, and how you have done it thus far?
    I'm at: tpederse@umn.edu

    One newly added feature of NSP is that it allows the user
    to define tokens using regular expressions. So you can say that
    tokens are 2 word sequences that start with the letter 'a'. Or
    they can be two character long sequences, or they can be single
    characters, etc. They can be whatever you want them to be, really.
    However, a poorly crafted or very complex regular expression
    can really lead to problems with performance. So the first thing
    I would look at is how you are defining your tokens - and I'd
    be happy to do this - you just need to contact me.
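
    To make the idea of regex-defined tokens concrete, here is a minimal
    sketch in Python. It is not NSP itself (NSP is a separate package with
    its own way of specifying tokens); the function names tokenize and
    count_ngrams and the sample patterns are assumptions made up for
    illustration only. It just shows how the choice of regular expression
    changes what counts as a token, and therefore what the n-grams are.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; these are NOT part of NSP.

def tokenize(text, token_pattern):
    """Return every non-overlapping match of token_pattern, in order."""
    return [m.group(0) for m in re.finditer(token_pattern, text)]

def count_ngrams(tokens, n):
    """Count n-grams (as tuples of tokens) over a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "an apple and a pear"

words = tokenize(text, r"\w+")   # tokens are whole words
chars = tokenize(text, r"\w")    # tokens are single alphanumeric characters
pairs = tokenize(text, r"\w\w")  # tokens are two-character sequences

word_bigrams = count_ngrams(words, 2)
char_trigrams = count_ngrams(chars, 3)
```

    With character-level tokens the token list (and the n-gram table) is
    much larger than with word-level tokens, so the token definition is a
    natural first place to look when a run takes very long.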

    For anyone on the list who doesn't know where to find NSP or
    what it is, here it is:

    http://www.d.umn.edu/~tpederse/nsp.html

    Cordially,
    Ted Pedersen

    >Dear list members,
    >
    >I am currently working on extraction of statistically significant n-gram
    >(1<n<6) strings of alpha-numerical characters from a 100 mln character
    >corpus, and I intend to apply different significance tests (MI, t-score,
    >log-likelihood etc.) on these strings. I'm testing Ted Pedersen's N-gram
    >Statistics Package, which seems able to perform these tasks; however,
    >it hasn't produced any results after one week of running.
    >I have a couple of queries regarding n-gram extraction:
    >1. I'd like to ask if members of the list are aware of similar software
    >capable of accomplishing the above mentioned tasks reliably and
    >efficiently.
    >2. And a statistical question. As I need to count association scores for
    >trigrams, tetragrams, and pentagrams as well, I plan to split them into
    >bigrams consisting of a string of words plus one word [n-1]+[1] and
    >count association scores for them.
    >Does anyone know if this is the right thing to do from a statistical point
    >of view?
    >
    >Thank you,
    >Andrius Utka
    >
    >Research Assistant
    >Centre for Corpus Linguistics
    >University of Birmingham
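
    For readers who want to see what the [n-1]+[1] split in the quoted
    message amounts to, here is a rough Python sketch. It treats each
    n-gram as a pair (first n-1 tokens, last token) and scores the pair
    with pointwise mutual information, taking the marginals from the
    n-gram table. The function name split_pmi and the choice of PMI are
    assumptions for illustration; this is not how NSP computes its
    measures, and it does not settle whether the split is statistically
    appropriate.

```python
import math
from collections import Counter

def split_pmi(tokens, n):
    """Hypothetical helper: PMI of each n-gram viewed as (n-1 tokens) + (1 token)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())

    # Marginal counts: how often each (n-1)-token head and each final token
    # occur in their respective positions of the n-gram table.
    heads = Counter()
    tails = Counter()
    for gram, count in ngrams.items():
        heads[gram[:-1]] += count
        tails[gram[-1]] += count

    scores = {}
    for gram, count in ngrams.items():
        p_joint = count / total
        p_head = heads[gram[:-1]] / total
        p_tail = tails[gram[-1]] / total
        scores[gram] = math.log2(p_joint / (p_head * p_tail))
    return scores

tokens = "a b c a b c a b d".split()
scores = split_pmi(tokens, 3)
```

    Other measures (t-score, log-likelihood) could be computed from the
    same head/tail counts by building the usual 2x2 table for each pair.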

    --
    Ted Pedersen
    http://www.umn.edu/~tpederse
    




    This archive was generated by hypermail 2b29 : Thu Aug 29 2002 - 22:11:43 MET DST