[Corpora-List] N-gram string extraction

From: andrius@ccl.bham.ac.uk
Date: Tue Aug 27 2002 - 16:16:54 MET DST


    Dear list members,

    I am currently working on extracting statistically significant n-gram
    strings (1<n<6) of alphanumeric characters from a 100-million-character
    corpus, and I intend to apply different significance tests (MI, t-score,
    log-likelihood, etc.) to these strings. I am testing Ted Pedersen's N-gram
    Statistics Package, which seems able to perform these tasks; however,
    it has not produced any results after one week of running.
    I have a couple of queries regarding n-gram extraction:
    1. I'd like to ask whether members of the list are aware of similar
    software capable of accomplishing the above-mentioned tasks reliably and
    efficiently.
    2. And a statistical question. As I need to compute association scores
    for trigrams, tetragrams, and pentagrams as well, I plan to split each of
    them into a bigram consisting of a string of n-1 words plus one word,
    [n-1]+[1], and compute association scores for these bigrams.
    Does anyone know whether this is the right thing to do from a statistical
    point of view?
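
    To make the proposed [n-1]+[1] split concrete, here is a minimal Python
    sketch. It is only an illustration under stated assumptions, not the
    method of the N-gram Statistics Package and not an answer to the
    statistical question. The function names count_ngrams and split_scores
    are hypothetical, the input is assumed to be an already tokenised list,
    and the marginal counts are taken over n-gram positions so that the 2x2
    contingency table is internally consistent.

```python
import math
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def split_scores(tokens, n):
    """Score each n-gram as a pseudo-bigram: (first n-1 tokens) + (last token).

    Returns {ngram: (mi, t_score, log_likelihood)}.
    Assumption: marginals are counted over n-gram positions, not over the
    separate (n-1)-gram and unigram frequency lists.
    """
    ngrams = count_ngrams(tokens, n)
    total = sum(ngrams.values())

    # Marginal frequencies of the [n-1] part and the [1] part.
    prefix = Counter()
    last = Counter()
    for gram, c in ngrams.items():
        prefix[gram[:-1]] += c
        last[gram[-1]] += c

    scores = {}
    for gram, o11 in ngrams.items():
        r1 = prefix[gram[:-1]]   # occurrences of the (n-1)-token prefix
        c1 = last[gram[-1]]      # occurrences of the final token
        e11 = r1 * c1 / total    # expected frequency under independence

        mi = math.log2(o11 / e11)            # pointwise mutual information
        t = (o11 - e11) / math.sqrt(o11)     # t-score

        # Log-likelihood ratio (G2) over the full 2x2 table.
        observed = [o11, r1 - o11, c1 - o11, total - r1 - c1 + o11]
        expected = [
            e11,
            r1 * (total - c1) / total,
            (total - r1) * c1 / total,
            (total - r1) * (total - c1) / total,
        ]
        g2 = 2 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

        scores[gram] = (mi, t, g2)
    return scores


if __name__ == "__main__":
    toy = "a b c a b c a b d".split()
    for gram, (mi, t, g2) in sorted(split_scores(toy, 3).items()):
        print(gram, round(mi, 3), round(t, 3), round(g2, 3))
```

    Holding every n-gram count in memory, as this sketch does, is unlikely to
    scale directly to a corpus of 100 million characters; a frequency cut-off
    or disk-based counting would probably be needed.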

    Thank you,
    Andrius Utka

    Research Assistant
    Centre for Corpus Linguistics
    University of Birmingham


