Re: [Corpora-List] N-gram string extraction

From: Klas Prytz (klas.prytz@ling.uu.se)
Date: Tue Aug 27 2002 - 16:39:45 MET DST

    Hi,

    Just one question: what is a significant n-gram?
    In relation to what?

    Regards,

    Klas Prytz
    Institutionen för lingvistik
    Uppsala universitet

    On Tue, 27 Aug 2002 andrius@ccl.bham.ac.uk wrote:

    > Dear list members,
    >
    > I am currently working on the extraction of statistically significant n-gram
    > (1<n<6) strings of alphanumeric characters from a 100-million-character
    > corpus, and I intend to apply different significance tests (MI, t-score,
    > log-likelihood etc.) to these strings. I'm testing Ted Pedersen's N-gram
    > Statistics Package, which seems able to perform these tasks; however,
    > it hasn't produced any results after one week of running.
    > I have a couple of queries regarding n-gram extraction:
    > 1. I'd like to ask whether members of the list are aware of similar software
    > capable of accomplishing the above-mentioned tasks reliably and
    > efficiently.
    > 2. And a statistical question. As I need to compute association scores for
    > trigrams, tetragrams, and pentagrams as well, I plan to split them into
    > bigrams consisting of a string of words plus one word, [n-1]+[1], and
    > compute association scores for them.
    > Does anyone know whether this is the right thing to do from a statistical
    > point of view?
    >
    > Thank you,
    > Andrius Utka
    >
    > Research Assistant
    > Centre for Corpus Linguistics
    > University of Birmingham
    >
    >
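To make the [n-1]+[1] split in question 2 of the quoted message concrete, here is a minimal sketch in Python. It is an illustration only, not the method used by the N-gram Statistics Package and not an answer to whether the split is statistically sound. It assumes that each n-gram is treated as a pair (prefix of n-1 tokens, final token), that the marginal counts are taken over all n-gram positions in the corpus, and that MI, t-score and log-likelihood are computed from the resulting 2x2 contingency table in their usual textbook forms. The function names `split_counts` and `association_scores` are hypothetical.

```python
import math
from collections import Counter


def split_counts(tokens, n):
    """Count n-grams as pairs (prefix of n-1 tokens, final token).

    Returns the pair counts, the marginal counts for prefixes and for
    final tokens, and the total number of n-gram positions.
    """
    pairs = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        pairs[(gram[:-1], gram[-1])] += 1

    prefix, final = Counter(), Counter()
    for (p, w), c in pairs.items():
        prefix[p] += c
        final[w] += c
    return pairs, prefix, final, sum(pairs.values())


def association_scores(o11, r1, c1, n):
    """Return (MI, t-score, log-likelihood G2) for one pair.

    o11: joint count of (prefix, final token)
    r1:  count of the prefix in first position
    c1:  count of the final token in last position
    n:   total number of n-gram positions
    """
    e11 = r1 * c1 / n                      # expected joint count under independence
    mi = math.log2(o11 / e11)              # pointwise mutual information
    t = (o11 - e11) / math.sqrt(o11)       # t-score

    # Full 2x2 table for the log-likelihood ratio.
    r2, c2 = n - r1, n - c1
    observed = [o11, r1 - o11, c1 - o11, n - r1 - c1 + o11]
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return mi, t, g2


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = "a b c a b c a b d".split()
    pairs, prefix, final, total = split_counts(toy, 3)
    for (p, w), o11 in pairs.items():
        print(p, w, association_scores(o11, prefix[p], final[w], total))
```

Note that under this split the (n-1)-token prefix is treated as a single unit, so the scores measure only how strongly the final token is associated with that prefix, not the association among the tokens inside the prefix.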


