Re: [Corpora-List] N-gram string extraction

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Tue Aug 27 2002 - 17:12:33 MET DST


    Hi there!

       I am currently working on the extraction of statistically significant n-gram
       (1<n<6) strings of alpha-numerical characters from a 100 million character
       corpus, and I intend to apply different significance tests (MI, t-score,
       log-likelihood etc.) to these strings. I'm testing Ted Pedersen's N-gram
       Statistics Package, which seems able to perform these tasks, but it
       hasn't produced any results after a week of running.

    That's very probably because it's written in Perl and eating up lots
    of memory. I don't think there's a way around C/C++ for problems of
    that size (at the moment, at least).

    I always thought of NSP as a tool for counting N-grams of _tokens_
    rather than characters. Apparently, you can change its definition of
    a token, but that means using a trivial regular expression to chop
    single characters off your 100 million character input corpus, which
    is going to take ages.

       I have a couple of queries regarding n-gram extraction:
       1. I'd like to ask whether members of the list are aware of similar
       software capable of accomplishing the above-mentioned tasks reliably
       and efficiently.

    I'm afraid I don't know of any such tools. Technically, counting
    N-grams produces a very simplistic statistical language model (the
    kind used to generate random poetry), so perhaps you can dig up
    something in that area.

    On the other hand, if you aren't tied to Windows (i.e. you have
    access to a Linux or Solaris computer), there's the IMS Corpus
    Workbench:

    http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/

    which isn't quite as outdated as that web page suggests. Although it
    isn't obvious from the online materials, the Corpus Workbench could be
    abused (with the help of a simple Perl script) to do what you want (at
    the price of wasting lots of disk space). Kind of a last resort, I
    guess.
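
    For what it's worth, the "simple Perl script" I have in mind would be
    little more than the following: it rewrites the corpus with one
    character per line, i.e. the one-token-per-line "vertical" input
    format that the CWB encoding tool (cwb-encode) expects, treating each
    character as a token. That blown-up file is exactly where the disk
    space goes. A sketch, not a recipe:

    #!/usr/bin/perl -w
    # Sketch: convert running text on stdin to one character per line,
    # so that each character can be indexed as a "token" by cwb-encode.
    use strict;

    while (my $line = <STDIN>) {
        chomp $line;
        foreach my $char (split //, $line) {
            print "$char\n";
        }
    }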

       2. And a statistical question: as I need to compute association scores
       for trigrams, tetragrams, and pentagrams as well, I plan to split them
       into bigrams consisting of a string of words plus one word, i.e.
       [n-1]+[1], and compute association scores for those.
       Does anyone know whether this is the right thing to do from a
       statistical point of view?

    Again, I don't know of any well-founded discussion of association
    scores for N-grams in the literature. I consider it an intriguing
    problem and plan to do some work in this area when I've finished my
    thesis on bigram associations.

    The most systematic approach to N-grams I've come across is

    J.F. da Silva; G.P. Lopes. "A Local Maxima method and Fair Dispersion
    Normalization for extracting multi-word units from corpora." MOL 6,
    1999.

    which can be downloaded from the first author's homepage at

      http://terra.di.fct.unl.pt/~jfs/

    Their approach is based on breaking up N-grams into pairs of [n-1]+[1]
    words, too, but I must say that I'm not really convinced this is the
    right way to go.
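
    To make the [n-1]+[1] idea concrete, here is roughly what such an
    association score boils down to, with log-likelihood as the measure.
    The frequencies below are invented purely for illustration; in practice
    they come from your n-gram counts, and the same observed/expected table
    would also feed MI or t-score:

    #!/usr/bin/perl -w
    # Sketch: log-likelihood (G^2) for an N-gram split into [n-1]+[1],
    # i.e. the (n-1)-gram prefix vs. its final word.
    use strict;

    my $N       = 1_000_000;   # number of n-gram positions in the corpus
    my $f_ngram = 30;          # frequency of the full n-gram
    my $f_hist  = 50;          # frequency of the (n-1)-gram prefix
    my $f_word  = 400;         # frequency of the final word

    # observed 2x2 table (prefix present/absent vs. final word present/absent)
    my @O = ( $f_ngram,
              $f_hist - $f_ngram,
              $f_word - $f_ngram,
              $N - $f_hist - $f_word + $f_ngram );

    # expected frequencies under the independence hypothesis
    my @E = ( $f_hist * $f_word / $N,
              $f_hist * ($N - $f_word) / $N,
              ($N - $f_hist) * $f_word / $N,
              ($N - $f_hist) * ($N - $f_word) / $N );

    my $g2 = 0;
    for my $i (0 .. 3) {
        $g2 += 2 * $O[$i] * log($O[$i] / $E[$i]) if $O[$i] > 0;
    }
    printf "log-likelihood = %.2f\n", $g2;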

    Cheers,
    Stefan.

    -- 
    Moral: Early to rise and early to bed
           makes a male healthy and wealthy and dead.
    ______________________________________________________________________
    C.E.R.T. Marbach                         (CQP Emergency Response Team)
    http://www.ims.uni-stuttgart.de/~evert                  schtepf@gmx.de
    


