Re: [Corpora-List] N-gram string extraction

From: Chris Brew (cbrew@ling.ohio-state.edu)
Date: Tue Aug 27 2002 - 17:27:29 MET DST


    There's a recent publication by Mikio Yamamoto and Kenneth W. Church
    ("Using Suffix Arrays to Compute Term Frequency and Document Frequency
    for All Substrings in a Corpus", Computational Linguistics 27(1), 1-30,
    2001), which shows efficient ways to compute a number of interesting
    quantities over all substrings in a corpus.
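
    The core trick, roughly: a suffix array groups all occurrences of any
    substring into one contiguous block, so corpus-wide substring counts
    fall out of a single sort plus linear passes. A naive Perl illustration
    of the data structure (only to show the idea; the paper gives the
    efficient algorithms):

        my $text = "to be or not to be";
        # suffix array: every start position, sorted by the suffix
        # that begins there
        my @sa = sort { substr($text, $a) cmp substr($text, $b) }
                 0 .. length($text) - 1;
        # all suffixes beginning with a given substring now sit in one
        # contiguous block of @sa (found by binary search), so the
        # substring's frequency is just the block size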

    Very nice work.

    C

    On Tue, Aug 27, 2002 at 05:12:33PM +0200, Stefan Evert wrote:
    >
    > Hi there!
    >
    > I am currently working on the extraction of statistically significant
    > n-gram (1<n<6) strings of alphanumeric characters from a 100 million
    > character corpus, and I intend to apply different significance tests
    > (MI, t-score, log-likelihood etc.) to these strings. I'm testing Ted
    > Pedersen's N-gram Statistics Package, which seems to be able to
    > handle these tasks; however, it hasn't produced any results after a
    > week of running.
    >
    > That's very probably because it's written in Perl and eating up lots
    > of memory. I don't think there's a way around C/C++ for problems of
    > that size (at the moment, at least).
    >
    > I always thought of NSP as a tool for counting N-grams of _tokens_
    > rather than characters. Apparently, you can change its definition of
    > a token, but that means using a trivial regular expression to chop
    > single characters off your 100 million character input corpus. Which
    > is going to take ages.
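    >
    > (If you do roll your own, a direct one-pass character count in Perl
    > is at least simple to write, though memory will still be the
    > limiting factor on 100 million characters; untested sketch:
    >
    >     #!/usr/bin/perl -w
    >     use strict;
    >     my $n = 5;                 # N-gram length
    >     my %count;                 # N-gram string => frequency
    >     while (my $line = <>) {
    >         chomp $line;
    >         $count{ substr($line, $_, $n) }++
    >             for 0 .. length($line) - $n;
    >     }
    >     print "$count{$_}\t$_\n"
    >         for sort { $count{$b} <=> $count{$a} } keys %count;
    >
    > N-grams never cross line breaks here, which may or may not be what
    > you want.)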
    >
    > I have a couple of queries regarding n-gram extraction:
    > 1. I'd like to ask if members of the list are aware of similar
    > software capable of accomplishing the above-mentioned tasks reliably
    > and efficiently.
    >
    > I'm afraid I don't know of any such tools. Technically, counting
    > N-grams produces a very simplistic statistical language model (the
    > kind used to generate random poetry), so perhaps you can dig up
    > something in that area.
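    >
    > (Concretely, the random-poetry trick just samples each next word in
    > proportion to its observed count. A sketch, assuming bigram counts
    > in a hash of hashes, $succ{$word}{$next} = count:
    >
    >     sub pick_next {
    >         my ($word, $succ) = @_;
    >         my $total = 0;
    >         $total += $_ for values %{ $succ->{$word} };
    >         # draw a point in [0, total) and walk through the
    >         # successors until the cumulative count passes it
    >         my $r = rand($total);
    >         for my $w (keys %{ $succ->{$word} }) {
    >             return $w if ($r -= $succ->{$word}{$w}) < 0;
    >         }
    >     }
    >
    > Not that this helps with your extraction problem directly.)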
    >
    > On the other hand, if you aren't tied to Windows (i.e. you have
    > access to a Linux or Solaris computer), there's the IMS Corpus
    > Workbench:
    >
    > http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
    >
    > which isn't quite as outdated as that web page suggests. Although it
    > isn't obvious from the online materials, the Corpus Workbench could be
    > abused (with the help of a simple Perl script) to do what you want (at
    > the price of wasting lots of disk space). Kind of a last resort, I
    > guess.
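    >
    > (The abuse would work roughly like this: cwb-encode expects
    > verticalized input, one token per line, so a one-liner along the
    > lines of
    >
    >     perl -ne 'chomp; print "$_\n" for split //;' corpus.txt > corpus.vrt
    >
    > turns each character into a "token"; encode the result and let CQP
    > do the counting. One line per character is where all the disk space
    > goes.)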
    >
    > 2. And a statistical question. As I need to compute association
    > scores for trigrams, tetragrams, and pentagrams as well, I plan to
    > split them into bigrams consisting of a string of words plus one
    > word, [n-1]+[1], and compute association scores for those.
    > Does anyone know if this is the right thing to do from a statistical
    > point of view?
    >
    > Again, I don't know of any well-founded discussion of association
    > scores for N-grams in the literature. I consider it an intriguing
    > problem and plan to do some work in this area when I've finished my
    > thesis on bigram associations.
    >
    > The most systematic approach to N-grams I've come across is
    >
    > J.F. da Silva; G.P. Lopes. "A Local Maxima method and Fair Dispersion
    > Normalization for extracting multi-word units from corpora." MOL 6,
    > 1999.
    >
    > which can be downloaded from the first author's homepage at
    >
    > http://terra.di.fct.unl.pt/~jfs/
    >
    > Their approach is based on breaking up N-grams into pairs of [n-1]+[1]
    > words, too, but I must say that I'm not really convinced this is the
    > right way to go.
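    >
    > For concreteness, under that scheme a trigram (x y z) would be
    > scored by treating (x y) as a single unit and computing an ordinary
    > bigram score against z, e.g. pointwise MI. A sketch (whether the
    > implied independence assumption is sensible is precisely the open
    > question):
    >
    >     # $f_ngram = freq("x y z");  $f_left = freq("x y");
    >     # $f_right = freq("z");  $N = number of N-gram positions
    >     sub mi {
    >         my ($f_ngram, $f_left, $f_right, $N) = @_;
    >         return log(($f_ngram * $N) / ($f_left * $f_right)) / log(2);
    >     }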
    >
    > Cheers,
    > Stefan.
    >
    > --
    > Moral: Early to rise and early to bed
    > makes a male healthy and wealthy and dead.
    > ______________________________________________________________________
    > C.E.R.T. Marbach (CQP Emergency Response Team)
    > http://www.ims.uni-stuttgart.de/~evert schtepf@gmx.de
    >

    -- 
    =================================================================
    Dr. Chris Brew,  Assistant Professor of Computational Linguistics
    Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
    Tel:  +614 292 5420 Fax: +614 292 8833
    Email:cbrew@ling.osu.edu
    =================================================================
    


