Re: [Corpora-List] N-gram string extraction

From: andrius@ccl.bham.ac.uk
Date: Wed Aug 28 2002 - 10:57:43 MET DST


    Hello Ted,

    Thank you for your reply. I really like your software, which is why I've
    chosen it. It's very flexible, and I don't think there is anything wrong
    with it; I just thought there might be quicker ways. It is now in its
    seventh day of running.

      PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
       7564 pts/0 SW 0:00 590 473 1762 0 0.0 [bash]
       7837 pts/0 R 10690:08 31556 658 10317 7624 2.9 [perl]
      22654 pts/4 SW 0:00 627 473 1750 0 0.0 [bash]
      22713 pts/4 SW 0:06 1609 491 2732 0 0.0 [mutt]
      23249 pts/4 SW 0:02 1048 755 2412 0 0.0 [editor]
      23771 pts/5 S 0:01 792 473 1758 1264 0.4 -bash
      23948 pts/5 R 0:00 358 55 2740 976 0.3 ps v

    Well, that's the whole story. We want to extract statistically
    significant n-gram strings of characters. We decided to ignore all
    punctuation marks except full stops and spaces, so I stripped the rest
    off; a sketch of that step follows. The corpus is 14 million words,
    which is 64,812,293 characters in 153 files.
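    For reference, the stripping step was roughly along these lines (a
    simplified sketch rather than the exact script we ran; the real one
    also looped over all 153 files, and the name strip.pl is made up):

      #!/usr/bin/perl -w
      # strip.pl: drop all punctuation except full stops and spaces
      use strict;

      while ( my $line = <STDIN> ) {
          # keep letters, digits, full stops, spaces and newlines
          $line =~ s/[^A-Za-z0-9. \n]//g;
          print $line;
      }

    invoked per file as something like:

    > perl strip.pl < file.txt > file.new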
    Then, since your software is designed for words rather than for
    characters, we thought we would insert spaces between the letters, so
    the text takes the form: c h a r a c t e r s a r e t r i c k y... As I
    realised afterwards, that wasn't necessary, since you can define tokens
    in token.txt as single characters with /\w/.
    For this long run, however, I used /\w+/, which means "one or more word
    characters" and which is still valid for our corpus. Right? And it is
    nothing like a very complicated regexp, is it?
    I didn't want full stops as tokens, but rather as separators, so I
    didn't specify any regexp for full stops. I tried the setup on several
    files to check that the globbing was working, and it was all right.
    (The token files in question are shown below.)
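    To illustrate: if I understand the token-file format correctly (one
    Perl regular expression per line, enclosed in slashes), the token.txt
    for single-character tokens would contain just the line

      /\w/

    whereas the token.txt I actually used for this run contains

      /\w+/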
    So, on the command line I'm running (this is an exact copy from the
    command line):
    > perl ~/bin/nsp-v0.51/count.pl --token token.txt output.txt *.new
    on one machine and:
    > perl ~/bin/nsp-v0.51/count.pl --token token.txt --ngram 3 output.txt *.new
    on the other.
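    (If I read the documentation right, the output file should eventually
    hold one n-gram per line, the tokens joined by "<>" and followed by
    frequency counts, along the lines of

      t<>h<>1247 ...

    for a character bigram; but we haven't seen any output yet, so this is
    only my expectation.)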

    As I said, it has not produced any results so far. In cases like this it
    would be very helpful to have some sort of indication of "where we are",
    as right now we're wondering whether the program is doing one character
    per second or one per hour... Surely there is a way to check, but not a
    very straightforward one, I guess; the roundabout one I had in mind is
    sketched below. It might be some mistake of mine after all, but I would
    really like to be shown where.
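    For what it's worth, the roundabout check I had in mind: on Linux one
    can at least see which input file a process currently holds open by
    listing its file descriptors under /proc, e.g. for the perl process in
    the ps listing above:

    > ls -l /proc/7837/fd

    This should show whether count.pl has moved beyond the first of the
    153 files, though not how far through that file it has got.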

    Thank you,
    Andrius

    > Hi Andrius,
    >
    > We are always happy to hear from users of BSP/NSP. In fact, we
    > nearly beg folks to contact us in our READMEs, etc.
    > Perhaps you could send me some additional details of what you
    > are trying to do, and how you have done it thus far?
    > I'm at : tpederse@umn.edu
    >
    > One newly added power of NSP is that it allows the user
    > to define tokens using regular expressions. So you can say that
    > tokens are two-word sequences that start with the letter 'a'. Or
    > they can be two-character sequences, or single
    > characters, etc. They can be whatever they want to be, really.
    > However, a poorly crafted or very complex regular expression
    > can really lead to problems with performance. So the first thing
    > I would look at is how you are defining your tokens - and I'd
    > be happy to do this - you just need to contact me.
    >
    > For anyone on the list who doesn't know where to find NSP or
    > what it is, here it is:
    >
    > http://www.d.umn.edu/~tpederse/nsp.html
    >
    > Cordially,
    > Ted Pedersen
    >
    > >Dear list members,
    > >
    > >I am currently working on the extraction of statistically significant
    > >n-gram (1<n<6) strings of alphanumeric characters from a 100 million
    > >character corpus, and I intend to apply different significance tests
    > >(MI, t-score, log-likelihood etc.) to these strings. I'm testing Ted
    > >Pedersen's N-gram Statistics Package, which seems able to accomplish
    > >these tasks; however, it hasn't produced any results after one week of
    > >running.
    > >I have a couple of queries regarding n-gram extraction:
    > >1. I'd like to ask whether members of the list are aware of similar
    > >software capable of accomplishing the above-mentioned tasks reliably
    > >and efficiently.
    > >2. And a statistical question. As I need to compute association scores
    > >for trigrams, tetragrams, and pentagrams as well, I plan to split them
    > >into bigrams consisting of a string of n-1 words plus one word, i.e.
    > >[n-1]+[1], and to compute association scores for those.
    > >Does anyone know whether this is the right thing to do from a
    > >statistical point of view?
    > >
    > >Thank you,
    > >Andrius Utka
    > >
    > >Research Assistant
    > >Centre for Corpus Linguistics
    > >University of Birmingham
    >
    >
    >
    >
    > --
    > Ted Pedersen
    > http://www.umn.edu/~tpederse
    >

    -- 
    Andrius Utka			Centre for Corpus Linguistics
    mailto:andrius@ccl.bham.ac.uk	Department of English
    Tel:    +44 (0)121 414 8135	Birmingham University
    Fax:    +44 (0)121 414 6053	Birmingham B15 2TT
    


