Re: Corpora: ngram frequencies with intervening words?

From: Hristo Tanev (htanev@yahoo.co.uk)
Date: Tue Apr 24 2001 - 07:05:10 MET DST

    Dear All,
    This topic is interesting, and it raises another question
    for me.
    Does anyone know of a study of non-strict grammars,
    where a gap is allowed between constituents?

    I am thinking of grammars designed to tolerate POS-tagger
    errors. Such grammars allow a bounded gap between
    constituents, provided the gap does not contain certain
    POS tags or punctuation marks.

    For example the rule:
    NP -> AP N

    could be translated into the "fuzzy" rule

    NP -> AP (X) N
    where length(X) <= 2 and X doesn't contain V, N, A or ","

    This way, some POS-tagger errors inside the sequence X
    could be ignored.
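
A minimal sketch of how such a fuzzy rule could be checked against a labelled sequence, assuming the input is already a flat list of labels in which AP constituents have been recognised. The function name `match_fuzzy_np`, the label strings, and the representation itself are hypothetical choices for illustration, not taken from any existing parser.

```python
from typing import List, Sequence, Set, Tuple

# Labels that may not appear inside the gap X (from the rule above).
FORBIDDEN: Set[str] = {"V", "N", "A", ","}


def match_fuzzy_np(
    labels: Sequence[str],
    max_gap: int = 2,
    forbidden: Set[str] = FORBIDDEN,
) -> List[Tuple[int, int]]:
    """Return (ap_index, n_index) pairs matching NP -> AP (X) N.

    X is a run of at most `max_gap` labels, none of them in `forbidden`.
    Assumption: `labels` is a flat sequence such as ["AP", "Adv", "N"].
    """
    spans = []
    for i, label in enumerate(labels):
        if label != "AP":
            continue
        # Try gap lengths 0..max_gap, extending X one label at a time.
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(labels):
                break
            if labels[j] == "N":
                spans.append((i, j))  # head noun found, rule matches
                break
            if labels[j] in forbidden:
                break  # a forbidden label cannot be part of X
    return spans


print(match_fuzzy_np(["AP", "Adv", "N"]))
```

With max_gap=0 this reduces to the strict rule NP -> AP N, so the effect of the gap can be measured by comparing the two settings on the same tagged text.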

    I haven't read about such grammars, but that doesn't
    mean they don't exist. I am also not 100% convinced
    they are effective, but it is interesting,
    isn't it?

    Best wishes,
    Hristo Tanev

    --- Bruce Lambert <lambertb@uic.edu> wrote:
    > Greetings,
    >
    > In the simplest case, when we compute ngram word frequencies, we
    > consider adjacent words as ngrams. But we may also want to know
    > about pairs of words that occur within n words of one another. Is
    > there a program out there to compute ngram frequencies allowing a
    > variable-width window between the words in the bigram? Ideally, the
    > program would allow the user to rank the bigrams not only by bigram
    > frequency, but also by the frequency of the intervening word
    > patterns. For example, in a database of eighth grade science
    > lessons, the bigram "atom smallest" might occur several times in
    > different contexts. I'd like output approximately as follows:
    >
    > atom smallest (3) (1 "was the") (2 "is the")
    >
    > Indicating that the bigram "atom smallest" with window size 2
    > occurred 3 times total, once with the intervening words "was the"
    > and twice with the intervening words "is the".
    >
    > I can think of a brute force way to do this myself, of course, but
    > I'd rather not reinvent the wheel if I can avoid it.
    >
    > -bruce
    >
    >
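
For the windowed counts Bruce describes, the brute-force approach might look roughly like the sketch below. It is not an existing tool: the names `gapped_bigrams` and `format_entry` are made up here, the input is assumed to be pre-tokenised sentences, and the three sample sentences are invented only to reproduce the shape of the output in the quoted message.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def gapped_bigrams(
    sentences: Iterable[List[str]], max_gap: int = 2
) -> Dict[Pair, Counter]:
    """Count word pairs separated by 0..max_gap intervening words.

    Maps (first, second) to a Counter keyed by the intervening words,
    joined with spaces ("" for adjacent words).
    """
    table: Dict[Pair, Counter] = defaultdict(Counter)
    for tokens in sentences:
        for i, first in enumerate(tokens):
            # j runs over partners with at most max_gap words in between.
            for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
                between = " ".join(tokens[i + 1:j])
                table[(first, tokens[j])][between] += 1
    return table


def format_entry(pair: Pair, patterns: Counter) -> str:
    """Render one pair as: first second (total) (count "pattern") ..."""
    total = sum(patterns.values())
    ordered = sorted(patterns.items(), key=lambda kv: (kv[1], kv[0]))
    parts = " ".join(f'({n} "{p}")' for p, n in ordered)
    return f"{pair[0]} {pair[1]} ({total}) {parts}"


# Invented sample data, for illustration only.
sample = [
    "the atom is the smallest unit".split(),
    "an atom is the smallest particle".split(),
    "the atom was the smallest thing known".split(),
]
table = gapped_bigrams(sample, max_gap=2)
print(format_entry(("atom", "smallest"), table[("atom", "smallest")]))
```

Ranking by bigram frequency would then be a matter of sorting the table by the sum of each Counter. The cost grows with the number of tokens times the window size, which is why this counts as brute force.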



