Corpora: ngram frequencies with intervening words?

From: Bruce Lambert (lambertb@uic.edu)
Date: Mon Apr 23 2001 - 22:41:24 MET DST

    Greetings,

    In the simplest case, when we compute ngram word frequencies, we consider
    adjacent words as ngrams. But we may also want to know about pairs of words
    that occur within n words of one another. Is there a program out there to
    compute ngram frequencies allowing a variable-width window between the
    words in the bigram? Ideally, the program would allow the user to rank the
    bigrams not only by bigram frequency, but also by the frequency of the
    intervening word patterns. For example, in a database of eighth-grade
    science lessons, the bigram "atom smallest" might occur several times in
    different contexts. I'd like output approximately as follows:

    atom smallest (3) (1 "was the") (2 "is the")

    This would indicate that the bigram "atom smallest" with window size 2
    occurred 3 times in total, once with the intervening words "was the" and
    twice with the intervening words "is the".
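
For illustration, the counting described above could be sketched roughly as follows in Python. This is only a minimal sketch under stated assumptions, not an existing tool: the names `gapped_bigrams`, `format_entry`, and `max_gap` are hypothetical, the input is assumed to be already tokenized (here by splitting on whitespace), and sentence boundaries are ignored.

```python
from collections import Counter, defaultdict


def gapped_bigrams(tokens, max_gap=2):
    """Map each ordered word pair to a Counter of its intervening word strings.

    Pairs are counted when 0..max_gap words lie between them (assumption:
    the "window size" means the maximum number of intervening words).
    """
    patterns = defaultdict(Counter)
    for i, first in enumerate(tokens):
        # j indexes the second word; j - i - 1 words lie between the pair.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            between = " ".join(tokens[i + 1:j])
            patterns[(first, tokens[j])][between] += 1
    return patterns


def format_entry(pair, gaps):
    """Render one pair as: w1 w2 (total) (count "gap") ..."""
    total = sum(gaps.values())
    # Least frequent pattern first, to mirror the ordering in the example.
    ordered = sorted(gaps.items(), key=lambda kv: (kv[1], kv[0]))
    parts = " ".join(f'({n} "{gap}")' for gap, n in ordered)
    return f"{pair[0]} {pair[1]} ({total}) {parts}"


if __name__ == "__main__":
    # Made-up sample text, purely for demonstration.
    text = ("the atom is the smallest unit . an atom is the smallest part . "
            "the atom was the smallest thing")
    counts = gapped_bigrams(text.split(), max_gap=2)
    pair = ("atom", "smallest")
    print(format_entry(pair, counts[pair]))
    # prints: atom smallest (3) (1 "was the") (2 "is the")
```

Ranking the pairs by total frequency would then be a matter of sorting the dictionary by `sum(gaps.values())`. The loop visits roughly `len(tokens) * (max_gap + 1)` pairs, so it is the brute-force approach and keeps every pair in memory.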

    I can think of a brute-force way to do this myself, of course, but I'd
    rather not reinvent the wheel if I can avoid it.

    -bruce


