Re: Corpora: grammar of English letter-sequences

From: Bill Fisher (william.fisher@nist.gov)
Date: Thu May 04 2000 - 15:14:23 MET DST


    Geoffrey Sampson wrote:

    > Does anyone know of anything like a grammar of English letter-sequences --
    > a system which generates the range of character-sequences which could
    > plausibly occur as words of English, and a subset of which actually do?

     About a dozen years ago, when I was working at TI, I tested a
    regular-grammar discovery procedure by using words from a dictionary as
    sentences (so letters played the role of words). I don't remember
    anything very great coming of it; the hard problem remained how to make
    the right generalizations to unseen data.
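
    As a rough illustration of the "letters as words" setup (not the
    procedure tested at TI, which is not described here), the sketch below
    learns about the simplest regular grammar possible: the set of
    letter-to-letter transitions seen in a word list. The function names and
    the four-word toy dictionary are assumptions made for the example.

```python
from collections import defaultdict

# Boundary markers so the grammar also learns legal first and last letters.
START, END = "^", "$"


def learn_bigram_grammar(words):
    """Map each symbol to the set of symbols observed directly after it."""
    allowed = defaultdict(set)
    for word in words:
        symbols = [START, *word.lower(), END]
        for prev, nxt in zip(symbols, symbols[1:]):
            allowed[prev].add(nxt)
    return dict(allowed)


def accepts(grammar, candidate):
    """True if every adjacent pair in the candidate was seen in training."""
    symbols = [START, *candidate.lower(), END]
    return all(
        nxt in grammar.get(prev, ())
        for prev, nxt in zip(symbols, symbols[1:])
    )


# Toy "dictionary" (hypothetical data, far too small to be meaningful).
grammar = learn_bigram_grammar(["cat", "can", "tan", "ant"])

for candidate in ["cat", "cant", "tat", "atatat", "tac", "kkk"]:
    print(candidate, accepts(grammar, candidate))
```

    Even on this toy data the generalization problem shows up: the grammar
    accepts the unseen "cant", which is welcome, but it equally accepts
    "atatat", and nothing in the training words says which of those
    generalizations is the right one.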

     A year or so ago I did some experiments similar, but not identical, to
    what you're interested in: I generated all valid word spellings (up to
    a certain number of letters) by generating all letter sequences, running
    each through the best set of letter-to-phone rules I had, then testing
    each resulting phone sequence for pronounceability by seeing whether my
    syllabification software could syllabify it with nothing left over.
    If your definition of a grammar is any device that generates valid
    sentences, I guess I was doing what you asked about. But
    the results were not great, probably because I'd trained the
    letter-to-phone (TTP) rules on positive examples only. So I trained
    another set, this time also including a large number of negative cases
    like "kkk => /k k k/", and the results were better, but still not the
    kind I'd be proud to publish.
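
    The generate-and-filter pipeline described above can be sketched as
    follows. This is a minimal sketch under heavy assumptions: the
    one-letter-to-one-phone table and the regular-expression "syllabifier"
    are hypothetical stand-ins for the trained letter-to-phone rules and the
    syllabification software, neither of which is shown in this message.

```python
import re
from itertools import product

# Hypothetical letter-to-phone table: one letter maps to one phone.
LETTER_TO_PHONE = {"a": "a", "i": "i", "k": "k", "s": "s", "t": "t"}

# Toy syllable shape (an assumption): optional single-consonant onset,
# one vowel, optional single-consonant coda.
SYLLABLE = "[kst]?[ai][kst]?"
WHOLE_WORD = re.compile(f"(?:{SYLLABLE})+")


def to_phones(spelling):
    """Stand-in for the letter-to-phone rules."""
    return [LETTER_TO_PHONE[letter] for letter in spelling]


def syllabifies_completely(phones):
    """Stand-in for the syllabifier: succeed only if nothing is left over."""
    return WHOLE_WORD.fullmatch("".join(phones)) is not None


def plausible_spellings(alphabet, max_len):
    """Enumerate every letter sequence up to max_len and keep those whose
    phone sequence can be fully divided into syllables."""
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            spelling = "".join(letters)
            if syllabifies_completely(to_phones(spelling)):
                yield spelling


if __name__ == "__main__":
    kept = list(plausible_spellings("aikst", 3))
    print(len(kept), "spellings kept, e.g.", kept[:10])
    print("kat" in kept, "kkk" in kept)
```

    With this toy table "kkk" maps to /k k k/, which cannot be syllabified,
    so it is filtered out. The number of candidates grows as the alphabet
    size raised to the word length, so a full 26-letter run is only
    practical up to a modest length.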

     - Bill Fisher


