Re: Corpora: grammar of English letter combinations

From: Bill Fisher (william.fisher@nist.gov)
Date: Wed May 24 2000 - 19:08:24 MET DST

      Doing some clean-up today, I ran across some
    work that I did in February of last year that
    looks very similar to Geoffrey's; sorry I didn't
    remember it when his first query went out.

      I used a large union pronlex (~400k entries) that
    we have put together here as the source of word
    spellings, which were written out into a corpus file
    with a space between the letters, so that each letter
    counts as a "word" and each spelling as a "sentence".
    This was then fed into standard utilities in the
    CMU-Cambridge Statistical Language Modeling Toolkit to
    produce a back-off trigram language model. A program of
    mine then generated "sentences" at random, drawing each
    successive "word" according to its model probability.
    Here are some of the better non-English (afaik) results,
    preceded by their estimated probabilities per letter:

    # PR/NWORDS SENTENCE ...
    0.0134141155 [ <s> p e d </s> ] ("ped")
    0.0044879812 [ <s> p o n </s> ] ("pon")
    0.0038318857 [ <s> a c k </s> ] ("ack")
    0.0020768188 [ <s> z o </s> ] ("zo")
    0.0017731462 [ <s> a s t </s> ] ("ast")
    0.0011737253 [ <s> p r i n g </s> ] ("pring")
    0.0005763704 [ <s> c o m y </s> ] ("comy")
    0.0003478832 [ <s> w e l l y </s> ] ("welly")
    0.0002926426 [ <s> g l i n g </s> ] ("gling")
    0.0002905756 [ <s> w o o n </s> ] ("woon")
    0.0001774358 [ <s> c o r t s </s> ] ("corts")
    0.0001257791 [ <s> t r a n d </s> ] ("trand")
    0.0000691862 [ <s> f l a d </s> ] ("flad")
    0.0000521828 [ <s> d e c t i o n </s> ] ("dection")
    0.0000517939 [ <s> u n k i n g </s> ] ("unking")
    0.0000360357 [ <s> m i s l y </s> ] ("misly")
    0.0000355770 [ <s> d e n t i o n </s> ] ("dention")
    0.0000339071 [ <s> s a r i c </s> ] ("saric")
    0.0000275131 [ <s> h a n c h </s> ] ("hanch")
    0.0000201202 [ <s> h a i s m </s> ] ("haism")
    0.0000125679 [ <s> p a r g e ' s </s> ] ("parge's")
    0.0000069366 [ <s> t u t i c </s> ] ("tutic")
    0.0000055470 [ <s> p e n i s m </s> ] ("penism")
    0.0000054817 [ <s> h o r t l y </s> ] ("hortly")
    0.0000050649 [ <s> r e - o f f </s> ] ("re-off")
    0.0000030477 [ <s> p y r o f f </s> ] ("pyroff")
    0.0000021522 [ <s> m a b s t </s> ] ("mabst")
    0.0000010393 [ <s> w h a r c h </s> ] ("wharch")
    0.0000009782 [ <s> c h e m i s m ' s </s> ] ("chemism's")
    0.0000006504 [ <s> f a l l i d </s> ] ("fallid")
    0.0000006480 [ <s> d e l u c k </s> ] ("deluck")
    0.0000001139 [ <s> f r i b i o n s </s> ] ("fribions")
    0.0000000390 [ <s> e x p a g e l </s> ] ("expagel")
    0.0000000183 [ <s> p s i o l e s ' </s> ] ("psioles'")
    0.0000000168 [ <s> v a x i n a </s> ] ("vaxina")
    0.0000000038 [ <s> c a t t r o m e d </s> ] ("cattromed")
    0.0000000021 [ <s> n a t i c i v i n g </s> ] ("naticiving")
    0.0000000011 [ <s> h a f t - k o </s> ] ("haft-ko")
    0.0000000004 [ <s> d i t h e = o u t </s> ] ("dithe=out")
    0.0000000001 [ <s> b o y a t o r g e d </s> ] ("boyatorged")
    0.0000000000 [ <s> m e e b r a i r w a r s t s </s> ] ("meebrairwarsts")
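
      For anyone who wants to try something similar, here is a
    minimal sketch of the generation step in Python. It is not the
    program used for the list above: it uses plain relative-frequency
    trigram counts with no back-off or smoothing instead of the
    CMU-Cambridge toolkit, pads each word with two start symbols,
    and all the function names (spell, train_trigrams, generate)
    are made up for illustration.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # sentence-boundary symbols, as in the listing


def spell(word):
    """Write a word as space-separated letters, as in the corpus file."""
    return " ".join(word)


def train_trigrams(words):
    """Count next-symbol frequencies for each two-symbol history."""
    counts = defaultdict(Counter)
    for word in words:
        # Assumption: two start symbols so the first letter has a full history.
        symbols = [BOS, BOS] + list(word) + [EOS]
        for a, b, c in zip(symbols, symbols[1:], symbols[2:]):
            counts[(a, b)][c] += 1
    return counts


def generate(counts, rng):
    """Sample one letter 'sentence'; return (word, probability of the whole path)."""
    history = (BOS, BOS)
    letters, prob = [], 1.0
    while True:
        nxt = counts[history]
        choices = list(nxt)
        weights = [nxt[s] for s in choices]
        # Draw the next symbol in proportion to its observed frequency.
        symbol = rng.choices(choices, weights=weights)[0]
        prob *= nxt[symbol] / sum(weights)
        if symbol == EOS:
            return "".join(letters), prob
        letters.append(symbol)
        history = (history[1], symbol)


if __name__ == "__main__":
    # Toy word list (an assumption; the original used a ~400k-entry lexicon).
    model = train_trigrams(["spring", "pond", "back", "pen", "bed", "sped"])
    rng = random.Random(0)
    for _ in range(5):
        word, p = generate(model, rng)
        print(f"{p:.10f} [ {BOS} {spell(word)} {EOS} ]")
```

      The printed probability here is for the whole letter sequence;
    the sketch does not attempt the per-letter normalization used in
    the PR/NWORDS column.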

      And some of the bad ones were more interesting, such as:

    0.0007833181 [ <s> m c k </s> ] ("mck")

      This is probably due to the n-gram model's limited memory
    for context: a trigram model sees only the two preceding
    symbols. "#mc", "mck", and "ck#" all seem fairly common;
    but put them together, and you get something that's impossible.
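
      The effect is easy to reproduce on a toy scale. The sketch
    below (Python, unsmoothed trigram counts, with a four-word
    training list that I made up for illustration) scores "mck" as
    the product P(m | start) * P(c | start m) * P(k | m c) *
    P(end | c k). Each factor looks at only two symbols of history,
    so the string gets a nonzero probability even though it never
    occurs in the training words. The names train and score are
    hypothetical.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"


def train(words):
    """Count next-symbol frequencies for each two-symbol history."""
    counts = defaultdict(Counter)
    for w in words:
        s = [BOS, BOS] + list(w) + [EOS]
        for a, b, c in zip(s, s[1:], s[2:]):
            counts[(a, b)][c] += 1
    return counts


def score(counts, word):
    """Probability of a spelling as a product of trigram relative frequencies."""
    s = [BOS, BOS] + list(word) + [EOS]
    p = 1.0
    for a, b, c in zip(s, s[1:], s[2:]):
        total = sum(counts[(a, b)].values())
        # An unseen history or continuation gives probability zero (no back-off).
        p *= counts[(a, b)][c] / total if total else 0.0
    return p


# Toy training list (an assumption, not the original lexicon).
toy = ["mckay", "mccoy", "back", "sick"]
model = train(toy)

print("mck" in toy)         # the string itself was never seen
print(score(model, "mck"))  # yet every trigram in it was, so the score is nonzero
```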

     But we may be delucked by the fribions re-offing and going hortly fallid.

     - Bill F.


