[Corpora-List] What proportion of letter ngrams occur in English?

From: Bruce L. Lambert, Ph.D. (lambertb@uic.edu)
Date: Fri Jan 23 2004 - 22:15:12 MET

    I am revisiting an issue I brought up to this list several years ago, that
    is, how many legal/pronounceable strings can be generated from a fixed
    alphabet for a string of a given length. For example, in the U.S., the
    average drug name is 8 characters long. Given an alphabet of 26 letters and
    8 sequential positions in the string, there are 26^8 possible strings. What
    proportion of these would actually be legal, pronounceable strings in
    English? It strikes me that, because of the strong sequential constraints
    on English orthography (and phonology), the pronounceable set is much,
    much, much smaller than the entire set of possible strings. But can we
    quantify this?
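
One rough way to put a number on this, sketched in Python: 26^8 works out to 208,827,064,576, and if we had a set of "legal" letter bigrams we could count how many strings of a given length use only legal adjacent pairs. This is only a sketch under a strong assumption, namely that bigram legality is an acceptable stand-in for pronounceability, which it almost certainly is not in full. The function name `count_strings` and the `legal_bigrams` set are hypothetical, made up for illustration.

```python
def count_strings(alphabet, legal_bigrams, length):
    """Count strings of `length` over `alphabet` in which every adjacent
    letter pair is a member of `legal_bigrams` (a set of 2-letter strings).

    Assumption: a string is "legal" if and only if all of its bigrams are.
    """
    if length == 0:
        return 1  # only the empty string
    # ways[c] = number of legal strings of the current length ending in c
    ways = {c: 1 for c in alphabet}
    for _ in range(length - 1):
        ways = {
            c: sum(n for prev, n in ways.items() if prev + c in legal_bigrams)
            for c in alphabet
        }
    return sum(ways.values())


# Size of the unconstrained space for 8-letter strings over 26 letters.
total = 26 ** 8
print(total)  # 208827064576

# Toy example: over {a, b}, allow only the alternating bigrams.
toy_legal = {"ab", "ba"}
print(count_strings("ab", toy_legal, 3))  # 2  ("aba" and "bab")
```

Dividing `count_strings(alphabet, legal_bigrams, 8)` by `26 ** 8` would give the proportion under this model. Counting by dynamic programming avoids enumerating all 26^8 strings; a trigram version would track the last two letters instead of one.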

    A related question: Of the 676 letter bigrams that can be constructed from
    a 26-letter alphabet, how many actually occur in English? Of the 17,576
    letter trigrams that can be constructed from the English alphabet, how many
    actually occur?
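
Given any word list, the observed counts are easy to tally; the answer will depend on the list used. A minimal Python sketch follows, assuming ngrams are taken within words only (not across word boundaries) and only over the letters a-z after lowercasing. The names `observed_ngrams` and `coverage` are hypothetical, and the two-word list is a toy input, not real data.

```python
from string import ascii_lowercase


def observed_ngrams(words, n):
    """Return the set of distinct letter n-grams found inside the given words.

    Assumptions: words are lowercased, and any n-gram containing a
    character outside a-z (apostrophe, hyphen, digit, ...) is skipped.
    """
    seen = set()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            if all(ch in ascii_lowercase for ch in gram):
                seen.add(gram)
    return seen


def coverage(words, n):
    """Fraction of the 26**n possible letter n-grams that were observed."""
    return len(observed_ngrams(words, n)) / 26 ** n


# Toy input only; a real estimate needs a large dictionary or corpus.
sample = ["the", "then"]
print(sorted(observed_ngrams(sample, 2)))  # ['en', 'he', 'th']
print(coverage(sample, 2))                 # 3 / 676
```

Calling `coverage(words, 2)` and `coverage(words, 3)` on a large word list would give the proportions of the 676 bigrams and 17,576 trigrams that occur in that list.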

    Is there a list of "legal" letter ngrams and/or phoneme ngrams? How can I
    learn more about these sequential constraints?

    -bruce


