Re: [Corpora-List] What proportion of letter ngrams occur in English?

From: William Fletcher (fletcher@usna.edu)
Date: Mon Feb 02 2004 - 14:55:21 MET


    To answer this question I used an unreleased version of kfNgram to find
    all 2- and 3-chargrams in "words" occurring 15 or more times in the BNC,
    where a "word" is defined as a sequence of alphabetic characters. There
    are:
      648 2-chargrams
      7,781 3-chargrams
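
    To put these counts next to the 676 possible bigrams and 17576
    possible trigrams mentioned in the original query, here is a minimal
    Python sketch of the arithmetic. The function name coverage is only
    illustrative, and the figures say nothing about which sequences are
    "legal", only how many were observed.

```python
# Share of the possible letter n-grams (26-letter alphabet) that were observed.
ALPHABET_SIZE = 26

# Counts reported in this message for BNC words occurring 15 or more times.
observed = {2: 648, 3: 7781}


def coverage(n, found):
    """Return (number of possible n-grams, fraction of them observed)."""
    possible = ALPHABET_SIZE ** n
    return possible, found / possible


for n, found in observed.items():
    possible, frac = coverage(n, found)
    print(f"{n}-chargrams: {found} of {possible} ({frac:.1%})")
# prints:
# 2-chargrams: 648 of 676 (95.9%)
# 3-chargrams: 7781 of 17576 (44.3%)
```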

    Many of the combinations probably do not reflect "legal" English
    sequences, as there are abbreviations and foreign words in the corpus.

    To help determine which sequences are most common (and thus most
    English) I have made lists sorted by descending frequency as well as
    alphabetically, with each chargram's frequency in types and in tokens.
    Set the cutoff point wherever you wish. The lists are available in a
    zip archive on both my sites:
    Phrases in English
       http://pie.usna.edu/BNCCharGrams.zip

    KWiCFinder
       http://kwicfinder.com/BNCCharGrams.zip
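
    For anyone who wants to build comparable lists from their own word
    frequency data, here is a rough Python sketch. It is not kfNgram's
    actual method. The names chargrams and count_chargrams, the toy
    word_freqs dictionary, the lowercasing step, and the restriction to
    ASCII letters are all assumptions made for illustration. Type
    frequency here counts every occurrence of a chargram across distinct
    words; token frequency weights each occurrence by the word's corpus
    frequency.

```python
from collections import Counter


def chargrams(word, n):
    """Return all contiguous character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def count_chargrams(word_freqs, n, min_freq=15):
    """Count n-chargrams over words seen at least min_freq times.

    word_freqs maps each word to its corpus frequency (hypothetical input).
    Returns (type_counts, token_counts) as Counter objects.
    """
    type_counts = Counter()
    token_counts = Counter()
    for word, freq in word_freqs.items():
        # Assumption: keep only purely alphabetic ASCII words above the cutoff.
        if freq < min_freq or not (word.isascii() and word.isalpha()):
            continue
        for gram in chargrams(word.lower(), n):
            type_counts[gram] += 1      # one per occurrence in a distinct word
            token_counts[gram] += freq  # weighted by the word's frequency
    return type_counts, token_counts


# Toy data, invented for illustration only (not BNC figures).
word_freqs = {"the": 100, "then": 20, "xyzzy": 3}
types, tokens = count_chargrams(word_freqs, 2)

# Sorted by descending token frequency, ties broken alphabetically.
for gram, freq in sorted(tokens.items(), key=lambda kv: (-kv[1], kv[0])):
    print(gram, types[gram], freq)
```

    Sorting the same counters by key instead gives the alphabetical
    listing.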

    Sorry you had to wait years for an answer to the easy part of your
    query, Bruce! I look forward to your analysis of the data.

    Bill Fletcher

    AssocProf William H. Fletcher
    Language Studies Department
    United States Naval Academy
    Annapolis MD 21402 5030

    410-293-6362 [voice]
    410-293-2729 [fax]
    Department
       http://usna.edu/LangStudy/
    Phrases in English
       http://pie.usna.edu/
    KWiCFinder
       http://kwicfinder.com/

    >>> "Bruce L. Lambert, Ph.D." <lambertb@uic.edu> 1/23/2004 4:15:12 PM
    >>>
    I am revisiting an issue I brought up to this list several years ago,
    that is, how many legal/pronounceable strings can be generated from a
    fixed alphabet for a string of a given length. For example, in the
    U.S., the average drug name is 8 characters long. Given an alphabet of
    26 letters and 8 sequential positions in the string, there are 26^8
    possible strings. What proportion of these would actually be legal,
    pronounceable strings in English? It strikes me that, because of the
    strong sequential constraints on English orthography (and phonology),
    the pronounceable set is much, much, much smaller than the entire set
    of possible strings. But can we quantify this?

    A related question: Of the 676 letter bigrams that can be constructed
    from a 26 letter alphabet, how many actually occur in English? Of the
    17576 letter trigrams that can be constructed from the English
    alphabet, how many actually occur?

    Is there a list of "legal" letter ngrams and/or phoneme ngrams? How can
    I learn more about these sequential constraints?

    -bruce


