Re: [Corpora-List] automatic search for orthographic recurring patterns

From: William Fletcher (fletcher@usna.edu)
Date: Wed Dec 08 2004 - 12:29:15 MET


    Hello Marc,

    For my "Phrases in English" site, where all chargrams of 1-3 characters
    in the BNC are tallied by initial, medial and final position
      http://pie.usna.edu/explorec.html
    I proceeded as follows:

    - normalize and tokenize the corpus and tally the tokens
     
    - take all types above a given frequency cutoff (I believe I used 15, to
    avoid foreign sequences in non-English names etc.) and output a list of
    types and frequencies

    - In view of memory constraints (with higher values of n you get a lot
    of unique chargrams), I made one pass for each combination of position
    and number of characters as follows:

       - initialize an "associative array" to tally the frequency of each
    chargram (I used the Windows dictionary object with PowerBasic)

       - read in the list of types and frequencies

       - break up each type into chargrams and add its frequency to the
    frequency of that chargram in that position, e.g. for the type "corpus"
    and a value of 2,
         "initial" pass: co
         "medial" pass: or rp pu
         "final" pass: us

       - sort the array in descending frequency order and output all
    chargrams that met my threshold

       - loop back and do the next combination of position and number of
    characters
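
    The steps above could be sketched in Python roughly as follows. This is
    only a minimal sketch, not the original PowerBasic code: the function
    names (tally_types, chargrams, tally_chargrams, all_passes), the
    letters-only tokenizer, and the reading of the cutoff and threshold as
    inclusive minimums are all assumptions made for illustration. A type
    exactly n characters long is counted as both initial and final here,
    which is also an assumption.

```python
import re
from collections import Counter

POSITIONS = ("initial", "medial", "final")


def tally_types(text, min_freq=15):
    """Normalize, tokenize and tally; keep types with frequency >= min_freq.

    Assumption: normalization is lowercasing and tokens are runs of a-z.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {t: f for t, f in counts.items() if f >= min_freq}


def chargrams(word, n, position):
    """Return the chargrams of length n found in `word` at `position`."""
    if len(word) < n:
        return []
    if position == "initial":
        return [word[:n]]
    if position == "final":
        return [word[-n:]]
    if position == "medial":
        # Every start index except the first and the last possible one.
        return [word[i:i + n] for i in range(1, len(word) - n)]
    raise ValueError(f"unknown position: {position}")


def tally_chargrams(type_freqs, n, position, threshold=1):
    """One pass: add each type's frequency to its chargrams' tallies."""
    tally = Counter()
    for word, freq in type_freqs.items():
        for gram in chargrams(word, n, position):
            tally[gram] += freq
    # most_common() yields (chargram, frequency) in descending frequency order.
    return [(g, f) for g, f in tally.most_common() if f >= threshold]


def all_passes(type_freqs, max_n=3, threshold=1):
    """Loop over every combination of chargram length and position."""
    for n in range(1, max_n + 1):
        for position in POSITIONS:
            yield position, n, tally_chargrams(type_freqs, n, position, threshold)
```

    For the type "corpus" and n = 2, chargrams() returns "co" for the
    initial pass, "or", "rp", "pu" for the medial pass, and "us" for the
    final pass. Because each pass builds its own tally, only one
    associative array is held in memory at a time.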

    (I used a "quick and dirty" ad-hoc implementation for PIE which could
    easily be adapted for command-line use. "Someday" I may integrate this
    capability into kfNgram to give it a nicer interface.)

    Hope this helps,
    Bill Fletcher

    >>> MARC FRYD <marc.fryd@univ-poitiers.fr> 12/08/04 3:38 AM >>>
    Hi,
    Perhaps someone on the List will be able to help me with the following
    datamining problem:

    Given a corpus of isolated lexical units or collocations, I would like
    to determine recurring orthographic patterns, whether initial, e.g.
    "CARPO" (carpogenic, carpogenous, carpolite), final, e.g. "IONALISM"
    (sensationalism, functionalism, etc.), or internal, e.g. "CHRON"
    (synchrony, synchronize, etc.).
    The output should be arranged so as to show respective productivity for
    each pattern.
    Important constraint: the various patterns will *not* be fed in
    initially but should be extracted as a result of the algorithm.
    I'll post a summary if I get several replies.
    Regards to all list members.
    Marc Fryd


