[Corpora-List] "Phrases in English" database -- new features

From: William Fletcher (fletcher@usna.edu)
Date: Mon Mar 29 2004 - 00:41:37 MET DST

    Apologies for cross-posting

    Since its launch in December 2003, the "Phrases in English" (PIE) website has gained several new features (see below for general information):
       http://pie.usna.edu

     -- "Explore POS-Grams" supports investigating Part Of Speech patterns by frequencies of Types or Tokens.

     -- "Simple Search" for n-grams offers a focused user interface designed to reduce query errors. Special features include:

        - automatic checking and correction of multi-word units (of course > of_course; don't > do n't)

        - "optional wildwords" for fuzzy searches (_the +{AJ?} ~{AJ?} days_ matches both _the good old days_ and _the good days_)

        - "tamecard" search for hyphenated forms matches variants with a space and/or nothing (_data-base_ also matches _data base_ and _database_).

     -- Click on any n-gram to see 50 concordances from the BNC, with information on source texts.

     -- "Chargrams", i.e. sequences of n characters, where n falls in the range 1-3. Occurrences of letter sequences can be explored either by position (initial, medial, final) or by frequency in types or tokens.
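
    Two of the features listed above are easy to picture with a few lines of Python. What follows is only a sketch and not PIE's actual code: the function names, the treatment of "medial" as "touching neither edge of the word", and the toy word frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def tamecard_variants(form):
    """Expand a hyphenated form into its hyphen / space / solid spellings.

    Hypothetical helper: each hyphen is replaced in turn by "-", " " or "".
    """
    parts = form.split("-")
    variants = []
    for joins in product(("-", " ", ""), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(joins, parts[1:]):
            pieces.append(sep + part)
        variants.append("".join(pieces))
    return variants


def chargram_counts(word_freqs, n, position):
    """Count character n-grams at one position, by types and by tokens.

    word_freqs maps each word type to its corpus frequency.
    Assumption: a chargram is counted once per word type, even if it
    occurs twice in the same word.
    """
    types, tokens = Counter(), Counter()
    for word, freq in word_freqs.items():
        if len(word) < n:
            continue
        if position == "initial":
            grams = {word[:n]}
        elif position == "final":
            grams = {word[-n:]}
        elif position == "medial":
            # Assumed definition: sequences touching neither word edge.
            grams = {word[i:i + n] for i in range(1, len(word) - n)}
        else:
            raise ValueError("position must be initial, medial or final")
        for gram in grams:
            types[gram] += 1      # one more word type contains it
            tokens[gram] += freq  # weighted by how often that word occurs
    return types, tokens
```

    With these definitions, tamecard_variants("data-base") returns ["data-base", "data base", "database"], and for the invented frequencies {"the": 10, "then": 3, "bath": 2} the initial 2-gram "th" counts as 2 types but 13 tokens.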

    Various improvements have resulted directly from user suggestions. All feedback on these and other features will be received enthusiastically!

    - - - - - - - - - - - - - - - - - - - - - - -

    PIE incorporates a database of all 1-6-grams (phrases 1-6 "words" long) with part-of-speech (POS) codes occurring three or more times in the 100-million-word British National Corpus (BNC). One can explore English phraseology either through lists of forms and their frequencies or by searching for specific forms or collocations, e.g. 2-grams of the pattern "ADJ work", to find the most frequent adjectives describing _work_.

    PIE also offers a phrase pattern discovery tool, "phrase-frames": sets of variants of an n-gram identical except for one word (wildcard symbol *). The most frequent and productive 4-frame is "the * of the", with variants such as "the end of the", "the rest of the", "the top of the" and "the nature of the".
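
    The idea of a phrase-frame can be sketched in Python as follows. This is an illustration under assumptions, not the generator used on the site: the function names are hypothetical, every position (including the first and last word) is allowed to be the variable slot, and the frequencies are invented.

```python
from collections import defaultdict


def phrase_frames(ngram_freqs):
    """Group n-grams into frames that differ in exactly one word.

    ngram_freqs maps a space-separated n-gram to its frequency.
    Returns {frame: {filler_word: frequency}}, with "*" marking the slot.
    """
    frames = defaultdict(dict)
    for gram, freq in ngram_freqs.items():
        words = gram.split()
        for i, word in enumerate(words):
            frame = " ".join(words[:i] + ["*"] + words[i + 1:])
            frames[frame][word] = freq
    return dict(frames)


def most_productive(frames):
    """Pick the frame with the most variants, breaking ties by total frequency."""
    return max(frames, key=lambda f: (len(frames[f]), sum(frames[f].values())))


# Invented frequencies for illustration only.
toy = {
    "the end of the": 50,
    "the rest of the": 40,
    "the top of the": 20,
    "the nature of the": 10,
    "at the end of": 30,
}
frames = phrase_frames(toy)
best = most_productive(frames)
print(best, sorted(frames[best], key=frames[best].get, reverse=True))
```

    Here "productive" is taken to mean the number of distinct fillers and "frequent" the summed frequency of the variants; other rankings are equally possible.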

    Over the next year PIE will add:

     -- Filtering by text type (domain, genre, target audience) for contrastive studies

     -- Query by regular expression (currently only wildcards are supported)

    In addition, when POS-tagging of the Michigan Corpus of Academic Spoken English (MICASE) http://www.hti.umich.edu/micase/ is complete, a similar database will be created with those data. Finally, when a substantial portion of the American National Corpus (ANC) http://americannationalcorpus.org has been released, a third parallel database will be built. Together these databases will permit comparative studies of phraseology in the principal variants of English.

    Please note:

     -- "Unfiltered" queries which match very large datasets can take a couple of minutes to complete. Please be patient; read the tutorials and FAQ to focus your queries.

     -- Users who cannot access the above site may use
         http://kwicfinder.com/BNC/ (please let me know so we can investigate)

    Acknowledgements

    Above all I am grateful to Michael Stubbs of the University of Trier for detailed suggestions and ongoing discussions that led to the creation and refinement of this site; even the easy-to-remember acronym ("easy as pie") goes back to him. His research assistants contributed as well: Isabel Barth implemented the original phrase-frame generator, and Katrin Ungeheuer offered valuable comments on organization and user interface for query by text type. Finally, Lou Burnard of the BNC Consortium and David Lee of MICASE granted essential permissions and provided useful feedback on the site.

    Bill Fletcher

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Sending an attachment? See below.
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    AssocProf William H. Fletcher
    Language Studies Department
    United States Naval Academy
    Annapolis MD 21402 5030

    410-293-6362 [voice]
    410-293-2729 [fax]
    Department
       http://usna.edu/LangStudy/
    Phrases in English
       http://pie.usna.edu/
    KWiCFinder
       http://kwicfinder.com/
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        Don't worry about other people
        stealing your ideas. If your ideas
        are any good, you'll have to ram
        them down people's throats.
                                      --Howard Aiken

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Our mail server deletes messages with
      certain kinds of attachments without
      notifying the sender or recipient.

      If sending a .doc, .exe or .zip file, please
      rename it to delete the extension before
      sending and let me know in the body
      of the message what kind of file it is.


