[Corpora-List] New website "Phrases in English"

From: William Fletcher (fletcher@usna.edu)
Date: Thu Dec 11 2003 - 17:37:04 MET


    Apologies for cross-posting

    A new website, "Phrases in English" (PIE), has been launched:
      http://pie.usna.edu
    While still under development, PIE already offers much to both linguists and students, and additional features will increase its scope in the future.

    PIE incorporates a database of all n-grams from 1 to 6 "words" long, with part-of-speech (POS) codes, that occur three or more times in the 100-million-word British National Corpus (BNC). One can explore English phraseology either by browsing lists of forms and their frequencies or by searching for specific forms or collocations, e.g. 2-grams of the pattern "ADJ work" to find the most frequent adjectives describing work.
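
    As a rough illustration of the idea (not PIE's actual implementation), the following Python sketch counts all 1- to 6-grams in a tiny hand-made tagged corpus, keeps those occurring three or more times, and then looks up 2-grams of the pattern "ADJ work". The function names, the toy sentences and the simplified ADJ/NOUN tags are assumptions made for this example; the BNC uses its own tagset.

```python
from collections import Counter

def count_ngrams(sentences, max_n=6, min_freq=3):
    """Count all 1..max_n-grams of (word, tag) tokens within each sentence
    and keep only those seen at least min_freq times."""
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

def adj_before(ngrams, word):
    """Return (adjective, frequency) pairs for 2-grams matching 'ADJ <word>',
    most frequent first."""
    hits = [
        (gram[0][0], c)
        for gram, c in ngrams.items()
        if len(gram) == 2 and gram[0][1] == "ADJ" and gram[1][0] == word
    ]
    return sorted(hits, key=lambda x: (-x[1], x[0]))

# Toy corpus with hypothetical simplified tags (not the BNC tagset).
toy = (
    [[("hard", "ADJ"), ("work", "NOUN")]] * 4
    + [[("good", "ADJ"), ("work", "NOUN")]] * 3
    + [[("new", "ADJ"), ("work", "NOUN")]] * 2  # falls below the threshold
)
ngrams = count_ngrams(toy)
print(adj_before(ngrams, "work"))  # [('hard', 4), ('good', 3)]
```

    Here "new work" occurs only twice, so it is dropped by the frequency threshold, just as n-grams occurring fewer than three times are absent from the database.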

    PIE also offers a phrase pattern discovery tool, "phrase-frames": sets of variants of an n-gram that are identical except for one word (shown by the wildcard symbol *). The most frequent and productive 4-frame is "the * of the", with variants such as "the end of the", "the rest of the", "the top of the" and "the nature of the".
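
    The notion of a phrase-frame can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the generator used on the site: the function names are hypothetical, the frequencies are made up for the example, and the sketch simply tries the wildcard in every position of each n-gram.

```python
from collections import defaultdict

def phrase_frames(ngram_counts):
    """Map each frame (an n-gram with one slot replaced by '*')
    to a dict of {slot filler: frequency of the full n-gram}."""
    frames = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        for i in range(len(gram)):
            frame = gram[:i] + ("*",) + gram[i + 1:]
            frames[frame][gram[i]] = freq
    return frames

def summarize(frames, min_variants=2):
    """List (frame, number of variants, total frequency) for frames with at
    least min_variants fillers, most productive first."""
    rows = [
        (" ".join(frame), len(fillers), sum(fillers.values()))
        for frame, fillers in frames.items()
        if len(fillers) >= min_variants
    ]
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))

# Invented frequencies, for illustration only (not BNC counts).
toy_counts = {
    ("the", "end", "of", "the"): 50,
    ("the", "rest", "of", "the"): 40,
    ("the", "top", "of", "the"): 30,
    ("the", "nature", "of", "the"): 20,
    ("at", "the", "end", "of"): 10,
}
print(summarize(phrase_frames(toy_counts)))  # [('the * of the', 4, 140)]
```

    Ranking by number of variants first and total frequency second is one possible reading of "productive" and "frequent"; other orderings are equally plausible.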

    Over the next year PIE will add:

      -- Click on an n-gram in the query results to see concordances from the BNC

      -- POS-grams and POS-frames for studying the relative productivity of phrase structures

      -- Filtering by text type (domain, genre, target audience) for contrastive studies

      -- Query by regular expression (currently only wildcards are supported)

    In addition, when POS-tagging of the Michigan Corpus of Academic Spoken English (MICASE) http://www.hti.umich.edu/micase/ is complete, a similar database will be created with those data. Finally, when a substantial portion of the American National Corpus (ANC) http://americannationalcorpus.org has been released, a third parallel database will be built. Together these databases will permit comparative studies of phraseology in the principal variants of English.

    Please note:

      -- "Unfiltered" queries which match very large datasets can take several minutes to complete. Please be patient; read the tutorials and FAQ to focus your queries.

      -- Users who cannot access the above site may use
          http://kwicfinder.com/BNC/ (please let me know so we can investigate)

    Acknowledgements

    Above all, I am grateful to Michael Stubbs of the University of Trier for detailed suggestions and ongoing discussions that led to the creation and refinement of this site; even the easy-to-remember ("easy as pie") acronym goes back to him. His research assistants contributed as well: Isabel Barth implemented the original phrase-frame generator, and Katrin Ungeheuer offered valuable comments on the organization and user interface for query by text type. Finally, Lou Burnard of the BNC Consortium and David Lee of MICASE granted essential permissions and provided useful feedback on the site.

    All user feedback will be received enthusiastically!

    Bill Fletcher

    fletcher AT usna.edu
    fletcher AT kwicfinder.com

    http://pie.usna.edu
    http://kwicfinder.com


