Corpora: The Web as a corpus

From: Andrew Kehoe (andrew@rdues.liv.ac.uk)
Date: Tue May 02 2000 - 16:15:44 MET DST

    The WebCorp Project

    Research and Development for English Studies
    University of Liverpool
    U.K.

    Dear corpus linguists

    However large and up-to-date the electronic text corpora available
    are, there will always be aspects of the language which are too rare
    or too new to be evidenced in them. For some time, this Unit has
    therefore been developing an Internet search tool which allows on-line
    access to Web texts as linguistic rather than information sources.

    The prototype version of the tool can be tested at:
    http://webcorp.connect.org.uk/

    The tool allows the user to submit a word or phrase for which
    instances of usage are required. The search term is submitted to a
    web search engine of the user's choice, and the tool then visits all
    the web sites found by the search engine, automatically extracting
    concordance lines from them. The search is currently customisable in
    terms of contextual span, case sensitivity and output format, with
    further options under development.
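
    To illustrate what extracting concordance lines involves, here is a
    minimal Python sketch of a keyword-in-context routine. It is not
    the WebCorp implementation: the function name `concordance` and the
    parameters `span` and `case_sensitive` are hypothetical, chosen
    only to mirror the contextual span and case sensitivity options
    mentioned above, and the text is assumed to have been fetched and
    stripped of markup already.

```python
import re


def concordance(text, term, span=30, case_sensitive=False):
    """Return one keyword-in-context line per occurrence of `term`.

    `span` is the number of characters of context kept on each side
    (an assumption; a real tool might count words instead).
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape treats the term literally; \b stops "corp" matching
    # inside "corpus". This assumes the term begins and ends with a
    # word character.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags)
    # Collapse runs of whitespace so line breaks in the source text
    # do not split a phrase or distort the context window.
    flat = " ".join(text.split())
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - span):match.start()]
        right = flat[match.end():match.end() + span]
        # Right-align the left context so the keywords line up.
        lines.append(f"{left:>{span}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = "The Web is a corpus. The web grows daily."
    for line in concordance(sample, "web", span=12):
        print(line)
```

    Counting the span in characters keeps the sketch short; a span
    measured in words or sentences would need a tokenisation step
    before the context is cut.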

    The user is not required to specify particular web sites to be
    searched. Instead, the tool searches all sites on the web which are
    accessible via the chosen search engine. One of the search engine
    options available is Metacrawler, which itself searches other search
    engines, maximising coverage and automatically removing duplicate
    results.
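
    As a rough illustration of merging results from several search
    engines and removing duplicates, the Python sketch below combines
    lists of URLs and keeps the first occurrence of each. It does not
    describe how Metacrawler or this tool actually works: the function
    names `normalise` and `merge_results` and the normalisation rules
    (lower-casing the scheme and host, dropping a trailing slash and
    any fragment) are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Reduce a URL to a comparison key (hypothetical rules)."""
    parts = urlsplit(url)
    # Treat "/a/" and "/a" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Scheme and host are case-insensitive; the fragment is dropped
    # because it does not identify a different document.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def merge_results(*result_lists):
    """Merge URL lists in order, keeping the first copy of each page."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            key = normalise(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


if __name__ == "__main__":
    engine_a = ["http://Example.org/a/", "http://example.org/b"]
    engine_b = ["http://example.org/a", "http://example.org/c#top"]
    for url in merge_results(engine_a, engine_b):
        print(url)
```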

    The tool is available for trial and you are kindly requested to
    provide feedback on your experience and needs, which will be taken
    into account in ongoing development.

    Andrew Kehoe
    RDUES
    andrew@rdues.liv.ac.uk


