[Corpora-List] web as corpus at cl2005: call for expressions of interest

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Sat Nov 06 2004 - 15:10:54 MET

    Apologies for cross-posting.

    PLANNED COLLOQUIUM ON "THE WEB AS A CORPUS" AT CORPUS LINGUISTICS 2005

    MOTIVATION

    The World Wide Web is a mine of language data of unprecedented richness
    and ease of access (Kilgarriff and Grefenstette 2003). A growing body of
    studies has shown that simple algorithms using Web-based evidence are
    successful at many linguistic tasks, often outperforming sophisticated
    methods based on smaller but more controlled data sources (e.g., Turney
    2001, Keller and Lapata 2003), despite the many peculiarities of the data
    used in this way.

    Current Internet-based linguistic studies differ in terms of strategies
    used to access Web data. For example, some researchers collect frequency
    data directly from commercial search engines (e.g., Turney 2001). Others
    use a search engine to find relevant pages, and then retrieve the pages to
    build a corpus (e.g., Ghani et al. 2001, Baroni and Bernardini 2004).
    Others yet build a corpus by spidering the web and manage the data with an
    ad-hoc search engine (e.g., Terra and Clarke 2003).

    Different approaches have also been proposed to the task of sharing
    web-derived data. For example, some researchers make web-mining tools
    available (e.g., Fletcher 2000, Baroni and Bernardini 2004) while others
    provide URL lists that allow users to construct web-corpora (e.g., Ghani
    et al. 2001, Resnik and Smith 2003), and others yet have proposed
    prototypes of Internet search engines for the linguists' community (Kehoe
    and Renouf 2002, Fletcher 2002, Kilgarriff 2003, Resnik and Elkiss
    2003).

    Many fundamental issues about the viability and exploitation of the
    web as a linguistic corpus must still be explored, or are just
    starting to be tackled. Some of these issues are of theoretical
    interest, such as word frequency distributions and topical biases in
    Internet documents, while others pertain to equally important
    implementational and practical aspects, such as efficient handling of
    massive data sets and the legal standing of indexing for linguistic
    purposes.

    Thus, we believe that research on the web as a corpus is currently
    at a very exciting stage: increasing evidence points to the enormous
    potential of the Internet as a source of linguistic data, but we are
    still far removed from anything like a working, fully-fledged
    linguist's search engine.

    CALL FOR EXPRESSIONS OF INTEREST

    We are planning a colloquium to be held at Corpus Linguistics 2005
    (Birmingham, UK, 14-17 July 2005) in which scholars using (or planning to
    use) the web as a corpus can meet to share experiences and plans.

    Anybody interested in actively participating in the event, by presenting a
    paper on a relevant topic and/or a demonstration of an existing system,
    should fill in the online expression-of-interest form at the address specified
    below, as soon as possible, and in any case by DECEMBER 14 2004, to give
    us time to prepare the official colloquium proposal to be submitted for
    review (deadline for submission of colloquium proposals: January 14 2005).

    We will get in touch with those who submitted expressions of interest as
    soon as possible, and in any case by early January 2005.

    WEB-AS-CORPUS COLLOQUIUM ORGANIZERS

    Adam Kilgarriff (Lexicography MasterClass)
    Marco Baroni (University of Bologna)

    WEB-AS-CORPUS EXPRESSION OF INTEREST FORM

    http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

    CORPUS LINGUISTICS 2005 WEBSITE

    http://www.corpus.bham.ac.uk/conference/


