[Corpora-List] Reference corpora for IE available

From: Alberto Lavelli (lavelli@itc.it)
Date: Wed Oct 20 2004 - 18:31:08 MET DST

    On the following page

     http://nlp.shef.ac.uk/dot.kom/resources.html

    two datasets for Information Extraction (IE) are made available:

     - a new "corrected" version of the Seminar Announcements dataset
       (see below for details of the changes with respect to the version
       available in the RISE repository [1]);

     - the first publicly available version of the Corporate Acquisitions
       dataset.

    A link to the page above will soon be available in the RISE
    repository. This effort is part of an activity on the evaluation
    methodology for IE [2] carried out by Mary Elaine Califf
    (Illinois State University), Fabio Ciravegna (University of
    Sheffield), Dayne Freitag (Fair Isaac Corporation), Nick Kushmerick
    (University College Dublin), and the Dot.Kom group at ITC-irst (i.e.,
    Claudio Giuliano, Alberto Lavelli and Lorenza Romano). The work
    has been carried out within the Dot.Kom EU project
    (http://www.dot-kom.org).

    Seminar Announcements

    Main changes with respect to version v1.0 (i.e., the RISE version):

     - obvious annotation errors were corrected

     - the Windows file-naming convention was adopted. Some versions of
       Windows appear to have problems with certain characters (e.g.,
       ":") in filenames; to avoid these problems, we replaced ":"
       with "_".

     - all <sentence> and <paragraph> tags were stripped from the corpus

     - the documents were made XML-compliant
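
    As an illustration only, the filename and tag changes listed above
    could be reproduced with something like the following Python sketch.
    The function names and the sample filename are hypothetical, and this
    is not the script actually used to prepare the corpus.

```python
import re

# Hypothetical helpers; an assumption-based sketch, not the original tooling.

def windows_safe_name(filename: str) -> str:
    """Replace ':' with '_' so the filename is accepted by Windows."""
    return filename.replace(":", "_")

# Matches opening and closing <sentence> / <paragraph> tags only.
_STRUCTURE_TAG = re.compile(r"</?(?:sentence|paragraph)\b[^>]*>")

def strip_structure_tags(text: str) -> str:
    """Drop <sentence> and <paragraph> tags, keeping the text they enclose."""
    return _STRUCTURE_TAG.sub("", text)

if __name__ == "__main__":
    # Sample inputs are made up for illustration.
    print(windows_safe_name("seminar-12:30"))
    print(strip_structure_tags(
        "<paragraph><sentence>Talk at <stime>3:30</stime>.</sentence></paragraph>"
    ))
```

    Other annotation tags (such as the <stime> tag in the made-up sample)
    are left untouched, since the pattern matches only the two tag names
    being removed.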

    Corporate Acquisitions

    The documents are XML-compliant. Please note that this dataset was
    not previously available in the RISE repository.
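
    Users who want to confirm this on their own copy could run a check
    along the following lines. It is a minimal sketch that assumes
    "XML-compliant" means well-formed XML; the function name and the
    sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

if __name__ == "__main__":
    # Made-up snippets, not taken from the dataset.
    print(is_well_formed("<doc><purchaser>Acme</purchaser></doc>"))
    print(is_well_formed("<doc><purchaser>Acme</doc>"))
```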

    References

    [1] RISE. A Repository of Online Information Sources Used in
    Information Extraction Tasks. Information Sciences Institute / USC,
    1998.

    [2] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne
    Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano. IE
    evaluation: Criticisms and recommendations. In Proceedings of the
    AAAI-04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004),
    San Jose, California, 26 July 2004.


