[Corpora-List] First Announcement: Pascal Challenge on Evaluating Machine Learning for Information Extraction from Documents

From: Fabio Ciravegna (f.ciravegna@dcs.shef.ac.uk)
Date: Mon Jun 14 2004 - 12:58:17 MET DST

    ****apologies for multiple postings****

    First Announcement and Call for Participation in the

          Pascal Challenge on Evaluating Machine Learning for Information
          Extraction from Documents

    The Dot.Kom European project and the Pascal Network of Excellence invite
    you to participate in the Challenge on Evaluation of Machine Learning
    for Information Extraction from Documents. The goal of the challenge is
    to assess the current state of Machine Learning (ML) algorithms for
    Information Extraction (IE), to identify future challenges, and to
    foster additional research in the field. Given a corpus of annotated
    documents, participants will be expected to perform a number of
    tasks, each examining a different aspect of the learning process.
    A full description of the challenge can be found at
    http://nlp.shef.ac.uk/pascal/

          Corpus

    A standardised corpus of 1100 workshop Calls for Papers (CFPs) will be
    provided. 600 of these documents will be annotated with 12 tags that
    relate to pertinent information (names, locations, dates, etc.). Of the
    annotated documents, 400 will be provided to the participants as a
    training set; the remaining 200 will form the unseen test set used in
    the final evaluation. All the documents will be pre-processed to include
    tokenisation, part-of-speech and named-entity information.

          Tasks

    Full scenario: The only mandatory task for participants is learning to
    annotate implicit information: given the 400 training documents, learn
    the textual patterns necessary to extract the annotated information.
    Each participant will provide the results of a four-fold
    cross-validation experiment, using the same document partitions, for
    pre-competitive tests. A final test will be performed on the 200 unseen
    documents.
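
    As an illustration only, the four-fold protocol above could be
    sketched as follows. The announcement implies that the partitions are
    fixed and shared, so the seeded shuffle, the document identifiers and
    the function names below are assumptions made for the sake of a
    runnable example, not part of the challenge definition.

```python
import random


def make_folds(doc_ids, k=4, seed=0):
    """Split doc_ids into k disjoint folds of near-equal size.

    The seed makes the partition reproducible (an assumption: the real
    partitions would be supplied by the organisers).
    """
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def cross_validation_splits(folds):
    """Yield (train_ids, test_ids) pairs, holding out one fold at a time."""
    for i, held_out in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, held_out


# Hypothetical identifiers for the 400 annotated training documents.
doc_ids = [f"cfp_{n:04d}" for n in range(400)]
folds = make_folds(doc_ids)
splits = list(cross_validation_splits(folds))
```

    With 400 documents, each of the four runs trains on 300 documents and
    tests on the remaining 100.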

    Active learning: Learning to select documents. The 400 training
    documents will be divided into fixed subsets of increasing size (e.g.
    10, 20, 30, 50, 75, 100, 150, and 200). Training on these subsets will
    show the effect of limited resources on the learning process.
    Secondly, given each subset, the participants can select the documents
    to add in order to reach the next size (i.e. 10 to 20, 20 to 30, etc.),
    thus showing the ability to select the most suitable set of documents to
    annotate.
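
    The second part of this task, growing the training set from one size
    to the next, might look like the sketch below. The sizes are the
    example values quoted above. The scoring function is a placeholder
    for whatever selection criterion a participant uses (for instance,
    the uncertainty of the current model), and all names are hypothetical.

```python
import random

SIZES = [10, 20, 30, 50, 75, 100, 150, 200]  # example sizes from the text


def grow_training_set(pool, sizes, score, seed=0):
    """Return nested training subsets, one per entry in sizes.

    The first subset is a random draw. Each later subset keeps the
    previous one and adds the highest-scoring remaining documents.
    score(doc_id, selected) is supplied by the caller.
    """
    remaining = sorted(pool)
    random.Random(seed).shuffle(remaining)
    selected = remaining[:sizes[0]]
    remaining = remaining[sizes[0]:]
    subsets = [list(selected)]
    for size in sizes[1:]:
        need = size - len(selected)
        # Rank the unused documents by the caller's selection criterion.
        remaining.sort(key=lambda d: score(d, selected), reverse=True)
        selected = selected + remaining[:need]
        remaining = remaining[need:]
        subsets.append(list(selected))
    return subsets


# Stand-in scores: random numbers in place of a real model's uncertainty.
pool = [f"cfp_{n:04d}" for n in range(400)]
rng = random.Random(1)
uncertainty = {d: rng.random() for d in pool}
subsets = grow_training_set(pool, SIZES, lambda d, chosen: uncertainty[d])
```

    Each subset contains the previous one, so results at successive sizes
    can be compared directly.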

    Enriched scenario: the same procedure as the full scenario, except that
    the participants will be able to use the unannotated part of the corpus
    (500 documents). This will show how the use of unsupervised or
    semi-supervised methods can improve the results of supervised
    approaches. An interesting variant of this task could concern the use
    of unlimited resources, e.g. the Web.

          Participation

    Participants from different fields such as machine learning, text
    mining, natural language processing, etc. are welcome. Participation in
    the challenge is free. After registration, participants will receive the
    corpus of documents to train on and precise instructions on the
    tasks to be performed. At an established date, participants will be
    required to submit their systems’ answers via a Web portal. An automatic
    scorer will compute the accuracy of extraction. Participants will have
    to produce a paper describing their system and the results obtained.
    Results of the challenge will be discussed in a dedicated workshop.

          Timetable

    - 30th June 2004: Registration starts; the formal definition of the
    tasks, the annotated corpus and the evaluation server will be made
    available to participants
    - 15th October 2004: Formal evaluation
    - November 2004: Presentation of evaluation at Pascal workshop

          Organizers

    Fabio Ciravegna, University of Sheffield, UK (coordinator)
    Mary Elaine Califf, Illinois State University, USA
    Dayne Freitag, Fair Isaac Technologies, USA
    Nicholas Kushmerick, University College Dublin, Ireland
    Alberto Lavelli, ITC-Irst, Italy

    Local Organizer: Neil Ireson, University of Sheffield.

          Further Information

    For further details about the challenge, visit http://nlp.shef.ac.uk/pascal/

    For general enquiries about the challenge and its motivations, contact
    Fabio Ciravegna (F.Ciravegna@dcs.shef.ac.uk). For details about
    participation, registration and technical queries, please contact Neil
    Ireson (N.Ireson@dcs.shef.ac.uk).

    -- 
    Professor Fabio Ciravegna,
    Department of Computer Science, University of Sheffield,
    Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
    Tel:+44(0)114-22.21940, Fax:+44(0)114-22.21810
    www: http://www.dcs.shef.ac.uk/~fabio/
    


