Corpora: LREC WORKSHOP : Data Architectures and Software Support for Large Corpora

From: Nancy M. Ide (ide@cs.vassar.edu)
Date: Tue Feb 22 2000 - 18:27:26 MET

            *****************************************************************
                               SECOND CALL FOR PAPERS

                                   LREC WORKSHOP

               DATA ARCHITECTURES AND SOFTWARE SUPPORT FOR LARGE CORPORA

                                   May 30, 2000
                                  ATHENS, GREECE

                     http://www.cs.vassar.edu/~ide/anc/lrec.html

           ******************************************************************

                         SUBMISSION DEADLINE : MARCH 7, 2000

            Several software systems for linguistic annotation, search,
            and retrieval of large corpora have been developed within the
            natural language processing community over the past several
            years, including LT-XML (Edinburgh), GATE (Sheffield), IMS
            Corpus Workbench (Stuttgart), Alembic Workbench (Mitre), MATE
            (Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), SARA
            (BNC), and several others. Related to and in support of this
            development, there have also been efforts to develop standards
            for encoding and various kinds of linguistic annotation, as
            well as data architectures (e.g., TIPSTER, TalkBank).
            Still other developments, such as the introduction of XML
            and the powerful XSL transformation language and work on
            semi-structured data (e.g., the work of the Lore group at
            Stanford), have also impacted the ways in which corpora and
            other linguistic resources can be represented, stored, and
            accessed.

            Current systems for the annotation and exploitation of
            linguistic corpora vary in the fundamental design of their
            formats, data, and tools. A primary reason for this
            diversity is that most developers are concerned with only one
            aspect of the creation/annotation/exploitation
            process. However, in order to work effectively toward
            commonality, the phases of the process must be considered as a
            whole. This demands bringing together researchers and
            developers from a variety of domains in text, speech, video,
            etc., many of whom have previously had little or no contact.

            This workshop is intended to bring these groups together to
            look broadly at the technical issues that bear on the
            development of software systems for the annotation and
            exploitation of linguistic resources. The goal is to lay the
            groundwork for the definition of a data and system
            architecture to support corpus annotation and exploitation
            that can be widely adopted within the community. Among the
            issues to be addressed are:

               o layered data architectures
               o system architectures for distributed databases
               o support for plurality of annotation schemes
               o impact and use of XML/XSL
               o support for multimedia, including speech and video
               o tools for creation, annotation, query and access of corpora
               o mechanisms for linkage of annotation and primary data
               o applicability of semi-structured data models, search and query
                 systems, etc.
               o evaluation/validation of systems and annotations

    ----------------------------------------------------------------------------

    Submissions

    Papers should be submitted in electronic form (preferably PostScript,
    but plain ASCII, MS Word RTF, or HTML are acceptable) to
    ide@cs.vassar.edu by March 7, 2000. Please use the subject line
    "LREC WORKSHOP SUBMISSION: <authors' last names>", for example
    "LREC WORKSHOP SUBMISSION: SMITH, JONES".

    Organizers

           Nancy Ide (contact)
           Department of Computer Science
           Vassar College
           Poughkeepsie, New York 12604-0520 USA
           Tel : +1 914 437 5988
           Fax : +1 914 437 7498
           ide@vassar.edu

           Henry S. Thompson
           Human Communication Research Centre
           2 Buccleuch Place
           Edinburgh EH8 9LW
           SCOTLAND
           Tel : +44 (131) 650 4440
           Fax : +44 (131) 650 4587
           ht@cogsci.ed.ac.uk

    Program Committee

           Steven Bird, Linguistic Data Consortium
           Patrice Bonhomme, LORIA/CNRS
           Roy Byrd, IBM Corporation
           Jean Carletta, HCRC Edinburgh
           Ulrich Heid, IMS Stuttgart
           Hamish Cunningham, Sheffield
           David Day, Mitre Corporation
           Robert Gaizauskas, Sheffield
           Ralph Grishman, New York University
           Nancy Ide, Vassar College (Chair)
           Masato Ishizaki, JAIST
           Dan Jurafsky, University of Colorado at Boulder
           Tony McEnery, Lancaster
           David McKelvie, HCRC Edinburgh
           Laurent Romary, LORIA/CNRS
           Gary Simons, Summer Institute of Linguistics
           Henry Thompson, HCRC Edinburgh
           Yorick Wilks, Sheffield
           Peter Wittenburg, Max Planck Institute
           Remi Zajac, New Mexico State University


