Corpora: LREC 2002 Workshop on LINGUISTIC KNOWLEDGE ACQUISITION AND REPRESENTATION

From: Alessandro Lenci (lenci@ilc.pi.cnr.it)
Date: Mon Jan 28 2002 - 23:02:13 MET

                                  LREC 2002 Workshop on

                     LINGUISTIC KNOWLEDGE ACQUISITION AND REPRESENTATION:
                            BOOTSTRAPPING ANNOTATED LANGUAGE DATA

                            Las Palmas, Canary Islands, Spain

                                      2nd June 2002

                              _____________________________

    MOTIVATION AND AIMS

    Provision of large-scale labelled language resources, such as tagged
    corpora or repositories of pre-classified text documents, is key
    to steady progress in an extremely wide spectrum of research, technological
    and business areas in the HLT sector. The continuously changing demands for
    language-specific and application-dependent annotated data (e.g. at the
    syntactic or at the semantic level), indispensable for design validation
    and efficient software prototyping, however, are daily confronted by the
    labelled-data bottleneck. Hand-crafted resources are often too costly and
    time-consuming to be produced at a sustainable pace, and, in some cases,
    they even exceed the limits of human conscious awareness and descriptive
    capability.

    Possible ways to circumvent, or at least minimise, this problem come from
    the literature on automatic knowledge acquisition and, more generally, from
    the machine-learning community. Annotated data are bootstrapped by training
    a machine-learning classifier with a small sample of pre-annotated data and
    by using the induced classifier to annotate more data. Co-learning provides
    an alternative methodology, which essentially consists in iterative
    cooperation of two or more independent learning systems. Another promising
    route consists in automatically tracking down recurrent knowledge patterns
    in unstructured or implicit information sources (such as free texts or
    machine readable dictionaries) for this information to be moulded into
    explicit representation structures (e.g. subcategorisation frames,
    syntactic-semantic templates, ontology hierarchies etc.).
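
    The bootstrapping loop described above (train on a small pre-annotated
    sample, then use the induced classifier to annotate more data) can be
    sketched as follows. This is only an illustrative sketch under stated
    assumptions: the nearest-centroid classifier, the margin-based confidence
    test, and the names train, predict, bootstrap, min_margin and max_rounds
    are all hypothetical choices made for the example, not a method put
    forward by the workshop.

```python
from collections import defaultdict


def train(labelled):
    """Induce a toy classifier: one centroid per label.

    `labelled` is a list of (feature_vector, label) pairs. A real system
    would use a proper machine-learning classifier here (assumption).
    """
    sums, counts = {}, defaultdict(int)
    for x, y in labelled:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return (label, margin) for feature vector `x`.

    The margin is the distance gap between the nearest centroid and the
    runner-up; it serves as a crude confidence score.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(c, x)) ** 0.5, y)
        for y, c in centroids.items()
    )
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else float("inf")
    return best_label, margin


def bootstrap(seed, unlabelled, min_margin=1.0, max_rounds=10):
    """Grow the annotated set from a small seed.

    Each round retrains on everything labelled so far, annotates the
    remaining pool, and keeps only the confident predictions. The loop
    stops when no new item passes the confidence threshold.
    """
    labelled = list(seed)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = train(labelled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict(model, x)
            (confident if margin >= min_margin else rest).append((x, y))
        if not confident:
            break
        labelled.extend(confident)
        pool = [x for x, _ in rest]
    return train(labelled), labelled, pool
```

    The confidence threshold is the main design choice in such a loop: set
    too low, early labelling errors are fed back into training; set too
    high, little new data is annotated and ambiguous items stay in the pool.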

    We believe that all these attempts at bootstrapping labelled data are not
    only of practical interest (for continuous updating, management and
    validation of dynamic resources), but also point to a number of related
    theoretical issues. In particular, the workshop intends to focus on the
    issue of interaction between techniques for inducing structured knowledge
    from raw data and formal methods of linguistic knowledge representation.
    Gaining insights into this issue is an essential requirement for explaining
    the effective use of linguistic knowledge by cognitive agents. Although the
    cognitive and engineering views of the form and acquisition of linguistic
    knowledge need not be related, data from neuroscience and psychology are
    indeed relevant when evaluating different ways of representing information
    in artificial systems, and different models for linguistic knowledge
    acquisition.

    We encourage in-depth analysis of underlying assumptions of the proposed
    bootstrapping methods and discussion of possible relevant connections with
    existing annotation and representation schemes. This investigation is
    likely to have significant repercussions on the way linguistic resources
    will be designed, developed and used for applications in the years to come.
    As the two aspects of knowledge representation and acquisition are
    profoundly interrelated, progress on both fronts can only be achieved, in
    our view, through a full appreciation of this deep interdependency.

    TOPICS OF INTEREST

    Possible themes for contributions are:
    * development of 'data-driven' annotation/representation schemes
    * dynamic update, customisation and tuning of labelled resources through
    acquired data
    * 'hybrid models' of linguistic knowledge extraction, whereby machine
    learning methods are integrated with formal structures of knowledge
    representation
    * incremental linguistic knowledge-bases
    * formal representation and structuring of information flow automatically
    acquired from texts
    * knowledge acquisition and linguistic resources lifecycle
    * linguistic knowledge acquisition and representation in cognitive tasks

    IMPORTANT DATES

    Deadline for workshop abstract submission:
    15th of February 2002

    Notification of acceptance:
    15th of March 2002

    Final version of paper for workshop proceedings:
    15th of April 2002

    Workshop:
    2nd June 2002 (afternoon session)

    SUBMISSIONS

    The organisers welcome contributions describing existing research related
    to the topics of the workshop. Each presentation will be 25 minutes long
    (20 minutes for presentation and 5 minutes for questions and discussion).
    Submissions should include: title; author(s); affiliation(s); and contact
    author's e-mail address, postal address, telephone and fax numbers.
    Abstracts (maximum 500 words, plain-text format) must be sent to:
    simo@ilc.pi.cnr.it

    The final version of the accepted papers should not be longer than 4,000
    words or 10 A4 pages. Instructions for formatting and presentation of the
    final version will be sent to authors upon notification of acceptance.

    ORGANISING COMMITTEE

    Alessandro Lenci (Università di Pisa, Italy)
    Simonetta Montemagni (Istituto di Linguistica Computazionale - CNR, Italy)
    Vito Pirrelli (Istituto di Linguistica Computazionale - CNR, Italy)

    PROGRAM COMMITTEE

    Harald Baayen (Max Planck Institute for Psycholinguistics - Nijmegen, The
    Netherlands)
    Rens Bod (University of Amsterdam, The Netherlands)
    Michael R. Brent (Washington University, USA)
    Nicoletta Calzolari (Istituto di Linguistica Computazionale - CNR, Italy)
    Jean-Pierre Chanod (Xerox Research Centre Europe, Grenoble, France)
    Walter Daelemans (University of Antwerp, Belgium)
    Dekang Lin (University of Alberta, Edmonton, Canada)
    Horacio Rodriguez (Universidad Politecnica de Catalunya, Spain)
    Fabrizio Sebastiani (Istituto per l'Elaborazione dell'Informazione - CNR,
    Italy)
    Lucy Vanderwende (Microsoft Research, Redmond, USA)
    François Yvon (Ecole Nationale Superieure des Telecommunications, Paris,
    France)
    Menno van Zaanen (University of Amsterdam, The Netherlands)

    CONTACT PERSON

    Simonetta Montemagni
    Istituto di Linguistica Computazionale (ILC) - CNR
    Area della Ricerca di Pisa
    Via Moruzzi 1, 56124 Pisa, ITALY
    e-mail: simo@ilc.pi.cnr.it


