Re: [Corpora-List] text categorisation - newspaper

From: Jose Maria Gomez Hidalgo (jmgomez@dinar.esi.uem.es)
Date: Mon Jun 16 2003 - 10:55:04 MET DST

    At 09:48 16/06/2003 +0100, Silvia Bernardini wrote:
    >Dear all,
    >
    >We are about to start the categorization of a corpus of Italian newspaper
    >text into a set of broad topics (sports, internal affairs, arts, business,
    >etc). We plan to follow a standard supervised machine learning approach,
    >tagging a subset of the corpus manually, and following the usual
    >train/test/classify cycle.
    >
    >We would like to find information about other projects concerning the
    >categorization of newspaper text -- in particular, we are interested in
    >the topic sets that have been used in similar projects. For example, if
    >somebody has the list of topics used in the AP text cat collection, and
    >could send us a copy, that would be extremely useful.

    One European news categorization project was NAMIC
    (http://www.dcs.shef.ac.uk/nlp/namic/).

    Text categorization test collections relevant to your problem (both in
    English) are:
    * Reuters-21578
      (http://www.daviddlewis.com/resources/testcollections/reuters21578/)
    * Reuters Corpus, Volume 1
      (http://about.reuters.com/researchandstandards/corpus/) (use this one; it
      is much bigger and more challenging).
    You can take topic sets from either of them.

    You can also use the sections of the newspapers themselves as topics.

    For information on text categorization (TC) and resources for Italian,
    contact the Istituto di Linguistica Computazionale - Consiglio Nazionale
    delle Ricerche (http://www.ilc.cnr.it/indexflash.html) and Fabrizio
    Sebastiani (http://faure.iei.pi.cnr.it/~fabrizio/), of the Istituto di
    Scienza e Tecnologia dell'Informazione - Consiglio Nazionale delle Ricerche
    (http://www.iei.pi.cnr.it/).

    >Also, some of our prospective users are interested in a categorization
    >scheme that goes beyond topics, further categorizing documents across
    >topics into a small set of genres such as *comments* and *news*. This
    >seems to be a harder task, and we would be interested in work that pursued
    >similar issues.
    >
    >More in general, we would be grateful for any sort of advice/information
    >that seems relevant (e.g., pointers to other text cat work on Italian,
    >etc.)
    >
    >Thanks a lot!
    >
    >Silvia Bernardini, Marco Baroni & Alessandra Volpi
    >SSLMIT, University of Bologna at Forli'
    >Italy

    _______________________________________________________________________________

    Jose Maria Gomez Hidalgo
    Departamento de Inteligencia Artificial
    Universidad Europea de Madrid
    28670 - Villaviciosa de Odon - MADRID
    (+34) 912115670
    jmgomez@dinar.esi.uem.es
    http://www.esi.uem.es/~jmgomez/
    _______________________________________________________________________________

    Spanish law guarantees privacy in electronic communications. This
    electronic transmission is strictly confidential and intended solely for
    the addressee. If you are not the intended addressee, you are kindly
    requested not to disclose nor to copy this transmission and to notify us as
    soon as possible.


