Re: [Corpora-List] text categorisation - newspaper

From: Luisa Bentivogli (bentivo@itc.it)
Date: Mon Jun 23 2003 - 16:27:34 MET DST


    Silvia Bernardini wrote:

    > We would like to find information about other projects concerning the
    > categorization of newspaper text -- in particular, we are interested in
    > the topic sets that have been used in similar projects. For example, if
    > somebody has the list of topics used in the AP text cat collection, and
    > could send us a copy, that would be extremely useful.

    Here at ITC-irst we are creating the MEANING Italian Corpus (MIC), a
    150-million-word corpus of contemporary written Italian intended to
    support domain-based Word Sense Disambiguation. The MIC is composed of
    newspaper articles, press agency news, and web documents. Its novelty is
    that domain representativeness is the fundamental criterion for text
    selection.

    The topic set used is that of WordNet-Domains (WN-Domains), an extension of
    WordNet 1.6 in which each synset has been annotated with at least one domain
    label, selected from a set of 164 hierarchically organized labels. WN-Domains
    is currently used within the Natural Language Processing community for
    different tasks, such as word sense disambiguation and text categorization.
    The WN-Domains hierarchy was created starting from the subject field codes
    used by current dictionaries and from the Dewey Decimal Classification system
    (DDC), which is the most widely used library classification system in the
    world and provides a very large and complete set of hierarchically structured
    domain labels.

    A core set of 42 basic domains (the second level of the WN-Domains hierarchy)
    has been chosen to be represented in the MIC. The list of domains can be
    found at
    http://tcc.itc.it/research/textec/topics/acquisition-resources/WN-DOMAINS.txt

    while more information about WN-Domains is available at
    http://wndomains.itc.it/
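
    To illustrate what "the second level of the hierarchy" means in practice,
    here is a minimal Python sketch that maps any label to its second-level
    ancestor. The parent table is a small toy fragment written for this
    example: the labels and their placement are assumptions, not a copy of the
    WN-Domains file, and the function names are hypothetical.

```python
# Toy fragment of a domain hierarchy: child label -> parent label.
# Labels and their placement are illustrative assumptions only;
# they are NOT taken from the actual WN-Domains distribution.
PARENT = {
    "free_time": None,          # top-level node
    "sport": "free_time",       # second level ("basic domain")
    "football": "sport",
    "tennis": "sport",
    "applied_science": None,    # top-level node
    "medicine": "applied_science",
    "surgery": "medicine",
}


def path_to_root(label):
    """Return the chain of labels from the top-level node down to `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path[::-1]


def basic_domain(label, level=2):
    """Return the ancestor of `label` at the given depth (1 = top level).

    Returns None when `label` sits above the requested level.
    """
    path = path_to_root(label)
    return path[level - 1] if len(path) >= level else None


if __name__ == "__main__":
    for lab in ("football", "surgery", "sport", "free_time"):
        print(lab, "->", basic_domain(lab))
```

    With a table like this, fine-grained labels collapse onto a small fixed
    set of second-level domains, while top-level labels have no second-level
    counterpart and return None.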

    You might also be interested in the NERC report (see EAGLES Recommendations
    on Text Typology at
    http://www.ilc.cnr.it/EAGLES96/texttyp/node37.html),
    which offers a summary of the classification systems used by major corpus
    projects in Europe. The MIC is in line with the European trend in corpus
    practice, since most of the commonly used topics reported in that document
    correspond to our basic domains.

    All the best,

    Luisa Bentivogli

    --
    Luisa Bentivogli -  bentivo@itc.it
    Centro per la Ricerca Scientifica e Tecnologica
    Via Sommarive, 18  38050 Povo - Trento ITALY
    Tel: +39-0461-314-574  Fax: +39-0461-302-040
    http://tcc.itc.it/people/bentivogli.html
    


