[Corpora-List] text categorisation - newspaper

From: Silvia Bernardini (silvia@sslmit.unibo.it)
Date: Mon Jun 16 2003 - 10:48:21 MET DST

    Dear all,

    We are about to start the categorization of a corpus of Italian newspaper
    text into a set of broad topics (sports, internal affairs, arts, business,
    etc). We plan to follow a standard supervised machine learning approach,
    tagging a subset of the corpus manually, and following the usual
    train/test/classify cycle.
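
    As an illustration only, the train/test/classify cycle mentioned above
    could be sketched as follows. The sketch assumes a bag-of-words
    multinomial Naive Bayes classifier with add-one smoothing, which is a
    stand-in and not a learner named in this message; the toy English
    sentences, the two topic labels, and the function names (train,
    classify, accuracy) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercased whitespace tokens are enough for a toy example.
    return text.lower().split()


def train(labelled_docs):
    """Collect per-topic document and word counts from (text, topic) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, topic in labelled_docs:
        doc_counts[topic] += 1
        for tok in tokenize(text):
            word_counts[topic][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the topic with the highest log posterior under Naive Bayes."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in sorted(doc_counts):
        score = math.log(doc_counts[topic] / total_docs)  # log prior
        denom = sum(word_counts[topic].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[topic][tok] + 1) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


def accuracy(model, labelled_docs):
    """Fraction of held-out documents whose predicted topic is correct."""
    hits = sum(classify(model, text) == topic for text, topic in labelled_docs)
    return hits / len(labelled_docs)


# Hypothetical, manually tagged subset, split into training and test parts.
train_set = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the team", "sports"),
    ("the bank raised interest rates", "business"),
    ("shares fell on the stock market", "business"),
    ("the company reported record profits", "business"),
]
test_set = [
    ("the team scored a goal", "sports"),
    ("the company shares fell", "business"),
]

model = train(train_set)                    # train
held_out_accuracy = accuracy(model, test_set)  # test
new_label = classify(model, "record profits on the market")  # classify
```

    On a real corpus the held-out accuracy would decide whether the topic
    set and the features need revising before the remaining, untagged
    documents are classified.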

    We would like to find information about other projects concerning the
    categorization of newspaper text -- in particular, we are interested in
    the topic sets that have been used in similar projects. For example, if
    somebody has the list of topics used in the AP text cat collection, and
    could send us a copy, that would be extremely useful.

    Also, some of our prospective users are interested in a categorization
    scheme that goes beyond topics, further classifying documents, across
    topics, into a small set of genres such as *comments* and *news*. This
    seems to be a harder task, and we would be interested in any work that
    has addressed similar issues.

    More generally, we would be grateful for any advice or information
    that seems relevant (e.g., pointers to other text categorization work
    on Italian).

    Thanks a lot!

    Silvia Bernardini, Marco Baroni & Alessandra Volpi
    SSLMIT, University of Bologna at Forlì
    Italy


