RE: [Corpora-List] text categorisation - newspaper

From: Marina Santini (Inwind) (santinim@inwind.it)
Date: Thu Jun 26 2003 - 12:30:31 MET DST

    Dear Silvia, Marco and Alessandra,

    For my PhD project, I'm working on a categorization scheme
    that "goes beyond topic", namely text genre categorization
    on the Web.

    For my master's project, I worked on the Italian corpus LE-PAROLE,
    and you can find two papers that may be of interest to you:

    Marina Santini, Fattori per i testi, "Italiano e oltre", 2/2003,
    La Nuova Italia, pp. 78-82.

    Marina Santini, Text typology and statistics. Explorations in Italian
    press subgenres, "Italian Journal of Linguistics/Rivista di
    linguistica", Volume 13, numero 2, 2001, pp. 339-374.

    I will be glad to give you any further details.

    Good luck

    Marina Santini
    PhD student at ITRI
    (University of Brighton - UK)
    www.itri.brighton.ac.uk

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Silvia Bernardini
    Sent: 16 June 2003 09:48
    To: corpora@uib.no
    Subject: [Corpora-List] text categorisation - newspaper

    Dear all,

    We are about to start the categorization of a corpus of Italian
    newspaper text into a set of broad topics (sports, internal affairs,
    arts, business, etc). We plan to follow a standard supervised machine
    learning approach, tagging a subset of the corpus manually, and
    following the usual train/test/classify cycle.

    We would like to find information about other projects concerning the
    categorization of newspaper text -- in particular, we are interested in
    the topic sets that have been used in similar projects. For example, if
    somebody has the list of topics used in the AP text cat collection, and
    could send us a copy, that would be extremely useful.

    Also, some of our prospective users are interested in a categorization
    scheme that goes beyond topics, further categorizing documents across
    topics into a small set of genres such as *comments* and *news*. This
    seems to be a harder task, and we would be interested in work that
    pursued similar issues.

    More generally, we would be grateful for any advice or information
    that seems relevant (e.g., pointers to other text cat work on Italian).

    Thanks a lot!

    Silvia Bernardini, Marco Baroni & Alessandra Volpi
    SSLMIT, University of Bologna at Forlì
    Italy


