[Fwd: [Corpora-List] text categorisation - newspaper] (fwd)

From: Carl Lewis Sable (sable@cs.columbia.edu)
Date: Mon Jun 16 2003 - 17:37:27 MET DST

    Hi,

    A friend of mine forwarded your message below. You will likely be
    interested in our Newsblaster project, which is available on the web at
    http://newsblaster.cs.columbia.edu. Every night, Newsblaster
    automatically crawls many popular news sites in search of what it thinks
    are news articles. It automatically clusters articles into groups such
    that every article within a single group is thought to discuss the same
    event. A summary is automatically generated for each event. Also, and I
    think this directly relates to what you ask for below, each cluster of
    news articles is automatically categorized into one of the categories
    "U.S. News", "World News", "Entertainment", "Sports", "Finance", or
    "Sci/Tech". This was my part of the project; we use an approach I call
    BINS, which can be thought of as a generalization of Naive Bayes that
    computes word weights for groups of words sharing common statistical
    features (as opposed to individual words, as in regular Naive Bayes). Our
    accuracy is very high, I believe over 90% and maybe as high as 95%.
    See for yourself!
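
    To illustrate the binning idea, here is a minimal sketch. It is not the
    BINS implementation used in Newsblaster: the binning feature (a log2
    bucket of a word's training count in a category), the smoothing, and all
    function names are assumptions chosen for illustration. The only point
    it shows is that every word in a bin shares one weight, instead of each
    word getting its own weight as in regular Naive Bayes.

```python
import math
from collections import Counter, defaultdict

def bin_of(count):
    # Hypothetical binning feature: log2 bucket of a word's count in one category.
    return int(math.log2(count + 1))

def train(docs, labels):
    # docs: list of token lists; labels: parallel list of category names.
    counts = defaultdict(Counter)              # category -> word -> count
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = {w for c in counts.values() for w in c}
    weights = {}
    for cat, word_counts in counts.items():
        total = sum(word_counts.values())
        bin_tokens, bin_types = Counter(), Counter()
        for w in vocab:
            b = bin_of(word_counts[w])
            bin_tokens[b] += word_counts[w]    # tokens pooled over the bin
            bin_types[b] += 1                  # distinct words in the bin
        # One shared log-weight per bin: the bin's smoothed probability mass,
        # divided evenly among the words that fall into it.
        weights[cat] = {
            b: math.log((bin_tokens[b] + 1) / (bin_types[b] * (total + len(bin_types))))
            for b in bin_types
        }
    return {"counts": counts, "vocab": vocab, "weights": weights,
            "priors": Counter(labels), "n_docs": len(labels)}

def classify(tokens, model):
    def score(cat):
        s = math.log(model["priors"][cat] / model["n_docs"])
        for w in tokens:
            if w in model["vocab"]:            # ignore words never seen in training
                s += model["weights"][cat][bin_of(model["counts"][cat][w])]
        return s
    return max(model["weights"], key=score)
```

    Calling train() on tokenized articles with their category labels and
    then classify() on a new token list returns the highest-scoring
    category. Which statistical features define the bins is the real design
    decision, and this sketch makes no claim about the ones we use.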

    In addition to Newsblaster, I also created a corpus that I used for my own
    work, involving news articles with embedded images from a variety of
    Usenet newsgroups, and I have defined several sets of categories. One
    data set that applies to the news articles specifically involves the
    categories "Politics", "Struggle", "Crime", "Disaster" or "Other", defined
    to be mutually exclusive. I hope to soon make this corpus publicly
    available. When this happens, instructions to download the corpus will be
    posted at:

    http://www1.cs.columbia.edu/~sable/research/corpus.html

    -Carl

    ---------- Forwarded message ----------
    Date: Mon, 16 Jun 2003 09:41:46 -0400
    From: David Evans <devans@cs.columbia.edu>
    To: Carl Sable <sable@cs.columbia.edu>
    Subject: [Fwd: [Corpora-List] text categorisation - newspaper]

    hey carl,

       are you interested in getting stuff like this? I'm on the corpora
    list, and thought you might have an interest...

    dave

    -------- Original Message --------
    Subject: [Corpora-List] text categorisation - newspaper
    Date: Mon, 16 Jun 2003 09:48:21 +0100
    From: Silvia Bernardini <silvia@sslmit.unibo.it>
    To: <corpora@uib.no>

    Dear all,

    We are about to start the categorization of a corpus of Italian newspaper
    text into a set of broad topics (sports, internal affairs, arts, business,
    etc). We plan to follow a standard supervised machine learning approach,
    tagging a subset of the corpus manually, and following the usual
    train/test/classify cycle.
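
    As a rough sketch of that cycle, the code below holds out part of the
    hand-tagged subset for testing, trains a plain multinomial Naive Bayes
    model with add-one smoothing on the rest, and measures accuracy before
    the model is applied to untagged text. The choice of classifier, the
    25% test fraction, and all function names are assumptions made for
    illustration only.

```python
import math
import random
from collections import Counter, defaultdict

def split(labeled, test_fraction=0.25, seed=0):
    # Shuffle the hand-tagged (tokens, topic) pairs and hold out a test portion.
    items = list(labeled)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]      # (train, test)

def train_nb(train):
    word_counts = defaultdict(Counter)         # topic -> word -> count
    doc_counts = Counter()                     # topic -> number of documents
    for tokens, topic in train:
        word_counts[topic].update(tokens)
        doc_counts[topic] += 1
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    def score(topic):
        total = sum(word_counts[topic].values())
        s = math.log(doc_counts[topic] / n_docs)
        for w in tokens:
            if w in vocab:                     # skip words unseen in training
                s += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        return s
    return max(doc_counts, key=score)

def accuracy(model, test):
    # Fraction of held-out documents whose predicted topic matches the manual tag.
    return sum(classify(tokens, model) == topic for tokens, topic in test) / len(test)
```

    In use, split() would be applied to the manually tagged documents,
    train_nb() to the training portion, accuracy() to the held-out portion,
    and classify() to each remaining untagged document.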

    We would like to find information about other projects concerning the
    categorization of newspaper text -- in particular, we are interested in
    the topic sets that have been used in similar projects. For example, if
    somebody has the list of topics used in the AP text cat collection, and
    could send us a copy, that would be extremely useful.

    Also, some of our prospective users are interested in a categorization
    scheme that goes beyond topics, further categorizing documents across
    topics into a small set of genres such as *comments* and *news*. This
    seems to be a harder task, and we would be interested in work that pursued
    similar issues.

    More generally, we would be grateful for any sort of advice/information
    that seems relevant (e.g., pointers to other text cat work on Italian,
    etc.)

    Thanks a lot!

    Silvia Bernardini, Marco Baroni & Alessandra Volpi
    SSLMIT, University of Bologna at Forlì
    Italy


