Corpora: Re: Unsupervised learning and low frequency categories

dunja (dunja.mladenic@ijs.si)
Thu, 11 Mar 1999 11:32:45 +0100

> I would like to know about attempts to build classifiers through
> unsupervised learning, or to integrate other information sources in a
> supervised learning-based classifier. The only one I am aware of is the one
> by Yang and Chute [1].
> I am looking for statistics on use and frequency of categories (how often
> they are used and how many documents are classified into them). I am
> especially interested in statistics that show whether categories with very
> few assigned documents are (frequently) used to browse text databases. In
> our view, even if these low frequency categories are difficult to train,
> the fact that they exist as categories means that they are necessary, and
> therefore, they cannot be ignored. We would like to test this intuition
> empirically.

We're using supervised learning that incorporates phrases
(word sequences), training a set of classifiers,
one for each of the Yahoo categories (including the small ones).
The result is automatic document categorization based
on an existing document hierarchy.
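
To make the setup concrete, here is a minimal sketch of the general idea
(one binary classifier per category, with word sequences as features).
It is not the code from the thesis: the naive Bayes learner, the limit
of two words per phrase, the toy documents, and every function and
class name below are assumptions made only for illustration.

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Return single words and word sequences (up to max_n words) in text."""
    words = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.append(" ".join(words[i:i + n]))
    return feats


class BinaryNB:
    """Naive Bayes for one category: 'in the category' vs. 'not in it'.

    The learner is an assumption; the message does not name one.
    """

    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.ndocs = Counter(labels)
        for doc, label in zip(docs, labels):
            self.counts[label].update(ngrams(doc))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def score(self, doc):
        """Log-odds that doc belongs to the category (> 0 means 'yes')."""
        total_docs = sum(self.ndocs.values())
        logp = {}
        for label in (True, False):
            # Laplace smoothing on both the prior and the feature likelihoods.
            logp[label] = math.log((self.ndocs[label] + 1) / (total_docs + 2))
            total = sum(self.counts[label].values())
            for feat in ngrams(doc):
                if feat in self.vocab:
                    logp[label] += math.log(
                        (self.counts[label][feat] + 1) / (total + len(self.vocab))
                    )
        return logp[True] - logp[False]


def train_per_category(docs, doc_categories):
    """Train one binary classifier per category, however small it is."""
    return {
        cat: BinaryNB().fit(docs, [c == cat for c in doc_categories])
        for cat in sorted(set(doc_categories))
    }


def categorize(classifiers, doc):
    """Assign doc to the category whose classifier scores it highest."""
    return max(classifiers, key=lambda cat: classifiers[cat].score(doc))


# Toy data (invented): "Business" is a low-frequency category with one document.
docs = [
    "machine learning text classification",
    "neural network machine learning",
    "football match result",
    "football league table",
    "stock market report",
]
doc_categories = ["Computers", "Computers", "Sports", "Sports", "Business"]

classifiers = train_per_category(docs, doc_categories)
print(categorize(classifiers, "machine learning"))
print(categorize(classifiers, "stock market"))
```

Each category keeps its own classifier regardless of how few documents
it has, so a small category is still a possible answer at prediction
time.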

You might want to check my PhD thesis:
http://www.cs.cmu.edu/~TextLearning/pww/PhD.html
our project page: http://www-ai.ijs.si/DunjaMladenic/pww.html
and a demo of automatic document categorization:
http://www-ai.ijs.si/DunjaMladenic/yplanet.html

Regards,
Dunja Mladenic

http://www-ai.ijs.si/DunjaMladenic