Corpora: Low frequency categories

Jose Maria Gomez Hidalgo (jmgomez@dinar.esi.uem.es)
Wed, 10 Mar 1999 18:15:24 +0100

I apologize for cross-postings of this message
----------------------------------------------

Dear colleagues,

I am looking for statistics on the use and frequency of categories (how
often they are used and how many documents are classified into them). I am
especially interested in statistics that show whether categories with very
few assigned documents are (frequently) used to browse text databases. In
our view, even if these low-frequency categories are difficult to train,
the fact that they exist as categories means that they are necessary, and
therefore they cannot be ignored. We would like to test this intuition
empirically.

My research group [1] is working on integrating external knowledge into
supervised learning for the problem of text categorization [2]. Our
approach is based on the assumption that, when few example documents are
available for a category, the use of external information can improve the
classification of new documents into the category.

Many thanks in advance

[1] Laboratory of Intelligent Information Access Systems, Universidad
Europea de Madrid, http://www.esi.uem.es/laboratorios/sinai/

[2] Buenaga, M., Gómez Hidalgo, J., Díaz Agudo, B., Using WordNet to
Complement Training Information in Text Categorization, 2nd International
Conference on Recent Advances in Natural Language Processing (RANLP),
Tzigov Chark, Bulgaria, Sept. 1997.

_____________________________________________________________________________

Jose Maria Gomez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid - CEES
28670 - Villaviciosa de Odon - MADRID Tel.: (91) 616 94 00 Ext. 670
e-mail: jmgomez@dinar.esi.uem.es WWW: http://www.esi.uem.es/~jmgomez/
_____________________________________________________________________________