Re: Corpora: Low frequency categories

Ted E. Dunning (ted@aptex.com)
Fri, 12 Mar 1999 15:21:19 -0800 (PST)

I can't release hard data here, but our experience with
commercial-grade text categorization across a wide range of clients is
that large ontologies of categories generally exist more for marketing
reasons than for practical ones.

Whether you are talking about web browsing or message categorization,
the most common categories carry the great majority of the traffic.

Neither of these results is very surprising. The people who build
ontologies have generally built these ontologies to sell. Bigger
sounds better. Thus people build bigger ontologies, often in advance
of finding out if any content ever falls into many of the categories.
In addition, ontology builders have a job because the ontology isn't
big enough yet. If the ontology were already too big, they might
perceive themselves as having less value.

The reason that most documents fall into a relatively small number of
categories is just another application of Zipf's law: there are a
small number of high-traffic categories and a large number of
low-traffic categories.
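
To make the Zipf argument concrete, here is a minimal sketch, not a
measurement. It assumes an idealized Zipf distribution in which the
category at rank r gets traffic proportional to 1/r^s. The function
name, the 1000-category ontology, and the exponent s = 1 are all
illustrative assumptions, not figures from our clients.

```python
def zipf_top_share(num_categories, top_k, exponent=1.0):
    """Fraction of traffic carried by the top_k most popular categories,
    assuming traffic at rank r is proportional to 1 / r**exponent.
    This is an idealized model, not real client data."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_categories + 1)]
    return sum(weights[:top_k]) / sum(weights)


# Hypothetical ontology of 1000 categories with exponent 1:
# share of traffic going to the 100 most popular categories.
share = zipf_top_share(1000, 100)
print(f"top 100 of 1000 categories carry {share:.0%} of traffic")
```

Under these assumptions the top tenth of the categories carries
roughly 69% of the traffic. A larger exponent concentrates traffic
even more heavily in the head of the distribution.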

>>>>> "jmgh" == Jose Maria Gomez Hidalgo <jmgomez@dinar.esi.uem.es> writes:

jmgh> I am looking for statistics on use and frequency of
jmgh> categories (how often they are used and how many documents
jmgh> are classified into them). I am specially interested on
jmgh> statistics that show whether categories with very few
jmgh> assigned documents are (frequently) used to browse text
jmgh> databases. In our view, even if these low frequency
jmgh> categories are difficult to train, the fact that they exist
jmgh> as categories means that they are necessary, and therefore,
jmgh> they cannot be ignored. We would like to test this intuition
jmgh> empirically.