Corpora: Wanted: corpora for different domains

Noemi Preissner (noemi@CoLi.Uni-SB.DE)
Wed, 25 Feb 1998 14:32:51 +0100 (MET)

Hello everybody,

I am looking for training material for automatic categorization
of HTML documents. I am especially interested in the following
subject areas:

- Soccer
- Tennis
- Formula 1
- Heart Diseases / Cardiology
- Allergies
- Dentistry

Each corpus should contain at least 1,000,000 characters to obtain
reasonable results. The categorizer should work for English, French
and German documents, so I am looking for material in all three
languages. The material does not have to be in HTML.

Does anybody know of available corpora (or WWW sites ...)? A
summary will be posted.

Noemi Preissner
noemi@coli.uni-sb.de