Corpora: SUMMARY: corpora for different domains

Noemi Preissner (noemi@CoLi.Uni-SB.DE)
Tue, 24 Mar 1998 19:18:24 +0100 (MET)

Hi,

Some weeks ago, I posted the following to this mailing list:

> Hello everybody,
>
> I am looking for some training material for automatic categorization
> of HTML-documents. Therefore, I am especially interested in the
> following subject matter fields:
>
> - Soccer
> - Tennis
> - Formula 1
> - Heart Diseases / Cardiology
> - Allergies
> - Dentistry
>
> The size of each of the corpora should be at least 1,000,000 characters
> to obtain reasonable results. The categorizer should work for English,
> French and German documents, so I am looking for material in all
> three languages, not necessarily HTML-documents!
>
> Does anybody know about available corpora (or WWW-sites ... )? A
> summary will be posted.
>
> Noemi Preissner
> noemi@coli.uni-sb.de

Here is the promised summary:

Marc Weeber suggested using CD-ROM systems with abstracts of
scientific papers, e.g. MEDLINE for heart disease (and maybe dentistry).

Eric Ringger mentioned a web site about European club soccer:
http://www.z-axis.com/uefa/
(We had some problems with the permissions on this web site, though.)

Ted Dunning suggested using Infoseek's directory of web pages and
then feeding a search engine with phrases from these pages. The
Infoseek categorization turned out to be REALLY useful for our
purposes.
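
For anyone who wants to try the phrase-based approach, here is a
minimal sketch of the first step: picking frequent word pairs from a
handful of category pages so they can be typed into a search engine.
This is only an illustration, not what Ted or we actually ran; the
function name top_phrases, its parameters and the sample sentences are
made up for the example, and the pages are assumed to be already
converted to plain text.

```python
import re
from collections import Counter

def top_phrases(texts, n=2, k=5, min_len=3):
    """Return the k most frequent n-word phrases found in texts.

    Hypothetical helper: phrases containing a word shorter than
    min_len characters are skipped as a crude stop-word filter.
    """
    counts = Counter()
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        # Slide a window of n words over the token list.
        for i in range(len(words) - n + 1):
            phrase = tuple(words[i:i + n])
            if all(len(w) >= min_len for w in phrase):
                counts[phrase] += 1
    return [" ".join(p) for p, _ in counts.most_common(k)]

# Made-up sample text standing in for pages from a soccer category.
pages = [
    "The penalty kick was saved. Another penalty kick followed.",
    "A penalty kick decided the match.",
]
queries = top_phrases(pages, n=2, k=1)
print(queries)
```

The resulting phrases would then be used as queries, and the pages
returned for them collected as additional material for that category.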

Dan Melamed mentioned a former posting to the mailing list in which
the URL for the laws of soccer in English, German, French and
Spanish was given:
http://www.fifa2.com/cgi-win/runwin.exe?M2:MREnterSub::67174
Unfortunately, this link seems to be outdated!

Finally, Valerie Mapelli suggested having a look at the ELRA
catalogue: http://www.icp.grenet.fr/ELRA/home.html

Thanks a lot for all the help. We now have enough data for English
but still lack material for French and German, so further hints
are welcome!

Noemi