Re: Corpora: Wanted: corpora for different domains

Dan Melamed (melamed@unagi.cis.upenn.edu)
Thu, 26 Feb 1998 13:18:49 -0500 (EST)

>
> Hello everybody,
>
> I am looking for some training material for automatic categorization
> of HTML-documents. Therefore, I am especially interested in the
> following subject matter fields:
>
> - Soccer
[...]
>
> The size of each of the corpora should be at least 1,000,000 characters
> to obtain reasonable results. The categorizer should work for English,
> French and German documents, so I am looking for material in all
> three languages, not necessarily HTML-documents!

This previous post to Corpora may get you started, although these
resources are not as large as you want:

Dan

------> README.gz <------
Date: Mon, 18 Aug 1997 10:17:32 +0100 (BST)
Reply-To: Raphael Salkie <R.M.Salkie@bton.ac.uk>
From: Raphael Salkie <R.M.Salkie@bton.ac.uk>
To: CORPORA@hd.uib.no
Subject: Corpora: Football text in 4 languages

The laws of football (soccer) can be downloaded in English, German, French
and Spanish from the FIFA website:

http://www.fifa2.com/cgi-win/runwin.exe?M2:MREnterSub::67174

The English version is about 10,000 words.

- Raphael Salkie.