duplicates in ECI's German corpus

Helmut Feldweg (feldweg@sfs.nphil.uni-tuebingen.de)
Fri, 1 Sep 1995 09:49:53 +0200 (MET DST)

The following might be of interest to users of the
Frankfurter Rundschau Corpus:

A number of duplicate articles (approx. 3%), which might distort
quantitative analyses, were found in the Frankfurter Rundschau
Corpus contained on the Multilingual Corpus 1 CD-ROM of the European
Corpus Initiative.

A list of the duplicate articles and a Unix shell script to remove
the duplicates from the corpus are available from:

http://www.sfs.nphil.uni-tuebingen.de/~feldweg/fr-dups.html
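
For readers who want to see the general idea of duplicate removal, the
following is a minimal Python sketch. It is not the shell script mentioned
above, and it makes assumptions that the announcement does not confirm: that
the articles have already been split into separate strings, and that
duplicates are identical apart from whitespace. The names normalize and
remove_duplicates are hypothetical.

```python
import hashlib


def normalize(text):
    """Collapse runs of whitespace so reformatted copies compare equal."""
    return " ".join(text.split())


def remove_duplicates(articles):
    """Keep the first occurrence of each article.

    Returns (kept, dropped): the retained articles in their original
    order, and the indices of the later repeats that were discarded.
    Assumption: a duplicate is an article whose normalized text is
    identical to that of an earlier article.
    """
    seen = set()
    kept = []
    dropped = []
    for index, article in enumerate(articles):
        digest = hashlib.sha1(normalize(article).encode("utf-8")).hexdigest()
        if digest in seen:
            dropped.append(index)
        else:
            seen.add(digest)
            kept.append(article)
    return kept, dropped
```

Hashing the normalized text keeps memory use low on a large corpus, since
only the digests are stored for comparison. Near-duplicates with edited
wording would not be caught by this approach.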

-- Helmut Feldweg
------------------------------------------------------------------------
Seminar für Sprachwissenschaft, Universität Tübingen
Wilhelmstr. 113, D-72074 Tübingen, Germany
Tel: +49 7071 294279
Fax: +49 7071 550520
E-mail: Helmut.Feldweg@uni-tuebingen.de
feldweg@sfs.nphil.uni-tuebingen.de
------------------------------------------------------------------------