European Corpus Initiative corpora available on CD-ROM:

ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words
ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs to a
national publisher, in existence for 40 years. Different writing styles,
500,000 words.
ECI2/MUL04/MSP04A-J: Telecommunication user manual, several hundred thousand words
ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words
ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words
ECI4/SPA03: El Diario Vasco, newspaper. Minimally processed by ECI; contains
errors and duplication, but the CLEAN and FC files are clean.
CLEAN files: news, few errors, 300,000 words
FC files: 177,000 words

ftp /pub/corpus/argentina 2 million words
/pub/corpus/chile 2 million words

Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:

The CRATER Project: ITU corpus, currently being postedited. The trilingual
(French/English/Spanish) corpus has more than 3 million words and is the
so-called "White Book on Telecommunications" released by the International
Telecommunication Union. Fernando et al. are working with a 1-million-word
subcorpus, which will also be postedited. This corpus, along with the tagger
developed for its tagging and all the resources associated with the tagger,
will be in the public domain in October 1995. There is a lexicon with more
than 35,000 words (full forms, not lemmas), part-of-speech annotated, that can
be used as a starting point in lexicon-building tasks.

The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. More than 4 million
words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles available on CD-ROM. Charles
Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley

The EU MULTEXT Project is collecting a corpus which will contain parallel texts
from the European Parliament and financial newspaper articles (Spanish
from Expansion newspaper). Still finalizing licence agreements for these data.

The RELATOR language resources server supports distribution of NLP resources.
Currently available through RELATOR: speech and text corpora, lexicons, NLP
programs and tools, and related databases and systems.

Multilingual Web pages: (XX=two-letter
country codes of the EU countries such as de, uk, etc.) Only speech material.

The Briscoe et al. paper reports a 17,000-word tagged corpus. (This is all the
info I have on this paper.)

ftp://
Spanish tagger, implemented in Common Lisp. Comes with documentation and works
very well. If you need to install Common Lisp to run it, several good free
implementations are available at

Sept. 20-23, 1995: National Conference on AI, Morelos, Mexico
Sept. 25-29, 1995: 11 Congreso de Lenguajes Naturales y Lenguajes Formales, Tortosa, Spain