Re: Spanish corpora resources

Alice Carlberger
04 Jul 1995 15:23:52 +0200

I received some messages about my list of Spanish corpora
resources not being retrievable, so here it is again. If you have any
questions, please let me know.

- Alice

European Corpus Initiative corpora available on CD-ROM:

ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words

ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs
to national publisher, in existence for 40 years. Different writing
styles, 500,000 words.

ECI2/MUL04/MSP04A-J: Telecommunication user manual, several 100,000
words.

ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words.

ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words.

ECI4/SPA03: Minimally processed by ECI; contains errors and
duplication but the CLEAN and FC files seem to be clean.

El Diario Vasco, newspaper
CLEAN files, news, few errors, 300,000 words
FC files, 177,000 words

ftp /pub/corpus/argentina 2 million words
/pub/corpus/chile 2 million words

Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:
The CRATER Project: ITU corpus in the process of postediting.
Trilingual (French/English/Spanish) corpus has more than 3 million
words and is the so-called "White Book on Telecommunications"
released by the International Telecommunications Union. Fernando et al
are working with a 1-million word subcorpus, which will also be
postedited. This corpus, along with the tagger developed for its
tagging and all the resources associated with the tagger
will be in the public domain in October 1995. There is a lexicon with
+35,000 words (full forms, not lemmas), part-of-speech annotated, that
can be used as a starting point in lexicon-building tasks.

The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. +4 million
words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles available on
CD-ROM. Charles Faulhaber, Dept. of Spanish & Portuguese, U of
California, Berkeley.

The EU MULTEXT Project is collecting a corpus which will contain
parallel texts from the European Parliament and financial newspaper
articles (Spanish from Expansion newspaper). Still finalizing licence
agreements for these data.

The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR are speech and text
corpora, lexicons, NLP programs and tools, and related databases and

Multilingual Web pages:
(XX=two-letter country codes of the EU countries such as de, uk,
etc.) Only speech materials.

Briscoe et al paper reports a 17,000-word tagged corpus. (This is all
the info I have on this paper.)

ftp ://
Spanish tagger, implemented in Common Lisp. Comes with documentation,
works very well. If you need to install Common Lisp to run it, several
good free implementations at

Sept. 20-23 1995 Nat'l Conference on AI, Morelos, Mexico
Sept. 25-29 1995 11 Congreso de Lenguajes Naturales y Lenguajes
Formales, Tortosa, Spain