Spanish resources

Alice Carlberger (
Tue, 04 Jul 1995 11:26:31 +0200

Here is the material again. If you have any questions, please let me know.

- Alice


European Corpus Initiative corpora available on CD-ROM:

ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words
ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs to
    national publisher, in existence for 40 years. Different writing
    styles, 500,000 words.
ECI2/MUL04/MSP04A-J: Telecommunication user manual, several 100,000 words
ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words
ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words
ECI4/SPA03: Minimally processed by ECI; contains errors and duplication,
    but the CLEAN and FC files are clean.
    El Diario Vasco, newspaper
    CLEAN files, news, few errors, 300,000 words
    FC files, 177,000 words

ftp /pub/corpus/argentina 2 million words
/pub/corpus/chile 2 million words

Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:

The CRATER Project: ITU corpus in the process of postediting. The
trilingual (French/English/Spanish) corpus has more than 3 million words
and is the so-called "White Book on Telecommunications" released by the
International Telecommunications Union. Fernando et al. are working with
a 1-million-word subcorpus, which will also be postedited. This corpus,
along with the tagger developed for its tagging and all the resources
associated with the tagger, will be in the public domain in October 1995.
There is a lexicon with more than 35,000 words (full forms, not lemmas),
part-of-speech annotated, that can be used as a starting point in
lexicon-building tasks.

The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. More than
4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles available on CD-ROM.
Charles Faulhaber, Dept. of Spanish & Portuguese, U of California,
Berkeley

The EU MULTEXT Project is collecting a corpus which will contain parallel
texts from the European Parliament and financial newspaper articles
(Spanish from Expansion newspaper). Still finalizing licence agreements
for these d[...]

The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR: speech and text corpora,
lexicons, NLP programs and tools, and related databases and systems.

Multilingual Web pages: (XX = two-letter country codes of the EU
countries such as de, uk, etc.) Only speech mater[...]

Briscoe et al. paper reports a 17,000-word tagged corpus. (This is all
the info I have on this paper.)

ftp ://
Spanish tagger, implemented in Common Lisp. Comes with documentation,
works very well. If you need to install Common Lisp to run it, there are
several good free implementations at

Sept. 20-23 1995 Nat'l Conference on AI, Morelos, Mexico
Sept. 25-29 1995 11 Congreso de Lenguajes Naturales y Lenguajes Formales,
    Tortosa, Spain