Spanish resources

Alice Carlberger (alice@speech.kth.se)
Tue, 04 Jul 1995 11:26:31 +0200

Here is the material again. If you have any questions, please let me know.

- Alice

------------------------------------------------------------------------------
Spanish

European Corpus Initiative corpora available on CD-ROM:

ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words
ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs to a
    national publisher, in existence for 40 years. Different writing
    styles, 500,000 words.
ECI2/MUL04/MSP04A-J: Telecommunication user manual, several hundred
    thousand words.
ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words
ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words
ECI4/SPA03: El Diario Vasco, newspaper. Minimally processed by ECI;
    contains errors and duplication, but the CLEAN and FC files are
    clean (?)
    CLEAN files: news, few errors, 300,000 words
    FC files: 177,000 words

ftp lola.lllf.uam.es
    /pub/corpus/argentina: 2 million words
    /pub/corpus/chile: 2 million words

Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:
The CRATER Project: the ITU corpus is in the process of postediting. The
trilingual (French/English/Spanish) corpus has more than 3 million words
and is the so-called "White Book on Telecommunications" released by the
International Telecommunication Union. Fernando et al. are working with a
1-million-word subcorpus, which will also be postedited. This corpus,
along with the tagger developed for its tagging and all the resources
associated with the tagger, will be in the public domain in October 1995.
There is a lexicon with more than 35,000 words (full forms, not lemmas),
part-of-speech annotated, that can be used as a starting point in
lexicon-building tasks.

The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. More than
4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles, available on CD-ROM.
Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley

The EU MULTEXT Project is collecting a corpus which will contain parallel
texts from the European Parliament and financial newspaper articles
(Spanish from the Expansion newspaper). Licence agreements for these data
are still being finalized.

The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR: speech and text corpora,
lexicons, NLP programs and tools, and related databases and systems.
ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator

Multilingual Web pages: http://www.XX.relator.research.ec.org (XX=two-letter
country codes of the EU countries such as de, uk, etc.) Only speech
materials.

Briscoe et al. paper reports a 17,000-word tagged corpus. (This is all the
info I have on this paper.)

ftp://parcftp.xerox.com/pub/tagger
Spanish tagger, implemented in Common Lisp. Comes with documentation, works
very well. If you need to install Common Lisp to run it, there are several
good free implementations at
http://www.cs.rochester.edu/users/staff/miller/alu.html.

rnia@campus.mor.itesm.mx:
Sept. 20-23 1995 Nat'l Conference on AI, Morelos, Mexico

cmv@astor.urv.es:
Sept. 25-29 1995 11 Congreso de Lenguajes Naturales y Lenguajes Formales,
Tortosa, Spain
------------------------------------------------------------------------------