- Alice
------------------------------------------------------------------------------
Spanish
European Corpus Initiative corpora available on CD-ROM:
ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words
ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs to a
national publisher, in existence for 40 years. Different writing styles,
500,000 words.
ECI2/MUL04/MSP04A-J: Telecommunication user manual, several hundred
thousand words.
ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words
ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words
ECI4/SPA03: Minimally processed by ECI; contains errors and duplication,
but the CLEAN and FC files are clean (?)
El Diario Vasco, newspaper
CLEAN files, news, few errors, 300,000 words
FC files, 177,000 words
ftp lola.lllf.uam.es /pub/corpus/argentina 2 million words
/pub/corpus/chile 2 million words
Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:
The CRATER Project: ITU corpus in the process of postediting. The
trilingual (French/English/Spanish) corpus has more than 3 million words
and is the so-called "White Book on Telecommunications" released by the
International Telecommunications Union. Fernando et al. are working with a
1-million-word subcorpus, which will also be postedited. This corpus, along
with the tagger developed for it and all the resources associated with the
tagger, will be in the public domain in October 1995. There is a lexicon
with more than 35,000 words (full forms, not lemmas), part-of-speech
annotated, that can be used as a starting point in lexicon-building tasks.
The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. More than
4 million words of clean, high-quality written text.
Archivo Digital de Manuscritos y Textos Españoles available on CD-ROM.
Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley
The EU MULTEXT Project is collecting a corpus that will contain parallel
texts from the European Parliament and financial newspaper articles
(Spanish from the Expansion newspaper). Licence agreements for these data
are still being finalized.
The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR: speech and text corpora,
lexicons, NLP programs and tools, and related databases and systems.
ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator
Multilingual Web pages: http://www.XX.relator.research.ec.org (XX = two-letter
country codes of the EU countries such as de, uk, etc.) Only speech
materials.
Briscoe et al. paper reports a 17,000-word tagged corpus. (This is all the
info I have on this paper.)
ftp://parcftp.xerox.com/pub/tagger
Spanish tagger, implemented in Common Lisp. Comes with documentation and
works very well. If you need to install Common Lisp to run it, there are
several good free implementations at
http://www.cs.rochester.edu/users/staff/miller/alu.html.
rnia@campus.mor.itesm.mx:
Sept. 20-23 1995 Nat'l Conference on AI, Morelos, Mexico
cmv@astor.urv.es:
Sept. 25-29 1995 11 Congreso de Lenguajes Naturales y Lenguajes Formales,
Tortosa, Spain
------------------------------------------------------------------------------
-.