Corpora: ELRA new resources

Valerie Mapelli (info-elra@calva.net)
Thu, 13 Nov 1997 15:10:06 +0100

[ We apologise for the duplicate posting of this announcement ]

EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA News
=====================================

*** ANNOUNCEMENT OF NEW RESOURCES AVAILABLE FROM ELRA ***

ELRA is happy to announce the update of its catalogue
of Language resources for Language Engineering and Research.


*************************************
* ELRA-W0015 "Le Monde" Text corpus *
*************************************

Electronic archiving of "Le Monde" articles started on 1 January 1987.
Some 200 articles are added every day, and as of October 1997 the
database contains more than 500,000 articles, making it the largest
of its kind among French daily newspapers.

The corpus is available in an SGML-tagged ASCII text format. Each month
consists of some 10 MB of data (circa 120 MB per year).

Data from 1987 to the present are available through ELRA
(each buyer may purchase up to 5 years of data).

Price for ELRA members (for research use only):
o 1 year: 291 ECU
o 2 years: 581 ECU
o 3 years: 872 ECU
o 4 years: 1163 ECU
o 5 years: 1454 ECU

Price for non members (for research use only):
o 1 year: 378 ECU
o 2 years: 756 ECU
o 3 years: 1134 ECU
o 4 years: 1512 ECU
o 5 years: 1890 ECU

*******************************************
* ELRA-L0029 CELEX Dutch lexical database *
*******************************************

The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L.
Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch
Version 3.1, Linguistic Data Consortium, University of Pennsylvania,
Philadelphia, PA, 1995.

Apart from orthographic features, the CELEX database comprises
representations of the phonological, morphological, syntactic and
frequency properties of lemmata. For the Dutch data, frequencies have
been disambiguated on the basis of the 42.4 million word text corpora
of the Dutch Instituut voor Nederlandse Lexicologie.

To ensure greater compatibility across operating systems, the
databases have not been tailored to fit any particular database
management program. Instead, the information is presented in a
series of plain ASCII files, which can be queried with tools such as
AWK and ICON. Unique identity numbers allow the linking of
information from different files.
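
As an illustration of how identity numbers can link information from
different files, here is a minimal sketch in Python. The field
separator (a backslash), the field names and the sample records are
assumptions made for this example only; they are not taken from the
CELEX documentation, so check the files you receive before reusing it.

```python
def load_by_id(lines, field_names, sep="\\"):
    """Index delimited lines by their first field (the identity number).

    The separator and the field names are assumptions for this sketch.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ident, *values = line.split(sep)
        table[ident] = dict(zip(field_names, values))
    return table


def link(*tables):
    """Merge records that share an identity number across all tables."""
    common = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for ident in sorted(common, key=int):
        record = {}
        for table in tables:
            record.update(table[ident])
        merged[ident] = record
    return merged


# Hypothetical sample records, standing in for two separate files.
orthography_lines = ["1\\aap", "2\\boek", "3\\fiets"]
frequency_lines = ["1\\120", "2\\4500"]

linked = link(
    load_by_id(orthography_lines, ["word"]),
    load_by_id(frequency_lines, ["frequency"]),
)

for ident, record in linked.items():
    print(ident, record["word"], record["frequency"])
```

Only entries present in every file are kept, so entry 3, which has no
frequency record in the sample, is left out of the result. With real
files, pass an open file object in place of the sample lists.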

This database can be divided into different subsets:
· orthography: with or without diacritics, with or without word
division positions, alternative spellings, number of letters/syllables;
· phonology: phonetic transcriptions with syllable boundaries or primary
and secondary stress markers, consonant-vowel patterns, number of
phonemes/syllables, alternative pronunciations, frequency per
phonetic syllable within words;
· morphology: division into stems and affixes, flat or hierarchical
representations, stems and their inflections;
· syntax: word class, subcategorisations per word class;
· frequency of the entries: disambiguated for homographic lemmata.

Price for ELRA members:
- for research use: contact ELRA
- for commercial use:
o Complete set of data: 56182 ECU
o Subset Orthography: 6000 ECU
o Subset Phonology: 12273 ECU
o Subset Morphology (Inflectional): 6000 ECU
o Subset Morphology (Derivational): 13636 ECU
o Subset Syntax: 6000 ECU
o Subset Frequency: 12273 ECU

Price for non members:
- for research use: contact ELRA
- for commercial use:
o Complete set of data: 93636 ECU
o Subset Orthography: 10000 ECU
o Subset Phonology: 20454 ECU
o Subset Morphology (Inflectional): 10000 ECU
o Subset Morphology (Derivational): 22727 ECU
o Subset Syntax: 10000 ECU
o Subset Frequency: 20454 ECU

********************************************
For more information, please contact:
ELRA/ELDA
87, Avenue d'Italie
75013 PARIS
Tel: +33 1 45 86 53 00
Fax: +33 1 45 86 44 88
E-mail: info-elra@calva.net
http://www.icp.grenet.fr/ELRA/home.html
********************************************