Corpora: ELRA Focus - MLCC Multilingual Corpora for Co-operation

Valerie Mapelli (info-elra@calva.net)
Mon, 6 Apr 1998 13:49:28 +0200 (MET DST)


[ We apologise for the duplicate posting of this announcement ]

EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA Focus
=====================================


MLCC Multilingual Corpora for Co-operation

A collection of newspaper articles from financial newspapers
in 6 languages (Dutch, English, French, German, Italian and Spanish)
and a set of parallel texts in the 9 European Union
official languages (as of 1993)
=====================================

The current ELRA catalogue comprises more than 500 language resources (!)
covering speech, written text and terminology. This message is a reminder
that one of them, the MLCC Multilingual Corpora for Co-operation, is
available.

The MLCC text corpus has two main components - one set to allow comparable
studies to be carried out in different languages and one set as the basis for
translation studies.

The first set is referred to as the Polylingual Document Collection (ELRA-W0006),
a collection of newspaper articles from financial newspapers in 6 languages
(Dutch, English, French, German, Italian and Spanish). It consists of the
following sub-corpora:

· Dutch - "Het Financieele Dagblad" - 1992-1993
The corpus contains articles from the Dutch financial newspaper "Het
Financieele Dagblad" editions of 2nd January 1992 through to 24th December
1993. It contains around 8.5 million words of text.

· English - "The Financial Times" - 1993
The corpus contains articles from the British financial newspaper "The
Financial Times" editions from the year 1993. The corpus contains around
30 million words.

· French - "Le Monde" - 1992-1993
A corpus of articles from the French newspaper "Le Monde", consisting of
two years' worth (1992-1993) of articles on financial subjects,
approximately 10 million words.

· German - "Handelsblatt" - 1986-1988
This subcorpus consists of articles from the period 02.01.1986 to
15.06.1988. It contains some 33 million words. It may be possible to
obtain more recent articles from "Handelsblatt".

· Italian - "Il Sole 24 Ore" - 1992-1993
The corpus described here contains articles from the Italian financial
newspaper "Il Sole 24 Ore" from the year 1992. This corpus contains some
1.88 million words. The SGML-markup was done by the University of
Edinburgh.

· Spanish - "Expansion" - 1991 and 1994
This subcorpus contains articles from the Spanish financial newspaper
"Expansion" editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to
27.12.1994. It contains some 10 million words.

Price for ELRA members:
for research use: 360 ECU
for commercial use: 1500 ECU

Price for non-members:
for research use: 750 ECU
for commercial use: 3200 ECU

The second set is a Multilingual Parallel Corpus (ELRA-W0007) consisting of
translated data in nine European languages: Danish, Dutch, English, French,
German, Greek, Italian, Portuguese and Spanish. The parallel data, provided
by the European Commission, comprises two sub-corpora from the Official
Journal of the European Communities:

· Official Journal of the European Communities, C Series:
Written Questions 1993
Records of questions and answers regarding European Community matters.
The data is regularly published as one section of the C Series of the
Official Journal of the European Communities in all official languages
(previously nine). This corpus contains written questions asked by
members of the European Parliament and corresponding answers from
the European Commission in 9 parallel versions. The total size of the
corpus is approximately 10.2 million words (ca. 1.1 million words
per language).

· Official Journal of the European Communities, Annex: Debates of the European
Parliament 1992-1994
This parallel corpus consists of the records of Parliamentary sittings
published as an annex to the Official Journal of the European Communities,
"Debates of the European Parliament". The Parliamentary Debates are a record
of what was said by members at the sittings as well as written input provided
to the sittings. The original data from which the translations are produced
consist of a transcript of the sittings, each member speaking in the
language of his or her choice. The final version consists of nine parallel
versions of the material. The texts delivered comprise the Debates of
Parliament from January 1992 to July 1994. This sub-corpus contains some
5 to 8 million words per language.

Price for ELRA members:
for research use: 120 ECU
for commercial use: 480 ECU

Price for non-members:
for research use: 200 ECU
for commercial use: 800 ECU

********************************************
For more information, please contact:
ELRA/ELDA
55-57 rue Brillat Savarin
75013 PARIS
Tel: +33 1 43 13 33 33
Fax: +33 1 43 13 33 30
E-mail: info-elra@calva.net
http://www.icp.grenet.fr/ELRA/home.html
********************************************