Thanks

Derek Lewis (D.R.Lewis@exeter.ac.uk)
Mon, 1 May 1995 16:42:58 +0200

Thanks to all who responded to my request for information about multilingual
text corpora some weeks ago. For those who are interested, here is a summary
of the results. Any errors or misinformation are mine.

--------------------------------------
GENERAL ADDRESSES TO SCAN:
- LDC material: on ftp.cis.upenn.edu:/pub/ldc
- European Corpus Initiative (ECI) information: on the Edinburgh WWW server
- WWW index at www.ims.uni-stuttgart.de/info/FTPServer.html. (See also below).
- lexical@nmsu.edu
(Thanks to Oliver Christ)

-------------------------------------
INTERSECT: a Parallel Corpus Project

Raphael Salkie,
The Language Centre,
University of Brighton
Falmer, Brighton, BN1 9PH
England.
Email: RMS3@BRIGHTON.AC.UK

The INTERSECT (International Sample of English Contrastive Texts) Project at
Brighton University began in the Spring of 1994. The aim is to construct and
analyse a parallel bilingual corpus of French and English written texts, adding
other languages later if resources permit.

So far the corpus contains about 5 megabytes of text in each language. The
material includes newspaper articles, official documents, instructions for
domestic appliances, telecommunications, texts from international organisations,
modern fiction, and academic textbooks.

(Thanks to Raphael Salkie)
------------------------------------
THE LINGUA PROJECT in EUROPE is building multilingual corpora for English,
French, Greek and some others, for use in language pedagogy. Contact
laurent.romary@loria.fr

THE MULTEX PROJECT is building tools for multilingual corpus access, and also a
bunch of sample corpora. Contact veronis@fraix11.univ-aix.fr

THERE IS A SCANDINAVIAN PROJECT to build multilingual
(English/Swedish/Norwegian/Finnish) parallel corpora. Contact
stig.johansson@iba.uio.no

(Thanks to Lou Burnard)
-----------------------------------

THE EUROPEAN SCIENCE FOUNDATION SECOND LANGUAGE ACQUISITION DATA
BANK (ESFSLDB) contains transcribed encounters documenting untutored language
acquisition by adult immigrants. Source languages are Punjabi, Spanish, Finnish,
Italian, Turkish, and (Moroccan) Arabic; target languages are English, French,
Swedish, Dutch, and German. You'll find more details in Perdue, Clive (ed.):
Adult Language Acquisition: cross-linguistic perspectives. 2 vols. Cambridge:
Cambridge University Press 1993.

First language acquisition corpora covering English, French, German, Italian,
and Spanish, available through CHILDES, are described in MacWhinney, Brian: The
Child Language Data Exchange System. Hillsdale, NJ: Erlbaum, 1994.

(Thanks to Helmut Feldweg)
-------------------------------------

THE POMPEU FABRA UNIVERSITY, LANGUAGE RESEARCH INSTITUTE (IULA) in
Barcelona is starting to compile written language corpora. The areas to be
covered are law and economics, starting with Catalan and Spanish but to be
expanded in the future to English, French and German.

(Thanks to Jorge Vivaldi Palatresi)

Universidad Pompeu Fabra
Instituto de Linguistica Aplicada
Rambla Santa Monica 32
08002 Barcelona
Spain

Tel. (34-3) 542 23 28
Fax. (34-3) 542 23 21
e-mail: vivaldi@upf.es
------------------------------------
Barbara Derriks is indexing all the existing corpora of dialogues in French
(some of them are bilingual). The index should be finished within 2 months, and
she has offered to send the list when it is ready.

Barbara Derriks
French Department
Universiteit Gent, Belgium.

(Thanks to Barbara)
--------------------------------------
THE EUROPEAN CORPUS INITIATIVE MULTILINGUAL CORPUS I

The European Corpus Initiative Multilingual Corpus I (ECI/MCI) CD was made
available in April 1994. ECI was founded to oversee the acquisition and
preparation of a large multilingual corpus and supports existing and projected
national and international efforts to carefully design, collect and publish
large-scale multilingual written and spoken corpora.

ECI has produced a multilingual 93 million word corpus covering most of the
major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay
and more. The primary focus in this effort is on textual material of all kinds,
including transcriptions of spoken material.
In order to obtain a copy of the ECI/MCI CD, you will need to sign the necessary
user agreements. These, together with a copy of the full listing of files on the
CD, are obtainable by the following means:

1. anonymous ftp from scott.cogsci.ed.ac.uk/pub/elsnet/eci; or

2. World Wide Web from http://www.cogsci.ed.ac.uk/elsnet/eci.html

(Thanks to ECI)
-----------------------------------
THE UNIVERSITY OF SURREY has a number of text corpora in English with their
'shadows' in German, Spanish, Dutch, French and Welsh. The corpora range from
10,000 words to 300,000 words and all the corpora are domain or subject
specific.

(Thanks to Khurshid Ahmad)
-------------------------------------
A parallel GERMAN-NORWEGIAN CORPUS. Contact:

Cathrine Fabricius-Hansen
Germanistisk institutt
P.b. 1004, Blindern
N-0315 Oslo
e-mail: c.f.hansen@german.uio.no

(Thanks to Cathrine)

----------------------------------------
Prof. Schmied at the TECHNICAL UNIVERSITY OF CHEMNITZ-ZWICKAU is compiling
a TRANSLATION CORPUS OF ENGLISH AND GERMAN.

Contact:
hildegard.schaeffler@phil.tu-chemnitz.de or
josef.schmied@phil.tu-chemnitz.de
----------------------------------------

THE LINGUISTIC DATA CONSORTIUM (LDC) has a large number of corpora,
including parallel texts.

For memberships and details of releases contact:

ftp.cis.upenn.edu under /pub/ldc.
ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html

(Thanks to LDC)
___________________________________________

Derek Lewis
Department of German
University of Exeter
Exeter
UK EX4 4QH
Tel. 01392-264330
Fax. 01392-264377