Corpora: ELRA new resources

Valerie Mapelli (info-elra@calva.net)
Wed, 19 Nov 1997 18:56:22 +0100 (MET)

[ We apologise for the duplicate posting of this announcement ]

EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA News
=====================================

*** ANNOUNCEMENT OF NEW RESOURCES AVAILABLE FROM ELRA ***

ELRA is happy to announce the update of its catalogue
of Language Resources for Language Engineering and Research.

*************************************
* ELRA-S0034 Verbmobil *
*************************************

This resource consists of spontaneous speech recorded in a dialog task
(appointment scheduling). The German corpus has a total of 13,910
utterances (turns). The BAS edition of the German part is fully
labelled and segmented into phonemic/phonetic SAM-PA by the MAUS
system and partly segmented manually.

New corpora available via ELRA (for the complete list, please contact
ELRA or visit the ELRA or BAS Web sites):

VM CD 4.0 - VM40 (1 CD-ROM, original edition)
72 Dialogues, 181 Appointments, 1,588 Turns.

VM CD 4.1 - VM41 (1 CD-ROM, new edition)
72 Dialogues, 181 Appointments, 1,588 Turns.
This new edition contains the transliterations of all dialogues, signal
files with PhonDat 2 Header structure, software and speaker
documentation and partitur files*. All files were evaluated
according to BAS guidelines.

VM CD 5.0 - VM50 (1 CD-ROM, original edition)
101 Dialogues, 256 Appointments, 2,154 Turns.

VM CD 5.1 - VM51 (1 CD-ROM, new edition)
101 Dialogues, 256 Appointments, 2,154 Turns.
This new edition contains the transliterations of all dialogues, signal
files with PhonDat 2 Header structure, software and speaker
documentation and partitur files*. All files were evaluated
according to BAS guidelines.

VM CD 6.0 - VM60 (1 CD-ROM, original edition)
American/English and 'Denglish'**. 146 Dialogues, 191
Appointments, 1,828 Turns.

VM CD 6.1 - VM61 (1 CD-ROM, new edition)
American/English and 'Denglish'**. 146 Dialogues, 191 Appointments,
1,828 Turns. This new edition contains the transliterations of all
dialogues, signal files with PhonDat 1 Header structure, software and
speaker documentation. All files were evaluated according to BAS
guidelines.

VM CD 7.0 - VM70 (1 CD-ROM, original edition)
68 Dialogues, 238 Appointments, 1,739 Turns.

VM CD 7.1 - VM71 (1 CD-ROM, new edition)
68 Dialogues, 238 Appointments, 1,739 Turns. This new edition
contains the transliterations of all dialogues, signal files with
PhonDat 2 Header structure, software and speaker documentation and
partitur files*. All files were evaluated according to BAS guidelines.

VM CD 8.0 - VM80 (1 CD-ROM, original edition)
American/English. 167 Dialogues, 167 Appointments, 1,181 Turns.

VM CD 8.1 - VM81 (1 CD-ROM, new edition)
American/English. 167 Dialogues, 167 Appointments, 1,181 Turns.
This new edition contains the transliterations of all dialogues, signal
files with PhonDat 1 Header structure, software and speaker
documentation. All files were evaluated according to BAS guidelines.

VM CD 12.0 - VM120 (1 CD-ROM, original edition)
207 Dialogues, 207 Appointments, 2,154 Turns.

VM CD 12.1 - VM121 (1 CD-ROM, new edition)
207 Dialogues, 207 Appointments, 2,154 Turns. This new edition
contains the transliterations of all dialogues, signal files with
PhonDat 2 Header structure, software and speaker documentation and
partitur files*. All files were evaluated according to BAS guidelines.

VM CD 13.0 - VM13.0 (original edition)
American/English and 'Denglish'** - 90 speakers - 1,714 turns -
200 spontaneous dialogues.

VM CD 13.1 - VM13.1 (new edition)
American/English and 'Denglish'** - 90 speakers - 1,714 turns -
200 spontaneous dialogues - transliteration.

VM CD 14.0 - VM14.0 (original edition)
97 speakers - 1,891 turns - 156 spontaneous dialogues -
transliteration.

VM CD 14.1 - VM14.1 (new edition)
97 speakers - 1,891 turns - 156 spontaneous dialogues -
transliteration - PhonDat 2 headers - Partitur Files*.

* partitur files: files describing the different parts which
constitute the corpus - word order, phrase order, etc.
** 'Denglish': English spoken by Germans.

Price for ELRA members: 76 ECU per CD
Price for non-members: 152 ECU per CD

***********************************************
* ELRA-S0044 SPINA Corpus ("Robots Commands") *
***********************************************

This German corpus contains read speech from 22 different speakers (6
male, 16 female). The corpus consists of 10 robot command sentences
and 62 robot command words. Each speaker read the whole corpus 5
times, except one speaker who read the sentence corpus 16 times and
the word corpus 51 times. The speakers were recorded at two different
sites in Germany (University of Goettingen, University of Bochum).
The corpus contains a total of 10,810 recorded utterances.

All speakers are between 25 and 30 years old. Two speakers are
non-native speakers. One file gives information about the speakers
(speaker ID, recording site, sex).

The task for the speaker was to read carefully but fluently. If an
error occurred, the recording was interrupted by the supervisor and
the sentence was repeated. The signal files are raw files without any
header, 16 bits per sample, linear, most significant byte first, 16 kHz
sampling frequency.
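
For readers who want to load such headerless files, the following is a
minimal sketch of one way to decode them in Python. It assumes the
samples are signed integers (the announcement only says "linear"), and
the function names are illustrative, not part of the corpus software.

```python
import array
import sys

SAMPLE_RATE_HZ = 16_000  # sampling frequency stated in the announcement


def decode_raw_pcm(data: bytes) -> array.array:
    """Decode headerless 16-bit linear PCM, most significant byte first.

    Assumption: samples are signed 16-bit integers.
    """
    if len(data) % 2:
        raise ValueError("expected an even number of bytes (2 per sample)")
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(data)
    if sys.byteorder == "little":
        samples.byteswap()  # data is big-endian; swap on little-endian hosts
    return samples


def duration_seconds(samples: array.array) -> float:
    """Length of the utterance in seconds at the stated 16 kHz rate."""
    return len(samples) / SAMPLE_RATE_HZ
```

A file would be read with open(path, "rb").read() and passed to
decode_raw_pcm; the path and file naming depend on the distribution.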

The orthography of the corpus is given in two distinct files which
contain the prompted words and the prompted sentences as an ordered
list.

The recording conditions are as follows:
Microphone: AKG Acoustics, C414B-TL, condenser microphone,
omnidirectional, built-in attenuator and high pass filter switched off,
distance to mouth 50 cm.
Environment: Studio Quality, echo cancelled room, about 121 qqm
Preamplifier: John Hardy, M-1
Sampling rate: 48 kHz to DAT recorder, filtered to 16 kHz
Resolution: 16 Bit, most significant byte first

The speech data were digitally low-pass filtered with an 8 kHz cut-off
frequency and downsampled to 16 kHz.
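
As an illustration of this kind of processing, the sketch below low-pass
filters a 48 kHz signal at 8 kHz and keeps every third sample, giving
16 kHz. The filter actually used for the corpus is not documented here;
the Hamming-windowed sinc design, the tap count and the function names
are assumptions chosen for the example.

```python
import math


def lowpass_taps(cutoff_hz: float, rate_hz: float, num_taps: int = 101) -> list[float]:
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    fc = cutoff_hz / rate_hz  # cut-off as a fraction of the sampling rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def downsample(samples: list[float], taps: list[float], factor: int) -> list[float]:
    """Filter, then keep every `factor`-th output sample (zero-padded edges)."""
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out
```

With taps = lowpass_taps(8000, 48000), calling downsample(signal, taps, 3)
converts a 48 kHz signal to 16 kHz.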

The corpus consists of 1 volume, with a total size of 266,361 KB of
uncompressed data.

The signal of each utterance is stored in a separate file. Symbolic
information such as segmentations or labelling (e.g. phonological
segmentation of words or word segmentation of sentences) is stored
in files with the same prefix but with different extensions.
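
Given this naming scheme, the files belonging to one utterance can be
collected by their shared prefix. The sketch below shows one possible
way to do that; the file names and extensions in the usage note are
hypothetical, since the announcement does not list the real ones.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_prefix(filenames):
    """Map each file-name prefix (stem) to the sorted list of its extensions."""
    groups = defaultdict(list)
    for name in filenames:
        path = PurePath(name)
        groups[path.stem].append(path.suffix)
    return {stem: sorted(exts) for stem, exts in groups.items()}
```

For example, the hypothetical names "spk01_w001.raw" and "spk01_w001.seg"
would be grouped under the prefix "spk01_w001".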

Price for ELRA members: 76 ECU
Price for non-members: 152 ECU

***********************************************************
* ELRA-S0045 German Pronunciation Rules Set - PHONRUL 9.0 *
***********************************************************

PHONRUL is a collection of computer-readable, underspecifying
pronunciation rules for standard German. The set describes the most
commonly known ways in which German pronunciation deviates from the
so-called canonical or citation form of words. The rules were
derived from empirical analysis of speech corpora as well as from a
multitude of publications on German phonetics. The set does not
contain any dialect-specific rules; however, the line between
Standard German and dialects is indistinct. At present, the rule set is
used at the University of Munich to aid automatic segmentation and
labelling of unknown speech utterances.

The rule set, in its present form, consists of approximately 1,500
complex rules which expand to 5,546 simple replacement rules. The
rule set was designed for extended German SAM-PA, but can be
translated into other alphabets (e.g. Worldbet, IPA) without much
effort.
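
The figures above imply roughly 3.7 simple rules per complex rule. The
sketch below illustrates, under stated assumptions, how an underspecified
rule could expand into simple replacement rules. The rule representation,
the class name PLOSIVE and the example rule are hypothetical and are not
taken from the PHONRUL file format.

```python
from itertools import product


def expand_rule(source, target, classes):
    """Expand one rule whose symbols may be class names into simple rules.

    `source` and `target` are sequences of symbols. Any symbol that is a key
    of `classes` stands for each member of that class, bound consistently on
    both sides of the rule.
    """
    used = [s for s in dict.fromkeys(source) if s in classes]
    simple = []
    for choice in product(*(classes[name] for name in used)):
        binding = dict(zip(used, choice))
        lhs = tuple(binding.get(s, s) for s in source)
        rhs = tuple(binding.get(s, s) for s in target)
        simple.append((lhs, rhs))
    return simple


# Hypothetical example: drop schwa ("@" in SAM-PA) between a plosive and "n".
classes = {"PLOSIVE": ["p", "t", "k"]}
rules = expand_rule(("PLOSIVE", "@", "n"), ("PLOSIVE", "n"), classes)
```

Here one complex rule yields three simple rules, one per member of the
hypothetical PLOSIVE class.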

Price for ELRA members:
o for research use: 76 ECU
o for commercial use: 482 ECU
Price for non-members:
o for research use: 152 ECU
o for commercial use: 964 ECU

********************************************
For more information, please contact:
ELRA/ELDA
87, Avenue d'Italie
75013 PARIS
Tel: +33 1 45 86 53 00
Fax: +33 1 45 86 44 88
E-mail: info-elra@calva.net
http://www.icp.grenet.fr/ELRA/home.html
********************************************
