Corpora: ELRA News

Valérie Mapelli (mapelli@elda.fr)
Mon, 02 Nov 1998 10:28:28 +0100

[ We apologise for the duplicate posting of this announcement ]

___________________________________________________________
ELRA
European Language Resources Association
ELRA News
___________________________________________________________

*** ELRA NEW RESOURCES ***

We are happy to announce new speech resources available via ELRA:

1) ELRA-S0052 FIXED0IT - Italian Fixed Network Speech (SpeechDat(M)) Corpus
- DB1
2) ELRA-S0053 FIXED0IT - Italian Fixed Network Speech (SpeechDat(M)) Corpus
- DB2
3) ELRA-S0054 Chilean Spanish FDB-250
4) ELRA-S0055 Russian SpeechDat-like FDB-1000
5) ELRA-S0056 Slovenian SpeechDat(II) FDB-1000
6) ELRA-S0057 Shanghai Mandarin FDB-1000
7) ELRA-S0058 RVG1 (Regional Variants of German 1, Part 1)

Below is a description of each resource:

1) ELRA-S0052 FIXED0IT - Italian Fixed Network Speech (SpeechDat(M)) Corpus
DB1 Phonetically rich sentences & application oriented utterances

The Italian Fixed Network Speech Corpus version 1.0 was recorded within the
scope of the SpeechDat(M) project (LRE-63314), funded by the European
Commission. Recording was done using a primary rate ISDN interface,
yielding an 8 kHz, 8-bit, A-law coded signal. The data files are
formatted according to the European SAM project. The speech data are
compressed with the GNU gzip program. All software needed to use the corpus
is provided on the CDs.
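
For readers who want to inspect the signal directly, the following is a
minimal sketch of expanding such data to 16-bit linear PCM. It assumes that
each decompressed file holds raw, headerless A-law bytes (one byte per
sample), which should be checked against the documentation on the CDs. The
function names are illustrative and are not part of the software distributed
with the corpus. The expansion follows the standard ITU-T G.711 A-law rule.

```python
import gzip


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                      # undo the even-bit inversion of G.711
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_gzipped_alaw(blob: bytes) -> list[int]:
    """Decompress gzip data and expand every A-law byte (assumed headerless)."""
    return [alaw_to_linear(b) for b in gzip.decompress(blob)]
```

With a hypothetical file name, usage would look like
`decode_gzipped_alaw(open("utterance.gz", "rb").read())`; the resulting
samples are at 8 kHz.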

The corpus contains the speech of about 1000 speakers (about 500 male and
500 female) and was designed to support the creation of voice-driven
teleservices. The callers spoke at least 39 items, comprising:
· isolated and connected digits,
· natural numbers,
· money amounts,
· spelled words,
· time and date phrases,
· yes/no questions,
· city names,
· common application words,
· application words in phrases,
· phonetically rich sentences.
Most items are read, some are spontaneously spoken.

The recordings come with extensive and standardised documentation. All
speech is carefully transcribed at the orthographic level; in addition, a
number of clearly audible non-speech events are included in the
transcription. Moreover, age and regional background of the speakers are
provided. A pronunciation dictionary is added, containing all words that
occur in the corpus, with a corresponding SAMPA broad-class phonemic
transcription.

Validation and premastering of the CD-ROMs were performed by the Speech
Processing Expertise Centre (SPEX), Leidschendam, The Netherlands.

Price for ELRA members:
for research use: 11,000 ECU
for commercial use: 14,000 ECU

Price for non members:
for research use: 20,000 ECU
for commercial use: 20,000 ECU
____________________________________________

2) ELRA-S0053 FIXED0IT - Italian Fixed Network Speech (SpeechDat(M)) Corpus
DB2 Phonetically rich sentences sub-set

See ELRA-S0052 for the description. DB2 is a sub-set of DB1; it contains only
the phonetically rich sentence items.

Price for ELRA members:
for research use: 8,800 ECU
for commercial use: 14,000 ECU

Price for non members:
for research use: 14,000 ECU
for commercial use: 20,000 ECU
____________________________________________

3) ELRA-S0054 Chilean Spanish FDB-250

This speech database gathers Spanish data as spoken in Chile. All
participants are native speakers. The corpus consists of read speech,
including digits and application words for teleservices, recorded through
an ISDN card. The whole database consists of 6.45 hours of speech, with 24
utterances per speaker. There is a total of 250 speakers (68 male, 80
female, 102 untagged). Leaving aside the 102 untagged speakers, the age
classes are distributed as follows: 15 speakers are under 16 years old, 72
speakers are between 16 and 30, 44 speakers are between 31 and 45, and 14
speakers are between 46 and 60.

The callers spoke 74 different items in total:
· isolated digits,
· yes/no,
· common application words.

The data is provided with orthographic transliteration for all 6,000
utterances including 4 categories of non-speech acoustic events. A phonetic
lexicon with canonical transcription in SAMPA is also included.

The speech files are stored as sequences of 8-bit, 8 kHz A-law samples.
Data are stored in the SAM file format.
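
As a rough plausibility check on these figures, the sketch below estimates
the uncompressed signal volume and the mean utterance length from the stated
duration (6.45 hours), utterance count (6,000) and sample format (8 kHz, one
byte per sample). The function name is illustrative, and the estimate ignores
label files and any compression.

```python
SAMPLE_RATE_HZ = 8000   # 8 kHz, as stated in the description
BYTES_PER_SAMPLE = 1    # 8-bit A-law


def corpus_stats(hours: float, utterances: int) -> tuple[float, float]:
    """Return (uncompressed size in MB, mean utterance length in seconds)."""
    seconds = hours * 3600
    size_mb = seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 1e6
    return size_mb, seconds / utterances


size_mb, mean_len = corpus_stats(hours=6.45, utterances=6000)
print(f"{size_mb:.0f} MB of signal data, {mean_len:.2f} s per utterance")
```

This gives roughly 186 MB of signal data and just under 4 seconds per
utterance on average.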

Price for ELRA members: 5,000 ECU
Price for non members: 7,500 ECU
____________________________________________

4) ELRA-S0055 Russian SpeechDat-like FDB-1000

This speech database gathers Russian data. The corpus consists of read and
spontaneous speech, recorded through an ISDN card, and was validated and
accepted according to the SpeechDat(II) database exchange format. The whole
database consists of 72 hours of speech, with approx. 49 prompted
utterances per speaker. A total of 1000 speakers was recorded (500 male,
500 female). These are native speakers from 5 regions, mainly from Moscow
and St. Petersburg (803 speakers). The speakers' age classes are distributed
as follows: 16 speakers are under 16 years old, 340 speakers are between
16 and 30, 345 speakers are between 31 and 45, 255 speakers are
between 46 and 60, and 44 speakers are over 60.

The callers spoke the following items:
· isolated and connected digits,
· natural numbers,
· money amounts,
· spelled words,
· time and date phrases,
· yes/no,
· city names,
· common application words,
· application words in phrases,
· phonetically rich sentences.

The data is provided with orthographic transliteration for all 48,812
utterances including 4 categories of non-speech acoustic events. A phonetic
lexicon with canonical pronunciation is also provided.

The speech files are stored as sequences of 8-bit, 8 kHz A-law samples. The
data is stored in the SAM file format (4 CD-ROMs).

Price for ELRA members: 14,000 ECU
Price for non members: 20,000 ECU
____________________________________________

5) ELRA-S0056 Slovenian SpeechDat(II) FDB-1000

The Slovenian SpeechDat(II) FDB-1000 consists of read and spontaneous
speech, recorded through an ISDN card, and was validated and accepted
according to the SpeechDat(II) database exchange format. The corpus
includes about 1000 speakers (about 500 male and 500 female) who called
over the Slovenian fixed network. All are native speakers of Slovenian from
all dialect regions of Slovenia.

The callers spoke the following items:
· isolated and connected digits,
· natural numbers,
· money amounts,
· spelled words,
· time and date phrases,
· yes/no,
· city names,
· common application words,
· application words in phrases,
· phonetically rich sentences.

The speech files are stored as sequences of 8-bit, 8 kHz A-law samples. The
data is stored in the SAM file format (CD-ROMs). A phonetic lexicon with
canonical transcriptions in SAMPA is also provided.

Price for ELRA members: 14,000 ECU
Price for non members: 20,000 ECU
____________________________________________

6) ELRA-S0057 Shanghai Mandarin FDB-1000

This acoustic database gathers Mandarin data, as spoken in Shanghai as a
first or second Chinese dialect/language. The corpus consists of read
speech, including digits and application words for teleservices, recorded
through an ISDN card. Each speaker was prompted for a total of 70
utterances. About 1000 speakers were recorded (500 male, 500 female).

The callers spoke the following items:
· isolated digits,
· yes/no,
· city names,
· common application words and phrases.

The data is provided with Chinese characters and English translation,
canonical Pinyin transcription including tone markers, and several
categories of non-speech events.

The speech files are stored as sequences of 8-bit, 8 kHz A-law samples.
Signal and annotation files are stored separately.

Price for ELRA members: 10,000 ECU
Price for non members: 15,000 ECU
____________________________________________

7) ELRA-S0058 RVG1 (Regional Variants of German 1, Part 1)

The corpus consists of single digits, connected digits, phone numbers,
phonetically balanced sentences, computer command phrases and spontaneous
speech. Each speaker has read a subcorpus of 85 items:
· 11 single digits (0-9, with the two pronunciations of 2 (‘zwei’, ‘zwo’)),
· 19 connected digits (10-19, 20-100 in steps of ten),
· 12 computer command phrases,
· 30 phonetically balanced sentences,
· 5 six-digit phone numbers,
· 5 seven-digit phone numbers,
· 2 phone numbers with area code,
· 1 minute of spontaneous speech (monologue).
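
The breakdown above can be checked with a few lines of arithmetic. The sketch
below tallies the per-speaker item counts from the list and enumerates the
numbers covered by the "connected digits" category; the dictionary keys are
shorthand labels chosen for this example.

```python
# Item counts per speaker, copied from the list above.
items = {
    "single digits": 11,
    "connected digits": 19,
    "computer command phrases": 12,
    "phonetically balanced sentences": 30,
    "6-digit phone numbers": 5,
    "7-digit phone numbers": 5,
    "phone numbers with area code": 2,
    "spontaneous monologue": 1,
}

# Connected digits: 10-19, then 20-100 in steps of ten.
connected = list(range(10, 20)) + list(range(20, 101, 10))

total = sum(items.values())
print(total, len(connected))
```

The total matches the stated 85 items, and the two ranges yield the 19
connected-digit prompts.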

The speaker was placed in front of a standard IBM-compatible PC. The
background noise was limited to the usual noise of an office environment,
e.g. door slams, background crosstalk, phones ringing, paper rustling, PC
noise, etc. The speaker's head was between 2 and 4 feet from the screen and
1 to 2 feet from the desktop microphones. The speaker was not forced into a
special position. The speaker wore a Sennheiser HD 410 and was free to use
the keyboard or the mouse in front of him. The three desktop microphones are:
Sennheiser MD 441 U, Telex (Soundblaster) and Talk Back (AT&T). Speakers
were selected to reflect the demographic density of the German-speaking
areas in Europe (including Austria and Switzerland).

The recorded sound samples are stored in NIST SPHERE format. The resolution
is 16 bits. The sampling frequency is 22,050 Hz, except for speakers 001 to
036, who were recorded at 11,025 Hz. Each microphone channel is stored
in a separate file. A transliteration of the spontaneous speech in
Verbmobil format is also provided.
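
Since the sampling rate differs between speakers, it can be useful to read it
from each file rather than hard-code it. The sketch below parses a SPHERE
header as the format is commonly documented: an ASCII block that starts with
"NIST_1A", gives its own size in bytes on the second line, lists
"name type value" fields, and ends with "end_head". Which fields the RVG1
files actually carry is an assumption here; the demo header is synthetic and
the function name is illustrative.

```python
def parse_sphere_header(data: bytes) -> tuple[int, dict]:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    Returns (header_size, fields); audio samples begin at offset header_size.
    """
    first, second = data.split(b"\n", 2)[:2]
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(second.strip())
    fields = {}
    text = data[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue                    # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":                # integer field
            fields[name] = int(value)
        elif kind == "-r":              # real-valued field
            fields[name] = float(value)
        else:                           # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Synthetic header for illustration only, padded to the declared size.
demo = (b"NIST_1A\n   1024\n"
        b"sample_rate -i 22050\n"
        b"sample_n_bytes -i 2\n"
        b"channel_count -i 1\n"
        b"end_head\n").ljust(1024, b" ")
size, fields = parse_sphere_header(demo)
print(size, fields["sample_rate"])
```

For a real file, passing the first few kilobytes read in binary mode should
be enough to recover the header fields.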

RVG1, Part 1 contains 197 speakers recorded through 2 microphones.
(RVG1, Part 2, with 303 speakers recorded through 2 microphones will be
available from the beginning of 1999.)

Price for ELRA members:
for research use: 4,949 ECU
for commercial use: 8,198 ECU

Price for non members:
for research use: 5,838 ECU
for commercial use: 9,898 ECU

=====================================
For further information, please contact:

ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France

Tel: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30
E-mail: mapelli@elda.fr

or visit our Web site:

http://www.icp.grenet.fr/ELRA/home.html
=====================================