Corpora: New Corpus from the Linguistic Data Consortium

LDC Office (ldc@unagi.cis.upenn.edu)
Wed, 27 Aug 1997 20:06:32 EDT

Announcing a NEW RELEASE from the
LINGUISTIC DATA CONSORTIUM

Boston University Radio Speech Corpus

The Boston University Radio Speech Corpus was collected by Mari
Ostendorf of Boston University, primarily to support research in
text-to-speech synthesis, particularly generation of prosodic
patterns. The corpus consists of professionally read radio news data,
including speech and accompanying annotations, suitable for speech and
language research.

The corpus includes speech from seven (4 male, 3 female) FM radio news
announcers associated with WBUR, a public radio station. The main
radio news portion of the corpus consists of over seven hours of news
stories recorded in the WBUR radio studio during broadcasts over a two
year period. In addition, the announcers were recorded in a
laboratory at Boston University. In this, the lab news portion, the
announcers read a total of 24 stories from the radio news portion.
The announcers were first asked to read the stories in their non-radio
style and then, 30 minutes later, to read the same stories in their
radio style.

Each story read by an announcer was digitized in paragraph-sized
units, which typically include several sentences. The files were
digitized at a 16 kHz sample rate using a 16-bit A/D converter. The paragraphs
were annotated with the orthographic transcription, phonetic
alignments, part-of-speech tags and prosodic markers. The
orthographic transcripts were generated by hand and indicate
where the speaker took a breath. The phonetic
alignments and part-of-speech tags were generated automatically and
hand corrected. The prosodic labels were marked by hand and are
available only for a subset of the corpus.

Institutions that have membership in the LDC for either the 1996 or
1997 Membership Year will be able to receive the BU Radio Corpus
at no additional charge, in the same manner as all other speech
corpora published by the LDC.

Nonmembers can receive a copy of this corpus, for research purposes
only, for a fee of US$400. If you would like to order a copy of this
corpus, please email your request to ldc@unagi.cis.upenn.edu. If you
need additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or call
(215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when
asked for a password.