Corpora: New Corpus from the Linguistic Data Consortium

LDC Office (ldc@unagi.cis.upenn.edu)
Sun, 31 Aug 1997 13:43:49 EDT

Announcing a NEW RELEASE from the
LINGUISTIC DATA CONSORTIUM

SWITCHBOARD-1 Release 2

The Switchboard-1 Telephone Speech Corpus was originally collected by
Texas Instruments in 1990-91, under DARPA sponsorship. The first
release of the corpus was published by NIST and distributed by the LDC
in 1992-93. Since that release, a number of corrections have been made
to the data files as presented on the original CD-ROM set, and all
copies of the first pressing have been distributed.

SWITCHBOARD is a collection of about 2400 two-sided telephone
conversations among 543 speakers (302 male, 241 female) from all areas
of the United States. A computer-driven "robot operator" system
handled the calls, giving the caller appropriate recorded prompts,
selecting and dialing another person (the callee) to take part in a
conversation, introducing a topic for discussion, and recording the
speech from the two subjects into separate channels until the
conversation was finished. About 70 topics were provided, of which
about 50 were used frequently. Selection of topics and callees was
constrained so that: (1) no two speakers would converse together more
than once, and (2) no one spoke more than once on a given topic.
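
As an illustration only, the two selection constraints above can be
written as a check over a list of conversation records. The sketch
below is in Python; the record layout (caller, callee, topic) and the
function name find_violations are hypothetical and are not taken from
the corpus documentation.

```python
from collections import Counter

def find_violations(conversations):
    """Return (repeated_pairs, repeated_speaker_topics).

    Each record is a hypothetical (caller_id, callee_id, topic_id) tuple;
    this layout is an assumption made for illustration.
    """
    pair_counts = Counter()
    speaker_topic_counts = Counter()
    for caller, callee, topic in conversations:
        # Constraint (1): a pair is unordered, so A-B is the same as B-A.
        pair_counts[frozenset((caller, callee))] += 1
        # Constraint (2): both participants count as speaking on the topic.
        speaker_topic_counts[(caller, topic)] += 1
        speaker_topic_counts[(callee, topic)] += 1
    repeated_pairs = sorted(
        tuple(sorted(pair)) for pair, n in pair_counts.items() if n > 1
    )
    repeated_speaker_topics = sorted(
        key for key, n in speaker_topic_counts.items() if n > 1
    )
    return repeated_pairs, repeated_speaker_topics
```

An empty result for both lists means the records satisfy both
constraints; any entry names the speaker pair, or the speaker and
topic, that occurs more than once.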

In this new release, assembled and published by the LDC, all known
errors affecting the original publication of speech files have been
corrected. In addition, modifications have been made to the contents
of the NIST Sphere headers of all speech files, to identify each file
as being part of the new release, and to make the usage of the
"sample_count" header field consistent with standard Sphere usage. (In
particular, the "sample_count" field should reflect the number of
samples on each channel in the file. In the initial release, this
field was improperly set to be the total number of samples in both
channels of the file; this has been corrected in the new release.)
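
For readers with scripts written against the first release, the
relationship between the old and new "sample_count" values can be
sketched as follows. This is a minimal Python illustration, not LDC
software; the function names are hypothetical, and the sampling rate
is left as a parameter because it is not stated in this announcement.

```python
def per_channel_sample_count(total_samples, channel_count=2):
    """Convert a first-release style count (all channels summed)
    into the per-channel count used by standard Sphere headers."""
    if channel_count < 1:
        raise ValueError("channel_count must be at least 1")
    if total_samples % channel_count != 0:
        # Every channel holds the same number of samples, so the
        # total must divide evenly.
        raise ValueError("total_samples is not a multiple of channel_count")
    return total_samples // channel_count

def duration_seconds(sample_count, sample_rate):
    """Duration implied by a per-channel sample count.

    sample_rate (samples per second) must be read from the file's
    own header; no value is assumed here.
    """
    return sample_count / sample_rate
```

Code that computed a file's duration from the old header value would
have obtained twice the true length for a two-channel file, which is
the kind of discrepancy the header correction removes.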

SWITCHBOARD-1 Release 2 is distributed in a notebook-style binder with
23 CD-ROMs. The intermediate version of the corresponding transcripts
is available separately.

Institutions that have membership in the LDC during the 1997
Membership Year will be able to receive SWITCHBOARD-1 Release 2 at no
additional charge, in the same manner as all other text and speech
corpora published by the LDC.

Nonmembers can receive a copy of SWITCHBOARD-1 Release 2, for research
purposes only, for a fee of $10,000. If you would like to order a copy
of this corpus, please email your request to
ldc@unagi.cis.upenn.edu. If you need additional information before
placing your order, or would like to inquire about membership in the
LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp
at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.