New Corpus from the LDC

LDC Office
Fri, 11 Aug 1995 15:38:48 EDT

Announcing a NEW RELEASE from the

Air Travel Information System

This set of discs contains a corpus of speech and natural language
data collected under the auspices of the Advanced Research Projects
Agency Spoken Language Systems (ARPA-SLS) technology development
program. The corpus, which contains data in the Air Travel Information
Services (ATIS) domain, was designed by the ARPA-SLS Multi-site ATIS
Data COllection Working (MADCOW) group and was collected by five sites
at locations across the U.S.:

BBN Systems & Technologies, Cambridge, MA

Carnegie Mellon University, Pittsburgh, PA

MIT Laboratory for Computer Science, Cambridge, MA

National Institute of Standards and Technology, Gaithersburg, MD

SRI International, Menlo Park, CA

The corpus on this set of discs is part of the third phase of
collection of ATIS data (ATIS3) and comprises the development test
(NIST Speech Disc 17-4.2) and evaluation test material (NIST Speech
Disc 17-5.1) used in the December 1994 ARPA SLS Benchmark Tests. As
in the previous ATIS corpora, the speech contained in this corpus was
elicited by presenting subjects with various hypothetical travel
planning scenarios to solve. The resulting spontaneous spoken queries
were recorded as the subjects interacted with partially or completely
automated ATIS systems to solve the scenarios. Note that the ATIS3
training data is available on NIST Speech Discs 17-1.1 through 17-3.1.

The recorded speech has been transcribed and annotated with
categorizations and canonical reference answers. All of the
utterances on these discs have been recorded using a close-talking,
noise-cancelling head-mounted Sennheiser microphone. For some
subjects, secondary (noisier) microphone data was recorded
simultaneously as well.

These discs also contain the ATIS3 46-city/52-airport relational
database, a revised Principles of Interpretation, and test
implementation and scoring instructions, as well as other general
documentation.

The ATIS3 corpus has been verified, collated, documented, and produced
on CD-ROM by the National Institute of Standards and Technology
(NIST) in cooperation with MADCOW and distributed by the Linguistic
Data Consortium (LDC).

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page. Information
is also available via ftp under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.