Corpora: New Corpus from the Linguistic Data Consortium

LDC Office (ldc@unagi.cis.upenn.edu)
Sun, 31 Aug 1997 16:41:31 EDT

Announcing a NEW RELEASE from the
LINGUISTIC DATA CONSORTIUM

The Kids Corpus

This database consists of sentences read aloud by children. It was
originally designed to provide a training set of children's speech
for the SPHINX II automatic speech recognizer, for use in the LISTEN
project at Carnegie Mellon University.

The children range in age from 6 to 11 (see details below) and were in
first through third grades (the 11-year-old was in 6th grade) at the
time of recording. There were 24 male and 52 female speakers.
Although the girls outnumber the boys, we feel that the small
difference in vocal tract length between boys and girls at this age
should make the effect of this imbalance negligible. There are 5180
utterances in all.

The speakers come from two separate populations. Since the LISTEN
reading coach needed good examples of reading aloud, it was decided
that the majority of the speakers should be "good" readers. They were
recorded in the summer of 1995, and were enrolled in either the
Chatham College Summer Camp, or the Mount Lebanon Extended Day Summer
Fun program in Pittsburgh. They were recorded on-site. This set will
hereafter be called SUM95. There are 44 speakers and 3333 utterances
in this set.

The LISTEN system also needed examples of errorful reading and
dialectal variants. The readers who supplied this type of speech come
from a school with a high population of children who are at risk of
growing up to be poor readers and who could therefore benefit from any
reading tutor or other system built upon this database. They
come from Fort Pitt School in Pittsburgh and were recorded in April
1996. This subset will be referred to as FP. There are 32 speakers
and 1847 utterances in this set. The list of speakers, the set they
are in, and the number of sentences per speaker can be found in the
"tables" directory, in the file named "speaker.tbl".
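
As a quick sanity check, the subset counts can be tallied against the
corpus totals. The Python sketch below uses only the figures quoted in
this announcement; the variable and function names are illustrative,
and it does not read "speaker.tbl", whose layout is not described here.

```python
# Figures quoted in the announcement; the dictionary layout is
# illustrative only and is not the format of "speaker.tbl".
SUBSETS = {
    "SUM95": {"speakers": 44, "utterances": 3333},
    "FP": {"speakers": 32, "utterances": 1847},
}
BY_SEX = {"male": 24, "female": 52}


def totals(subsets):
    """Sum speakers and utterances across all subsets."""
    speakers = sum(s["speakers"] for s in subsets.values())
    utterances = sum(s["utterances"] for s in subsets.values())
    return speakers, utterances


speakers, utterances = totals(SUBSETS)
print(speakers, utterances)                 # 76 5180
print(sum(BY_SEX.values()) == speakers)     # True
```

The two subsets account for all 76 speakers and all 5180 utterances,
in agreement with the male/female breakdown given earlier.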

Note that although there is some dialectal variation in the speech of
the SUM95 subset, the speech of the FP subset gives a very good
representation of the dialects of children who may be targeted by the
LISTEN system. However, the user should be aware that the speakers'
dialect partly reflects what is locally called "Pittsburghese".

The text presented to the children was obtained from Weekly Reader
stories. Weekly Reader is a four-page color reading supplement given
out to children in many classrooms. Special reprint permission
granted by Weekly Reader (R), published by Weekly Reader Corporation.
Copyright (c) 1994, 1995 by Weekly Reader Corporation. All Rights
Reserved.

Because of restrictions imposed by the copyright holders, this corpus
is available to 1997 LDC members only.

If you would like to order a copy of this corpus, please email your
request to ldc@unagi.cis.upenn.edu. If you need additional information
before placing your order, or would like to inquire about membership
in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp
at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.