New Large Corpora from the LDC

LDC Office (ldc@pine.ling.upenn.edu)
Thu, 23 Mar 1995 10:38:42 EST

The Linguistic Data Consortium (LDC), a nonprofit membership organization
affiliated with the University of Pennsylvania, will add about 20 new releases
to its 48 existing speech, text, and lexical databases during the current 1995
membership year. The new releases will feature text corpora in six languages,
French-English parallel texts, a major telephone speech corpus, and new
additions to the existing ARPA speech recognition and spoken language
understanding series.
Lexicons and large speech corpora in several languages are also in development
and scheduled for release in six to nine months.

Consortium membership is annual, with the membership year (MY) running from
September to August. Each LDC corpus is identified by the MY of its release,
and the annual membership fee purchases a permanent paid-up license to that
MY's releases, except that some corpora, owned by others and distributed by
LDC, may require a separate user agreement and/or charges.

Members receive one copy of each requested LDC corpus free, and extra copies at
a small charge. Nonmember prices are shown in the tables below. Items marked
"MO" are for members only, due to restrictions by the copyright owners.

Detailed information about the LDC and a catalog describing its holdings are
available via ftp or the World Wide Web (see below); the following is a summary
listing of the database titles by year of release.

PLANNED 1995 RELEASES (TENTATIVE)

Nonmember  # of                                             Catalog#/Est.
Price      CDs   Title or Description                       Release Date

$2500      1     KING Speaker Verification                  LDC95S22
5000       2     Hansard French/English                     May 1995
MO         3     CSR-III Speech: Dev and Eval Data          LDC9523
MO         4     CSR-III Text: Language Models              LDC95T24
2000       2     LATINO-40 Spanish Read News Corpus         April 1995
2000       6     WSJCAM0: Cambridge Read News Corpus        LDC95S22
5000       5     PHONEBOOK: NYNEX Isolated Words            April 1995
2500       5     TRAINS spoken dialogs corpus               May 1995
2000       6     Corpus of Spoken American English-1        July 1995
2000       1     TIPSTER Volume 4                           Spring 1995
2500       1     Treebank-2                                 March 1995
MO         1     Spanish News Text Collection               April 1995
MO         2     North American Business News Text          May 1995
MO         1     Japanese Business News Text                June 1995
2500       1     Mandarin News Text                         May 1995
MO         1     French Newspaper Text                      August 1995
MO         1     North American Newspaper Text              August 1995
500        1     Speech Collection Interface SW             June 1995

PLANNED 1996 RELEASES (TENTATIVE)

TBA        14    JEIDA Japanese Speech Data                 Summer 1996
TBA        12    Corpus of Spoken Amer English-2,3          1996
TBA        1     Mandarin Lexicon                           Fall 1995
TBA        1     Spanish Lexicon                            Fall 1995
TBA        1     Japanese Lexicon                           Fall 1995
TBA        1     English Language International News        Fall 1995
TBA        3     Legal Text (500 M words)                   Winter 1996
TBA        6     POLYPHONE-II (American Spanish)            Fall 1995
TBA        2     Mandarin Telephone Speech                  Winter 1996
TBA        2     Japanese Telephone Speech                  Winter 1996
TBA        2     Spanish Telephone Speech                   Winter 1996
TBA        6     CALLFRIEND Language ID Corpus              Winter 1996
TBA        15    SWITCHBOARD (Revised)                      TBA

1993 RELEASES
Nonmember  # of
Price      Disks Title                                      LDC Catalog No.

$100       1     TIMIT                                      LDC93S1
250        2     NTIMIT                                     LDC93S2
750        6     Resource Management Complete               LDC93S3A
1000       6     ATIS0 Complete Set                         LDC93S4A
2000       4     ATIS2                                      LDC93S5
2000       15    CSR-I (WSJ0) Complete                      LDC93S6A
10000      28    SWITCHBOARD                                LDC93S7
1000       1     SWITCHBOARD Credit Card                    LDC93S8
125        1     TI 46-Word                                 LDC93S9
250        3     TIDIGITS                                   LDC93S10
250        1     Road Rally                                 LDC93S11
200        8     HCRC Map Task Corpus                       LDC93S12
25         1     ACL/DCI                                    LDC93T1
1000       1     TIPSTER Volume 1                           LDC93T3-1.1
1000       1     TIPSTER Volume 2                           LDC93T3-2.1
1000       1     TIPSTER Volume 3                           LDC93T3-3.1

1994 RELEASES
Nonmember  # of
Price      Disks Title                                      LDC Catalog No.

10000      34    CSR-II (WSJ1) Complete                     LDC94S13A
5000       19    CSR-II (WSJ1) Sennheiser                   LDC94S13B
5000       20    CSR-II (WSJ1) Other                        LDC94S13C
2500       8     Air Traffic Control                        LDC94S14
2500       2     SPIDRE                                     LDC94S15
1000       1     YOHO Speaker Verification                  LDC94S16
200        1     OGI Multilanguage Corpus                   LDC94S17
100        1     OGI Spelled & Spoken Word                  LDC94S18
5000       3     ATIS3                                      LDC94S19
2500       9     BRAMSHILL                                  LDC94S20
10000      7     MACROPHONE (American English)              LDC94S21
5000       3     UN Parallel Text (Complete)                LDC94T4A
2500       1     UN Parallel Text (English)                 LDC94T4B-1
2500       1     UN Parallel Text (French)                  LDC94T4B-2
2500       1     UN Parallel Text (Spanish)                 LDC94T4B-3.1
35         1     ECI Multilingual Text                      LDC94T5
150        1     CELEX Lexical Database                     LDC94L1
10000      1     COMLEX English Syntax Lexicon, Version 0   LDC94L2
10000      1     COMLEX Pronouncing Dictionary, Version 0   LDC94L3

PRICES AND CONDITIONS OF PURCHASE

The following are the procedures and conditions for obtaining corpora from the
LDC:

For LDC Members:

Membership fees for commercial organizations are $20,000 per year; fees for
nonprofit organizations and government agencies are $2,000 per year. Commercial
members receive commercial rights to all resources, except where restricted by
the original copyright holders.

Notices are mailed to all members when new data sets are available. When corpora
are re-issued in revised, enhanced, or supplemented form, unless the reason is
defective materials, they will be distributed to all those whose LDC membership
is current at the time of re-issue.

For Nonmembers:

With the exception of items marked ``Members Only'' (MO), nonmembers may
purchase single copies of most listed items. Prices are set by the LDC from
time to time, and normally include a permanent ``research-only'' license
(i.e., no commercial use).

Payment may be made by check drawn on a bank with branches in the United
States, or may be wired to: Mellon Bank East, ABA NO. 03100003, Philadelphia,
PA, for credit to The Trustees of the University of Pennsylvania, Account No
2945020, Attn: Sarah Parnum 215-898-0464.

Prices are subject to change; the prices above are effective until 1 June 1995.
Nonmembers must purchase a minimum of $200 in databases, and add a shipping
charge for each order: $30 for the US and Canada, $50 overseas.
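
As an illustration of the nonmember ordering rules above, here is a minimal
Python sketch that totals an order. The function and constant names are
hypothetical, and it assumes that the $200 minimum applies to the database
subtotal before shipping and that shipping is charged once per order.

```python
# Hypothetical helper: names and structure are illustrative only.
MINIMUM_ORDER = 200                            # minimum database purchase, in USD
SHIPPING = {"us_canada": 30, "overseas": 50}   # per-order shipping charge, in USD


def nonmember_order_total(prices, destination):
    """Return the total cost in USD of a nonmember order.

    prices      -- list of nonmember prices for the items ordered
    destination -- "us_canada" or "overseas"
    """
    subtotal = sum(prices)
    # Assumption: the minimum is checked against the subtotal before shipping.
    if subtotal < MINIMUM_ORDER:
        raise ValueError("nonmember orders must total at least $200 in databases")
    return subtotal + SHIPPING[destination]
```

For example, an order for TIMIT ($100) and TI 46-Word ($125) shipped within the
US would come to nonmember_order_total([100, 125], "us_canada"), that is,
$225 in databases plus $30 shipping.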

FOR MORE INFORMATION, including membership forms and catalogs:

LDC files are available by anonymous ftp at ftp.cis.upenn.edu under /pub/ldc.
When accessing by ftp, use "anonymous" as your userid and your email address
as the password.
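
The anonymous ftp login described above could be scripted roughly as follows.
This is only a sketch using Python's standard ftplib module; the function
names are hypothetical, the host and directory are the ones given in this
announcement, and nothing here guarantees that the server is reachable.

```python
from ftplib import FTP

LDC_HOST = "ftp.cis.upenn.edu"   # host named in this announcement
LDC_DIR = "/pub/ldc"             # directory named in this announcement


def anonymous_credentials(email):
    """Return the (userid, password) pair used for anonymous ftp."""
    return ("anonymous", email)


def list_ldc_directory(email, host=LDC_HOST, directory=LDC_DIR, timeout=30):
    """Log in anonymously and return the file names in the LDC directory."""
    user, password = anonymous_credentials(email)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(directory)
        return ftp.nlst()
```

Calling list_ldc_directory with your own email address would attempt the
connection; the credentials helper is kept separate so it can be checked
without network access.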

The LDC's World Wide Web Home Page holds the LDC catalog and the "README" files
from most of the databases. It can be accessed at URL:

ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html