The LDC United Nations corpus

Mark Liberman (myl@sansom.ling.upenn.edu)
Fri, 11 Aug 1995 09:38:46 EDT

In LINGUIST 6.973 and a parallel query to the corpora list, Raphael
Salkie asked some questions about the UN Parallel Corpus from the
Linguistic Data Consortium (LDC). Before these queries appeared, a
response to an earlier query from Dr. Salkie had already gone out
privately from Rebecca Finch. I certainly urge anyone with
experience in using the UN corpus to respond to Dr. Salkie as well.

We have also placed a sample of 24 (out of 21,000) parallel documents,
each in English, French and Spanish, in:

ftp://ftp.cis.upenn.edu/pub/ldc/data_samples/UN_Par_sample.tar.Z

These samples should also be accessible, along with quite a bit of
other LDC information, from the WWW page at URL

http://www.cis.upenn.edu/~ldc

Let me add a few words about LDC prices and costs, since Dr. Salkie's
message expressed the normal human annoyance at being asked to part
with money, both in the case of the UN corpus and another (not yet
published) LDC parallel text corpus, the Canadian Hansard.

The LDC membership fee for a university is $2,000, and for this fee
everyone at that university can get an unlimited and perpetual
research license for everything that the LDC publishes during the year
of membership. Thus you can join the other ninety current members of
the consortium and get not only the forthcoming Hansard corpus, but
also the other twenty or so databases published this year. For the
same amount, a university can get the UN corpus and the other 15
databases published in 1994.

Whether a particular database, or collection of databases, is worth
that amount of money is of course a matter of individual or
institutional judgement. We feel that $2,000, which is roughly
the cost of a moderately configured PC or an international conference
trip, is not out of line even for university researchers.

Speaking for myself, I have a great deal of sympathy for the effort to
provide research resources free or at minimal cost, and I have been
involved in several successful efforts to bring out such databases
over the years, including the ACL/DCI CD-ROM, the ECI disk, the CELEX
disk, and others offered in the range of $25-$200. These efforts rely
heavily on volunteer labor and other donated resources; in several
cases they have also relied on cash donations from the LDC.

However, volunteer labor is rarely available in the needed quantities;
and of course LDC-supplied cash, as well as the existence of the LDC
as an organization, depends on income from somewhere. The money that
we get from memberships and database sales is a crucial part of the
picture---without it the LDC would not exist, and neither would either
of the databases under discussion.

To highlight the point, the history of the UN publication is worth
reviewing briefly.

We decided to try to publish the UN archives because translation
researchers wanted parallel texts. After several months of
negotiations involving UN representatives and lawyers for both sides,
we paid a New Jersey-based computer consultant to go into the UN
offices at night to back up the archives from dismountable disk
packs for a long-obsolete Wang word processor onto cartridge
tapes. This took several months and cost a considerable sum; we
had to use this particular person because he was an authorized service
rep for the UN facility. Then came six person-months of work at the
LDC. We had to decode the proprietary and undocumented Wang BACKUP
format, and the equally proprietary and undocumented Wang character
set, typographical codes and file structures. We re-organized the
entire archive and translated it into WordPerfect format, and
published a certain number of CD-ROMs in this form for the purposes of
the UN---this was part of the agreement that we made with them for
access to the data. Then we translated the documents into ISO-8859-1
with SGML markup (including a working DTD, for those who care), and
worked out the correspondences among documents. This was far from
trivial, since each UN language had been entered separately, with no
coordination of file names, file dates, or even division of documents
into files, and there were tens of thousands of documents per
language. This work was mainly done by Dave Graff, whose salary the
LDC pays.

We are not likely to recover the costs of acquisition and production
of this database through sales and memberships bought for its
sake. We subsidize our members by cost-sharing with government grants,
or by using income from more popular or less expensive databases to
cover unrecovered costs of less popular or more expensive ones.

In the case of the forthcoming Hansard corpus, which Dr. Salkie also
mentioned, the cost of acquisition and publication has been similar to
that of the UN material, and the same remarks about subsidies
apply. Whether a particular database is worth a certain price is a
matter of individual taste, but as a matter of simple arithmetic, the
fees charged in these two cases are unlikely ever to cover the costs
incurred.

For those who have read this far, I would like to repeat a standing
offer that has been in existence since the beginning of the LDC. If
you are interested in CD-ROM publication of a language-related
database that is plausibly of interest to our membership, and this
database is reasonably close to being in shape to publish, we will pay
the costs of production, using your label design or one worked out
with you; we will give you up to a hundred copies to do with as you
see fit; we will put the item in our catalogue at whatever price you
choose; and we will remit to you any resulting income in excess of our
production costs. The copyright (if any) will remain with you, and we
will handle any user license arrangements that may be necessary,
sending the signed licenses to you. We have published several
databases on this basis, and are planning to publish several others,
although (from past experience) the chances of making back our
production costs are no better than even.

Best wishes,

Mark Liberman                   myl@unagi.cis.upenn.edu

619 Williams Hall
University of Pennsylvania      Phone: 215-898-0141
Philadelphia, PA 19104-6305     Fax: 215-573-2175