UN Parallel corpus - info please!

Raphael Salkie (R.M.Salkie@bton.ac.uk)
Sat, 15 Jul 1995 01:34:43 GMT

I have some questions for anyone out there who has bought the UN parallel
corpus from the LDC:

1. What does the corpus contain? Is it just formal documents, or are there
transcriptions of General Assembly sessions, etc.? Are all the texts of the
same general type, or is there a lot of variety?

2. What condition are the texts in? Are they plain texts, or is there a lot
of garbage in them? Do they include SGML tags?

3. How many of the texts are genuine translations of each other? (I am
particularly interested in English-French, but info about Spanish is also
useful.)

I have been unable to get this information from the LDC. I have a grant to
buy the corpus - I want to know if it is worth the money.

If you've read this far you are probably interested in parallel corpora, so
here's another question. I would also like to obtain (at least some of) the
Canadian Hansard. A well-known corpus linguist (I won't reveal his name, but
it begins with B- and ends with -urnard) recently said to me that he thought
the Canadian Hansard was available for 50 dollars. On the LDC price list, the
cost is 5000 dollars. Can anyone explain why this usually well-informed
source had this idea? Is the corpus, or part of it, genuinely "available" for
a small cost? Please let me know.

Many thanks.

Raphael Salkie
University of Brighton, UK
<rms3@bton.ac.uk>