Re: Corpora: Multi-lingual Corpora?

Bill Fisher (william.fisher@nist.gov)
Tue, 27 Apr 1999 09:16:59 -0400

Charles -

You may find some gems in the "Call Home" corpus (or
collection) that's available from the LDC. These are
recorded telephone calls from one person here in the
States to a friend or relative back home in another
country. It's categorized by the main language
of the call -- English, Mandarin, etc. In each of
these, a secondary language occurs fairly often.
Because we've been interested only in monolingual speech
recognition, we haven't used the utterances in the
secondary language, but they're there. Their
presence is marked in the transcriptions, although
the individual words are not generally transcribed.

The same holds for the "Call Friend" corpus, also
available from the LDC, which records calls between
friends in the States, although I believe English
occurs more often in these calls, in both the English
and the non-English categories.

- Bill F. /NIST