Re: [Corpora-List] Date: Wed, 11 Sep 2002 15:16:20 +0200

From: Jörg Tiedemann (joerg@stp.ling.uu.se)
Date: Wed Sep 11 2002 - 18:15:13 MET DST

    I don't know of any single article that summarises the terminology for
    parallel corpora, but in my experience some of the differences are the
    following:

    * bilingual corpora contain exactly two languages
    * a translation corpus should contain the original version and at least
      one translation (but not necessarily only one)
    * a parallel corpus contains translations of a common source but does
      not need to include the original version (even if this sounds strange:
      I know of parallel corpora, e.g. from the EU, which do not indicate the
      original version, and I used to work with some of them without
      knowing or using the original or intermediate documents)
    * parallel corpora should be aligned to some extent so that they can be
      searched by linked segments; alignment can be done e.g. on
      paragraphs or sentences (translation corpora do not have to be
      aligned, I would say)
    * comparable corpora are two or more corpora of similar size and from
      similar domains. People usually assume a similar distribution of
      words/phrases in comparable corpora in order to compare them. They do
      not have to be parallel (or translations of each other)
    * comparable and parallel corpora do not have to include multiple
      languages whereas translation corpora should
    * sometimes I use another term for bilingual parallel corpora: bitexts,
      just to make it shorter. In this case, aligned segments within such
      corpora are bitext segments
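
    The distinctions above can be sketched as a small data model. This is
    only an illustration under stated assumptions: the class Corpus, its
    field names and the functions labels() and bitext_segments() are
    hypothetical and not taken from any existing tool. Alignment is kept as
    a separate flag because the list only says parallel corpora "should" be
    aligned. "Comparable" is left out since it describes a relation between
    two or more corpora rather than a property of a single one.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    """Hypothetical description of one corpus; all field names are illustrative."""
    languages: set[str]
    texts_are_translations: bool  # the texts derive from a common source
    includes_original: bool       # the source version is part of the corpus
    aligned: bool = False         # paragraphs or sentences are linked


def labels(c: Corpus) -> set[str]:
    """Return the terms from the list above that apply to a single corpus."""
    result: set[str] = set()
    if len(c.languages) == 2:
        result.add("bilingual")
    if c.texts_are_translations:
        # Parallel: translations of a common source; the original is optional.
        result.add("parallel")
        # Translation corpus: original plus at least one translation.
        if c.includes_original and len(c.languages) >= 2:
            result.add("translation")
        # Bitext: shorthand for a bilingual parallel corpus.
        if len(c.languages) == 2:
            result.add("bitext")
    return result


def bitext_segments(src: list[str], tgt: list[str]) -> list[tuple[str, str]]:
    """Pair segments that are assumed to be aligned one-to-one already."""
    if len(src) != len(tgt):
        raise ValueError("one-to-one alignment needs equally long segment lists")
    return list(zip(src, tgt))


# Example: an aligned two-language corpus that does not mark the original.
eu = Corpus(languages={"en", "sv"}, texts_are_translations=True,
            includes_original=False, aligned=True)
eu_labels = labels(eu)
```

    Here eu_labels contains "bilingual", "parallel" and "bitext" but not
    "translation", because the original version is not part of the corpus.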

    I hope this helps a bit and does not create even more confusion,

    best regards,

    Jörg

    ***********/\/\/\/\/\/\/\/\/\/\/\************************************
    ** Joerg Tiedemann joerg@stp.ling.uu.se **
    ** Department of Linguistics http://stp.ling.uu.se/~joerg/ **
    ** Uppsala University tel: (018) 471 7007 **
    ** S-751 20 Uppsala/SWEDEN fax: (018) 471 1416 **
    *************************************/\/\/\/\/\/\/\/\/\/\/\**********

    On Wed, 11 Sep 2002 maria_rzewuska@mail.ukie.gov.pl wrote:

    > Hi, I have been reading the list for a while and lately I took a closer
    > look at some bilingual corpus projects and I noticed a relatively flexible
    > use of terms: translation corpus, parallel corpus, comparable corpus, but
    > mainly between the two first. Maybe someone could tell me is there any
    > difference or is it simply mixed up. In the composition of the corpora I
    > did not find any difference which could explain the terminological
    > difference. Any book or clever article that I should read?
    > thanks
    >
    > Maria Rzewuska
    > Adam Mickiewicz University
    > Poznan
    > PL
    >
    >


