[Corpora-List] Parallel texts for machine translation evaluation

From: D Elliott (debe@comp.leeds.ac.uk)
Date: Wed May 21 2003 - 12:04:27 MET DST


    Dear all

    I am collecting parallel texts for a corpus designed specifically for MT
    evaluation (to be made available online for research) and would appreciate
    any advice on where to find parallel texts of a particular kind:

    Source texts/extracts of approx. 400 words each in:
    French, Italian, German, Spanish, Chinese (Simplified and/or Traditional),
    Japanese, Russian and Portuguese.

    The challenge is that these must have high-quality human English
    translations which can be used as a 'gold standard' against which we
    can compare MT output (NB: British English if possible). I am just
    beginning to realise how difficult a task I have set myself! Another
    problem is that some multilingual sites are localised to such an extent
    that parts have been rewritten rather than translated - doh!

    The kinds of texts in the corpus will represent current MT use. The
    following (provisional) categories have been selected, following a
    worldwide survey of MT users:

    Technical documents (e.g. software user manuals, online help, telecoms,
    automotive, aerospace)
    Correspondence (letters/emails)
    Academic papers
    Tourist/travel information
    Newspaper articles
    Medical documents
    Scientific documents
    Financial documents (stock exchange reports, banking, insurance)
    Legal documents (including patents)
    Calls for tender
    Internal company documents (e.g. minutes, training material, company
    reports)

    Any URLs or other sources (even on paper!) would be gratefully received.
    Sources which do not require copyright permission would also be a big
    time-saver. All sources will obviously be acknowledged in the corpus.

    I will post a summary of feedback as soon as the deluge stops (wishful
    thinking!)

    Debbie Elliott

    For more information on the project so far, see:
    Elliott, Debbie; Hartley, Anthony; Atwell, Eric. Rationale for a
    multilingual corpus for machine translation evaluation. In: Archer, D.,
    Rayson, P., Wilson, A. & McEnery, T. (eds.) Proceedings of CL2003:
    International Conference on Corpus Linguistics, pp. 191-200. Lancaster
    University, 2003.

    ***************************************************
    Debbie Elliott
    Computer Vision and Language Research Group,
    School of Computing,
    University of Leeds,
    Leeds LS2 9JT
    United Kingdom.
    Tel: 0113 3436818
    Email: debe@comp.leeds.ac.uk
    ***************************************************


