Corpora: French corpora and software - Summary

From: NOELLE-VERONIQUE SERPOLLET (n.serpollet@lancaster.ac.uk)
Date: Thu Jun 15 2000 - 13:27:22 MET DST

    Dear list members,

    Thank you to everyone who helped me with my query regarding
    "Parallel corpora and French software". Here is a summary of the
    replies I received:

            * software that I could use to tag/analyse my French data

    Michael Barlow is currently developing ParaConc.
    <The new version will be based on the code from MonoConc Pro and will
    <be similar in functionality (but with more functions) to the one that
    <you are using, [ParaConc, 1995], but the underlying code will be
    <different.

    http://jupiter.inalf.cnrs.fr/WinBrill/
    (Maria José Ribeiro <mj.ribeiro@NETC.PT>)

            * tagger/concordancer which would enable me to retrieve
            occurrences of the French subjunctive

    Cordial 6 Universités has a tagger/lemmatizer for French which does this:
    1 Il il PPER3S
    2 faut falloir VINDP3S
    3 que que SUB
    4 je je PPER1S
    5 vienne venir VSUBP1S
    6 . . PCTFORTE
    (Jean Veronis, http://www.up.univ-mrs.fr/~veronis)
    For more information, contact SYNAPSE Développement
    www.synapse-fr.com
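
    To illustrate how tagged output like the sample above could be
    searched for subjunctives, here is a minimal Python sketch. It
    assumes one token per line in the form "index token lemma tag",
    separated by whitespace, and it assumes that subjunctive verb tags
    begin with "VSUB", which is inferred only from the single VSUBP1S
    line shown above. The function names are hypothetical, and the
    actual Cordial output format and tagset may differ.

```python
def parse_tagged(text):
    """Parse lines of the assumed form 'index token lemma tag'."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that does not look like a four-column token line.
        if len(parts) != 4 or not parts[0].isdigit():
            continue
        index, token, lemma, tag = parts
        rows.append((int(index), token, lemma, tag))
    return rows


def find_subjunctives(rows, prefix="VSUB"):
    """Return rows whose tag starts with the assumed subjunctive prefix."""
    return [row for row in rows if row[3].startswith(prefix)]


# Sample taken from the tagged sentence quoted in this message.
SAMPLE = """\
1 Il il PPER3S
2 faut falloir VINDP3S
3 que que SUB
4 je je PPER1S
5 vienne venir VSUBP1S
6 . . PCTFORTE
"""

if __name__ == "__main__":
    for index, token, lemma, tag in find_subjunctives(parse_tagged(SAMPLE)):
        print(index, token, lemma, tag)
```

    Note that the conjunction tag SUB is not matched, since the test is
    on the start of the tag. Tokens containing spaces would need a
    different column separator than the one assumed here.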

            * gather a French/English parallel corpus (with the texts
            being aligned if possible).

    <ARCADE corpus of ca. 1.5M words of Fr/En texts aligned at sentence
    <level:
    <http://www.up.univ-mrs.fr/~veronis/arcade

    <The corpus is distributed by ELRA:
    <http://www.icp.grenet.fr/ELRA/home.html
    (Jean Veronis, veronis@up.univ-mrs.fr)

    Tim Johns' website: http://web.bham.ac.uk/johnstf/timconc.htm

    <He's been working on parallel concordancing within the Lingua
    <project on multilingual parallel concordancing. I'm not
    <quite sure whether you'll find actual corpora there, but
    <there may be something, plus probably useful links.
    (Antoine Consigny, anconsig@liverpool.ac.uk, anconsig@yahoo.fr)

    Two corpora, primarily political and legislative in their content,
    are available from the LDC:

    <UN Parallel Text (English/Spanish/French)
    <http://morph.ldc.upenn.edu/Catalog/LDC94T4A.html

    <-- you can request just the English and French data, if you
    <prefer; the full corpus is a 3-cdrom set, with one language per
    <cdrom, one text document per data file, and alignment at the level
    <of document/file only.

    <Canadian Hansards (French/English)
    <http://morph.ldc.upenn.edu/Catalog/LDC95T20.html

    <-- a single cdrom containing
    <two distinct sets of parallel text; one set is aligned at the
    <sentence level, and the other (smaller) set is aligned at the
    <paragraph level (with additional alignment data for individual
    <word tokens within paragraphs).

    Please write to ldc@ldc.upenn.edu if you would like further
    information or are interested in purchasing either of these
    collections.
    (Shannon Sears, Linguistic Data Consortium, ssears@ldc.upenn.edu
    www: http://www.ldc.upenn.edu)

    I hope this will be of interest to a lot of members.
    Noelle
    ---------------------
    Noëlle SERPOLLET
    Department of Linguistics and MEL
    Lancaster University,
    LANCASTER, LA1 4YT, UK
    e-mail: n.serpollet@lancaster.ac.uk


