Re: Corpora: Studies about proportion of words in languages ?

From: Jean Veronis (Jean.Veronis@newsup.univ-mrs.fr)
Date: Tue Jun 06 2000 - 16:26:44 MET DST

    At 15:42 06/06/2000 +0200, Marcelo Sztrum wrote:
    >Dear list members,
    >
    >Do you know of any comparative and/or quantitative studies of the
    >ratio of word counts (in written corpora) between one language and
    >another (e.g.: *** for every X (1000??) Spanish/English words, there are
    >Y (about 700????) German words, etc. ***)?

    This was measured within the ARCADE project
    (http://www.up.univ-mrs.fr/~veronis/arcade) for French/English.

    The French-to-English word ratio between corresponding segments ranges
    from 1.08 to 1.16 depending on the text, French being the longer. This was
    measured on a corpus of ca. 1.5 million words, manually aligned at the
    sentence level.
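
    For illustration only, a ratio of this kind could be computed roughly as
    in the sketch below. It assumes whitespace tokenization and a list of
    already aligned (French, English) segment pairs. The function name
    word_ratio and the two sample pairs are hypothetical, invented for this
    example; they are not ARCADE data, and the tokenization actually used in
    the project may differ.

```python
def word_ratio(aligned_pairs):
    """Return total French words divided by total English words.

    aligned_pairs: iterable of (french_segment, english_segment) strings.
    Words are counted by splitting on whitespace (an assumption).
    """
    fr_total = 0
    en_total = 0
    for fr_segment, en_segment in aligned_pairs:
        fr_total += len(fr_segment.split())
        en_total += len(en_segment.split())
    if en_total == 0:
        raise ValueError("no English words in the aligned segments")
    return fr_total / en_total


# Invented sample pairs (not taken from the ARCADE corpus).
pairs = [
    ("Il pleut depuis ce matin", "It has rained since morning"),
    ("Je ne le sais pas encore", "I do not know yet"),
]

print(f"French/English word ratio: {word_ratio(pairs):.2f}")
```

    Summing the counts over all segments before dividing weights long
    segments more heavily than averaging per-segment ratios would; which
    of the two is appropriate depends on the question being asked.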

    To appear soon (this summer):

    Véronis, J. & Langlais, Ph. (2000). Evaluation of parallel text alignment
    systems: The ARCADE project. In J. Véronis (Ed.), Parallel Text Processing:
    Alignment and use of translation corpora (pp. 369-388). Dordrecht: Kluwer
    Academic Publishers.

    Jean Véronis
    http://www.up.univ-mrs.fr/~veronis

    PS: see also the bibliography on parallel texts at:
    http://www.up.univ-mrs.fr/~veronis/biblios/ptp.html


