Re: Corpora: Relative text length

From: James L. Fidelholtz (
Date: Thu Apr 25 2002 - 17:27:25 MET DST


    Andrew and Spela:
            Just a word of caution: studies like Spela's provide interesting
    and suggestive data, but figures will surely vary, depending on the
    translator, topic, etc. [all the usual sociolinguistic caveats apply
    here] (and note Jean's contribution, with varying rates). I was a
    coauthor of a study comparing English and Spanish, which basically tried
    to get Spanish to fit into the standard readability curves in a fairly
    simple way. We were only partially successful. The counts were
    hand-done by yours truly, on a variety of types of text,
    pseudo-randomly sampled, and especially translations from one
    language to the other, as well as translations from 3rd languages
    [French & German] into each. To the best of my recollection (I could
    look up the exact figures if anyone is hot for them), our results for
    Spanish-English were rather close to Jean's for French (I assume his
    were done by computer on large amounts of text). If this holds up [not
    surprising, given the close relationship of French and Spanish], it may
    indicate that, for this kind of data, not such a huge amount of text is
    really necessary.

    On Wed, 24 Apr 2002, spela vintar wrote:

    >Hi Andrew,
    >for Eastern-European languages you can compare the lengths of Orwell's 1984
    >and its translations that were collected within the Multext-East project.
    >The original Multext project (
    >should provide the same for English, German, French, Spanish etc., however I
    >wasn't able to find it on their homepage at first glance...
    >Below we give an estimate for the number of words, by language. The
    >wordcounts were produced by removing the SGML tags from the texts and then
    >using a 'wc'-like procedure.
    > Language    Words
    > English     104.302
    > Romanian    101.460
    > Slovene      91.619
    > Bulgarian    87.235
    > Czech        80.366
    > Hungarian    81.147
    > Estonian     79.334
    >Andrew Bredenkamp wrote:
    >> Hello everyone,
    >> Does anyone know where I can find a list of relative text length?
    >> Taking one language as an index (100), I would like a list of the (other)
    >> main European languages - e.g. (made up):
    >> Spanish: 100
    >> English: 105
    >> French: 110
    >> German: 85
    >> ... etc.
    >> Thanks a lot in advance for any help you can give me.
    >> Cheers,
    >> Andrew
    >> =========================================
    >> Andrew Bredenkamp
    >> acrolinx GmbH
    >> URL:
    >> =========================================
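
For anyone who wants to turn the word counts quoted above into the kind of index Andrew asks for (one language = 100), here is a minimal Python sketch. It is not the procedure Spela actually used: the tag-stripping regex and whitespace tokenisation are only an assumed approximation of "remove the SGML tags, then 'wc'", and the function names count_words and relative_index are made up for illustration. The counts are copied from the quoted message, reading the dot as a thousands separator.

```python
import re

# Word counts for Orwell's 1984 as quoted in the message above
# (e.g. "104.302" is read as 104302 words).
COUNTS = {
    "English": 104302,
    "Romanian": 101460,
    "Slovene": 91619,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Hungarian": 81147,
    "Estonian": 79334,
}


def count_words(sgml_text):
    """Rough 'wc'-like count: drop anything that looks like a tag,
    then count whitespace-separated tokens.
    Assumption: entities and comments are not treated specially."""
    without_tags = re.sub(r"<[^>]+>", " ", sgml_text)
    return len(without_tags.split())


def relative_index(counts, base):
    """Scale every count so that the base language equals 100."""
    base_count = counts[base]
    return {lang: 100.0 * n / base_count for lang, n in counts.items()}


if __name__ == "__main__":
    for lang, value in relative_index(COUNTS, "English").items():
        print(f"{lang:<10} {value:6.1f}")
```

As James cautions, such figures describe one text and one set of translators, so the resulting index is suggestive at best.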

    James L. Fidelholtz			e-mail:
    Posgrado en Ciencias del Lenguaje	tel.: +(52-2)229-5500 x5705
    Instituto de Ciencias Sociales y Humanidades	fax: +(01-2) 229-5681
    Benemérita Universidad Autónoma de Puebla, MÉXICO
