RE: Corpora: Relative text length

From: Tadeusz Piotrowski (tadpiotr@plusnet.pl)
Date: Thu Apr 25 2002 - 20:23:17 MET DST

    Common sense would say that what Prof. Wilks says is right, and I do
    believe he's right. This belief seems supported by the average word
    length in English and Polish:

    Average word length in characters:
    Polish   5.92  (corpus for frequency dictionary from the 60's)
    English  4.26  (LOB)

    Thus, it seems straightforward that a text in English should be shorter
    than in Polish. Actually, it is very difficult to show this is really
    so. One might want to use translations. Here are the results:

    One utility text:

    Original English
    characters 95715
    words 14573

    Translated Polish
    characters 100756
    words 15243

    So far so good.
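
    The comparison can be made explicit with a few lines of Python. This
    is only a sketch: the counts are copied from the figures above, the
    function and variable names are hypothetical, and the message does not
    say whether the character counts include spaces and punctuation, so
    the characters-per-word values are not directly comparable with the
    corpus averages quoted earlier.

```python
# Counts copied from the utility text figures in the message.
english = {"characters": 95715, "words": 14573}
polish = {"characters": 100756, "words": 15243}


def relative_length(translation, original, unit):
    """Length of the translation as an index, with the original set to 100."""
    return 100.0 * translation[unit] / original[unit]


def chars_per_word(text):
    """Raw characters divided by words (may include spaces, an assumption)."""
    return text["characters"] / text["words"]


char_index = relative_length(polish, english, "characters")
word_index = relative_length(polish, english, "words")

print(f"Polish character index: {char_index:.1f}")
print(f"Polish word index:      {word_index:.1f}")
print(f"English chars per word: {chars_per_word(english):.2f}")
print(f"Polish chars per word:  {chars_per_word(polish):.2f}")
```

    On these figures the Polish translation comes out at roughly 105 on
    both measures, with English as 100, and the raw characters-per-word
    ratio is about 6.6 for both texts.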

    But translators have their own individual style. To level that out, I
    checked one English text with three Polish translations.

    B. Singer On the wagon
    Words 4329
    Characters 24028

    translation1
    Words 3396
    Characters 22237

    translation2
    Words 3636
    Characters 23866

    translation3
    Words 3380
    Characters 22119

    And that is surprising: it is the English text that is longer.
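
    The same index can be computed for each of the three translations.
    Again this is a sketch in Python with hypothetical names, using only
    the counts listed above, with the English original set to 100.

```python
# Counts copied from the Singer text and its three Polish translations.
original = {"words": 4329, "characters": 24028}
translations = {
    "translation1": {"words": 3396, "characters": 22237},
    "translation2": {"words": 3636, "characters": 23866},
    "translation3": {"words": 3380, "characters": 22119},
}


def relative_length(translation, source, unit):
    """Length of the translation as an index, with the source set to 100."""
    return 100.0 * translation[unit] / source[unit]


indices = {
    name: {
        unit: relative_length(counts, original, unit)
        for unit in ("words", "characters")
    }
    for name, counts in translations.items()
}

for name, idx in indices.items():
    print(f"{name}: words {idx['words']:.1f}, characters {idx['characters']:.1f}")

# Simple average of the three character indices, to level out translator style.
mean_char_index = sum(i["characters"] for i in indices.values()) / len(indices)
print(f"mean character index: {mean_char_index:.1f}")
```

    All three translations fall below 100 on both measures: roughly 78 to
    84 for words and 92 to 99 for characters.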

    Tadeusz Piotrowski

    > -----Original Message-----
    > From: owner-corpora@lists.uib.no
    > [mailto:owner-corpora@lists.uib.no] On Behalf Of Yorick Wilks
    > Sent: Thursday, April 25, 2002 5:56 PM
    > To: James L. Fidelholtz
    > Cc: spela vintar; Andrew Bredenkamp; CORPORA@HD.UIB.NO
    > Subject: Re: Corpora: Relative text length
    >
    >
    >
    > Isn't there some (minor) confusion here? If the question
    > really is relative TEXT length, then nothing to do with word
    > counts will settle it--what matters is character counts,
    > since word length varies considerably between languages. The
    > table showed 1984 in Estonian as having far fewer word tokens
    > in it than the English original, but I'd bet they're much
    > longer ones--how about the texts then? I have no parallel
    > texts with English and E. European languages but I do with
    > the four major W. European ones and the English pages are
    > shorter in every case.
    >
    > Yorick Wilks
    >
    >
    >
    >
    >
    >
    > "James L. Fidelholtz" wrote:
    >
    > > Andrew and Spela:
    > > Just a word of caution: studies like Spela's provide
    > > interesting and suggestive data, but figures will surely vary,
    > > depending on the translator, topic, etc. [all the usual
    > > sociolinguistic caveats apply here] (and note Jean's contribution,
    > > with varying rates). I was coauthor of a study comparing English
    > > and Spanish, which basically tried to get Spanish to fit into the
    > > standard readability curves in a fairly simple way. We were only
    > > partially successful (the counts were hand-done by yours truly,
    > > featuring a variety of types of text, pseudo-randomly sampled, and
    > > especially translations from one language to the other, as well as
    > > translations from 3rd languages [French & German] into each). To the
    > > best of my recollection (I could look up the exact figures if anyone
    > > is hot for them), our results for Spanish-English were rather close
    > > to Jean's for French (I assume his were on large amounts of text
    > > done by
    > > computer--if this holds up [not surprising, given the close
    > > relationship of French and Spanish], it may indicate that, for this
    > > kind of data, not such a huge amount of text is really necessary).
    > >
    > > On Wed, 24 Apr 2002, spela vintar wrote:
    > >
    > > >
    > > >Hi Andrew,
    > > >
    > > >for Eastern-European languages you can compare the lengths of
    > > >Orwell's 1984 and its translations that were collected within the
    > > >Multext-East project. The original Multext project
    > > >(http://www.lpl.univ-aix.fr/projects/multext/)
    > > >should provide the same for English, German, French, Spanish etc.,
    > > >however I wasn't able to find it on their homepage at first glance...
    > > >
    > > >Best,
    > > >Spela
    > > >
    > > >http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
    > > >//////////////
    > > >...
    > > >Below we give an estimate for the number of words, by language. The
    > > >wordcounts were produced by removing the SGML tags from the texts and
    > > >then using a 'wc'-like procedure.
    > > >
    > > > English    104.302
    > > > Romanian   101.460
    > > > Slovene     91.619
    > > > Bulgarian   87.235
    > > > Czech       80.366
    > > > Hungarian   81.147
    > > > Estonian    79.334
    > > >
    > > >
    > > >Andrew Bredenkamp wrote:
    > > >
    > > >> Hello everyone,
    > > >>
    > > >> Does anyone know where I can find a list of relative text length?
    > > >>
    > > >> Taking one language as an index (100), I would like a list of the
    > > >> (other) main European languages - e.g. (made up):
    > > >>
    > > >> Spanish: 100
    > > >> English: 105
    > > >> French: 110
    > > >> German: 85
    > > >>
    > > >> ... etc.
    > > >>
    > > >> Thanks a lot in advance for any help you can give me.
    > > >>
    > > >> Cheers,
    > > >> Andrew
    > > >> =========================================
    > > >> Andrew Bredenkamp
    > > >> acrolinx GmbH
    > > >> URL: www.acrolinx.com
    > > >>
    > > >> =========================================
    > > >
    > > >
    > > >
    > >
    > > --
    > > James L. Fidelholtz e-mail: jfidel@siu.buap.mx
    > > Posgrado en Ciencias del Lenguaje tel.: +(52-2)229-5500 x5705
    > > Instituto de Ciencias Sociales y Humanidades fax: +(01-2) 229-5681
    > > Benemérita Universidad Autónoma de Puebla, MÉXICO
    >
    >


