Re: Corpora: Relative text length

From: Martin Wynne
Date: Fri Apr 26 2002 - 13:51:41 MET DST


    The MULTEXT-EAST corpora are available from the TRACTOR archive.
    For Orwell's 1984 in the original and in translation, I looked at the
    values of the 'extent' element in the headers and got the following
    information:

    Language         Words  Size
    English         104302  928986 bytes
    Bulgarian        87235  2733655 bytes
    Czech            80366  1230804 bytes
    Estonian         79334  1066273 bytes
    Hungarian        81167  1270210 bytes
    Romanian        118093  1272607 bytes
    Slovene          91619  945857 bytes
    Latvian          81956  1051 KB
    Lithuanian       71252  904 KB
    Serbo-Croatian   89749  863 KB
    Russian          76469  2.2 MB

    Please note that the headers also include caveats and explanations regarding
    how the counts were done. Basically, the word counts appear to be a count of
    the number of tokens in the text, while the byte counts generally include
    the header and tags too. Please refer to the actual headers for further
    information and acknowledgements of the researchers involved.
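
    For anyone who wants to compare the figures, here is a minimal Python
    sketch that computes each translation's word count relative to the English
    original, plus a rough bytes-per-word figure. It uses only the languages
    with exact byte counts listed above. The function name relative_lengths
    and the choice of English as the baseline are illustrative assumptions,
    not part of the corpus headers. Because the byte counts generally include
    the header and tags, the bytes-per-word values are only indicative.

```python
# Word and byte counts copied from the 'extent' values listed above.
# Only languages reported with exact byte counts are included.
extents = {
    "English": (104302, 928986),
    "Bulgarian": (87235, 2733655),
    "Czech": (80366, 1230804),
    "Estonian": (79334, 1066273),
    "Hungarian": (81167, 1270210),
    "Romanian": (118093, 1272607),
    "Slovene": (91619, 945857),
}


def relative_lengths(data, baseline="English"):
    """Return word-count ratio to the baseline and bytes per word.

    Hypothetical helper for illustration. Byte counts include header
    and markup, so bytes_per_word overstates the size of the text itself.
    """
    base_words, _ = data[baseline]
    result = {}
    for lang, (words, nbytes) in data.items():
        result[lang] = {
            "word_ratio": words / base_words,
            "bytes_per_word": nbytes / words,
        }
    return result


ratios = relative_lengths(extents)
for lang, r in ratios.items():
    print(f"{lang:<10} {r['word_ratio']:.2f} {r['bytes_per_word']:.1f}")
```

    A word ratio below 1 means the translation has fewer tokens than the
    English text. The high bytes-per-word value for Bulgarian probably
    reflects encoding and markup rather than longer words.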

    Martin Wynne
    Linguistics Officer
    Oxford Text Archive

    Oxford University Computing Services
    13 Banbury Road
    UK - OX2 6NN
    Tel: +44 1865 283299
    Fax: +44 1865 273275
