[Corpora-List] Re: The size of Internet in words

From: Serge Sharoff (s.sharoff@leeds.ac.uk)
Date: Tue Jan 20 2004 - 19:40:44 MET

    It is unusual to have a summary of responses for a query sent a few hours ago.
    Thanks to Thierry Fontenelle. The answer is provided in the special issue of
    Computational Linguistics, Vol 29, No 3. The introduction written by Adam
    Kilgarriff and Greg Grefenstette lists data for 30 different Latin-script
    languages (obtained through AltaVista in March 2001). The answer for English
    is 76,598,718,000 words; German comes second with 7,035,850,000 words and
    French third (3,836,874,000).

    The issue is not yet available via Ingenta, but the introduction is freely
    downloadable from the MIT Press website:
    http://www-mitpress.mit.edu/journals/pdf/coli_29_3_333_0.pdf

    I was interested in Russian data, but they are available from another
    source: Yandex (the major Russian search engine, http://www.yandex.ru )
    indexed 1.5 TB of unique texts (in Russian only), giving in total about 250
    billion words (more than the English figure reported by Kilgarriff and
    Grefenstette, but these are data from Feb 2004). If more recent data are
    available for English and other languages, please let me know.

    Best wishes,
    Serge

    --
    Dr. Serge Sharoff
    Centre for Translation Studies
    School of Modern Languages and Cultures
    University of Leeds
    Leeds, LS2 9JT
    

    tel: +44(0)113 343 7287 fax: +44(0)113 343 3287 WWW: http://www.comp.leeds.ac.uk/ssharoff/


