It is unusual to have a summary of responses for a query sent a few hours ago.
Thanks to Thierry Fontenelle. The answer is provided in the special issue of
Computational Linguistics, Vol 29, No 3. The introduction written by Adam
Kilgarriff and Greg Grefenstette lists data for 30 different Latin-script
languages (obtained through AltaVista in March 2001). The answer for English
is 76,598,718,000 words; German comes second with 7,035,850,000 words,
and French third (3,836,874,000).
The issue is not yet available via Ingenta, but the introduction is freely
downloadable from the MIT Press website:
http://www-mitpress.mit.edu/journals/pdf/coli_29_3_333_0.pdf
I was interested in Russian data, but they are available from another
source: Yandex (the major Russian search engine, http://www.yandex.ru )
indexed 1.5 TB of unique texts (in Russian only), giving in total about 250
billion words (more than the English figure reported by Kilgarriff and
Grefenstette, but these are data from Feb 2004). If more recent data are available for
English and other languages, please let me know.
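
As a rough sanity check on the figures above, here is a minimal sketch in
Python. It only re-uses the numbers quoted in this message; reading "1.5 TB"
as decimal terabytes (10**12 bytes) is my assumption, and the variable names
are purely illustrative.

```python
# Figures quoted in the message; the TB-to-bytes conversion is an assumption.
YANDEX_BYTES = 1_500_000_000_000   # 1.5 TB, assuming decimal terabytes (10**12 bytes)
YANDEX_WORDS = 250_000_000_000     # "about 250 billion words" (Russian, Yandex)
ENGLISH_WORDS = 76_598_718_000     # English (AltaVista, March 2001)

# Average amount of indexed text per word, spaces and punctuation included.
bytes_per_word = YANDEX_BYTES / YANDEX_WORDS

# How much larger the Russian estimate is than the 2001 English estimate.
russian_to_english = YANDEX_WORDS / ENGLISH_WORDS

print(f"bytes per word: {bytes_per_word:.1f}")
print(f"Russian/English ratio: {russian_to_english:.2f}")
```

Under that assumption the Yandex figure works out to about 6 bytes per word,
and the Russian estimate is a bit more than three times the English one. The
two numbers come from different engines and different dates, so the ratio is
only indicative.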
Best wishes,
Serge
--
Dr. Serge Sharoff
Centre for Translation Studies
School of Modern Languages and Cultures
University of Leeds
Leeds, LS2 9JT
tel: +44(0)113 343 7287
fax: +44(0)113 343 3287
WWW: http://www.comp.leeds.ac.uk/ssharoff/