Corpora: Frequency list for Russian

From: Serge Sharoff (
Date: Thu Apr 18 2002 - 09:49:36 MET DST


    The list of most frequent Russian words is available at:

    Currently, Chastotnyj slovarj russkogo jazyka (Zasorina, 1977)
    provides the most widely used frequency list for Russian.
    However, the corpus used by Zasorina is small by modern
    standards (about 1 million words). It is also outdated: it
    mostly covers usage from the 1920s to the 1960s and includes
    a high proportion of ideological sources, such as texts by Lenin
    and Khrushchev and Soviet newspapers, so its word frequencies
    are severely biased. Finally, the Zasorina (1977) list is
    not available electronically.

    The announced list is compiled from a corpus of modern
    Russian fiction and political texts (more than 35 million words).
    The list includes about 33,000 words whose frequency is greater
    than 1 ipm (instances per million words). A shorter selection of
    the 5,000 most frequent words is also available.
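
    As an illustration of the ipm measure, the following is a minimal
    Python sketch, not the procedure actually used to build the list.
    The function name and the example counts are hypothetical; only the
    definition (instances per million words) comes from the text above.

```python
def instances_per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw occurrence count to instances per million words (ipm)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Hypothetical figures: a lemma seen 70 times in a 35-million-word corpus.
print(instances_per_million(70, 35_000_000))  # 2.0
```

    Under this definition, a threshold of 1 ipm in a corpus of 35 million
    words corresponds to 35 occurrences.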

    The structure of the lists follows the template of the lemmatised
    BNC lists produced by Adam Kilgarriff:
    word rank, frequency (in ipm), word, part of speech.
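
    A list in this layout could be read with a few lines of Python. This
    is only a sketch: it assumes one entry per line with the four fields
    separated by whitespace, which the announcement does not state, and
    the names Entry and parse_entry as well as the sample line are
    made up for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int    # position in the frequency list
    ipm: float   # frequency in instances per million words
    word: str    # lemma
    pos: str     # part-of-speech tag


def parse_entry(line: str) -> Entry:
    """Parse one 'rank frequency word pos' line (assumed whitespace-separated)."""
    rank, ipm, word, pos = line.split()
    return Entry(int(rank), float(ipm), word, pos)


# Hypothetical sample line; the values are not taken from the real list.
print(parse_entry("1 1000.5 и conj"))
```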

    In addition, some analytical information about the lexical stock
    is provided, such as coverage of the total language use by word
    bands, e.g. the first 3000 lemmas cover 76.6824% of the total number
    of word forms.
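
    One way such a coverage figure can be derived from ipm values is
    sketched below in Python. It assumes that each ipm value is measured
    against all running words of the corpus, so the top-N values are
    summed and divided by one million; the function name and the toy
    numbers are hypothetical and do not reproduce the figure quoted
    above.

```python
def coverage_percent(ipm_values, top_n):
    """Percentage of running words covered by the top_n most frequent lemmas.

    Assumes every value is an ipm figure relative to the whole corpus,
    so the denominator is one million rather than the sum of the list.
    """
    ranked = sorted(ipm_values, reverse=True)
    return sum(ranked[:top_n]) / 1_000_000 * 100


# Toy data: four lemmas with made-up ipm values.
toy = [300_000.0, 200_000.0, 100_000.0, 50_000.0]
print(coverage_percent(toy, 2))  # 50.0
```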

    The corpus, the tools for working with it, and an aligned
    parallel English-Russian corpus are discussed in the forthcoming paper:
    Sharoff, Serge (2002). Meaning as use: exploitation of aligned
    corpora for the contrastive study of lexical semantics. Proc. of
    Language Resources and Evaluation Conference (LREC02). May, 2002,
    Las Palmas, Spain.
