[Corpora-List] Corpus size & frequency counts

From: Adam Kilgarriff (adam.kilgarriff@itri.brighton.ac.uk)
Date: Wed Oct 09 2002 - 17:02:30 MET DST


    Brett@staff.sakuragaoka.ac.jp writes:
    > I'm doing a frequency count of Japanese vocabulary in post-war Japanese
    > novels. Is there any rough guide to how many times a given word should
    > appear before you can be reasonably confident of its rank? Or
    > alternatively, at a given frequency, any way to calculate the likely range
    > of ranks?
    >

    It's always a good idea to look at distribution as well as
    frequency. Where a word's occurrences are spread across a large
    number of documents (say, 50 or more), and those documents cover the
    genre you want to talk about (so, e.g., they do not all come from the
    same author), you can talk with some confidence about its frequency
    in the text type.

    Where the occurrences mostly come from a small number of documents,
    the issue is more complex. A word like 'goalkeeper' probably feels
    pretty common to any English speaker who is interested in soccer, and
    much less common to anyone who is not. Correspondingly, most
    novels won't mention goalkeepers, but those that do may well mention
    them lots of times. This implies, minimally, that frequency should be
    seen as having two dimensions: one is simply the count, the
    other is the spread.
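
    As a minimal sketch of those two dimensions (an illustration only:
    the function name count_and_spread and the toy documents are made up
    here, and the texts are assumed to be tokenised already), the count
    and the spread can be tallied side by side:

```python
from collections import Counter


def count_and_spread(documents):
    """Map each word to (total occurrences, number of documents containing it).

    `documents` is an iterable of token lists, one list per document.
    """
    total = Counter()   # raw frequency across the whole corpus
    spread = Counter()  # how many distinct documents the word appears in
    for tokens in documents:
        total.update(tokens)
        spread.update(set(tokens))  # each document counts at most once
    return {word: (total[word], spread[word]) for word in total}


# Toy data: 'goalkeeper' is frequent but confined to a single document.
docs = [
    ["the", "goalkeeper", "saved", "the", "goalkeeper"],
    ["the", "novel", "ends"],
    ["the", "rain", "fell"],
]
stats = count_and_spread(docs)
for word, (count, n_docs) in sorted(stats.items()):
    print(f"{word}\tcount={count}\tdocuments={n_docs}")
```

    A word with a high count but a spread of only one or two documents
    is the 'goalkeeper' case: its rank by raw count says little about
    the text type as a whole.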

    Issues include what counts as "the same document" (two articles from
    the same magazine? two chapters from the same book?), and what to do
    about specialist subcorpora within the text type you are interested in
    (e.g. multiple articles from the same journal or by the same author; see
    my mailing re words like 'colitis' in the BNC from a few weeks ago).

    See also the Corpora mailing by Ken Church a couple of months back;
    there is also his paper on "two Noriegas". Another very good paper is
    by Slava Katz:

    @Article{katz:96,
      author = "Slava Katz",
      title = "Distribution of content words and phrases in text
                      and language modelling",
      journal = "Natural Language Engineering",
      year = 1996,
      volume = 2,
      number = 1,
      pages = "15--60"
    }

    Regards,

            Adam Kilgarriff

    -- 
    NEW!! MSc and Short Courses in Lexical Computing and Lexicography
    Info at http://www.itri.brighton.ac.uk/lexicom

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Adam Kilgarriff
    Senior Research Fellow                     tel: (44) 1273 642919
    Information Technology Research Institute       (44) 1273 642900
    University of Brighton                     fax: (44) 1273 642908
    Lewes Road
    Brighton BN2 4GJ     email: Adam.Kilgarriff@itri.bton.ac.uk
    UK                   http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


