Re: Corpora: Neologisms in Japanese

From: Masaaki NAGATA (nagata@nttnly.isl.ntt.co.jp)
Date: Wed Apr 18 2001 - 11:36:19 MET DST


    > I have been trying in vain to find statistical data or literature (in
    > Japanese or other languages) on the following topic:
    >
    > What is the percentage of words written in katakana or kanji respectively
    > among Japanese neologisms since 1945 and especially during the last decade
    > (1991-2000)?

    I once computed the relative frequency of each character type in the
    EDR corpus, broken down by source. The results are attached at the
    end of this mail.
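
    A minimal sketch of how such frequencies could be computed, assuming
    they are counted per character and that character types are assigned
    by Unicode code point range. This is not the script used for the
    attached figures; the names char_type and relative_frequencies and
    the exact ranges are assumptions made for illustration.

```python
from collections import Counter

# Column order follows the attached table.
TYPES = ("alpha", "hira", "kan", "kata", "num", "sym")


def char_type(ch: str) -> str:
    """Assign one character to a coarse type (assumed classification scheme)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:                              # hiragana block
        return "hira"
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:    # katakana, incl. half-width
        return "kata"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:    # CJK unified ideographs
        return "kan"
    if (ch.isascii() and ch.isalpha()) or 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
        return "alpha"                                      # ASCII and full-width letters
    if (ch.isascii() and ch.isdigit()) or 0xFF10 <= cp <= 0xFF19:
        return "num"                                        # ASCII and full-width digits
    return "sym"                                            # everything else


def relative_frequencies(text: str) -> dict:
    """Return the share of each character type, ignoring whitespace."""
    counts = Counter(char_type(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in TYPES}


if __name__ == "__main__":
    sample = "私はコンピュータを2台買った。"
    for name, share in relative_frequencies(sample).items():
        print(f"{name:5s} {share:.3f}")
```

    Punctuation inside the katakana block (such as the middle dot) is
    counted as katakana in this sketch; a real tokenizer-based count
    might treat it differently.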

    The EDR corpus is one of the largest annotated Japanese corpora. An
    English description is available at http://www.iijnet.or.jp/edr/.

    My gut feeling is that the proportion of kanji and katakana in
    Japanese text depends greatly on the topic. If the topic is
    relatively new or of Western origin, such as computer science,
    there are a lot of katakana words.

    The proportion of kanji/katakana to hiragana is also related to the
    difficulty of the text. The more hiragana words are used, the
    plainer the text is. So textbooks for children have more hiragana
    words than newspapers do.

    > What is the largest corpus of Japanese words and proper names that can be
    > accessed online?

    If you are looking for a free corpus and a free dictionary, I think
    the IPAL dictionary and the IPAL corpus are the largest ones. There
    is a Japanese description at the following URL.

    http://www.ipa.go.jp/STC/NIHONGO/IPAL/ipal.html

    There is also a comprehensive list of language resources at Prof.
    Matsumoto's lab at the Nara Advanced Institute of Science and
    Technology. Unfortunately, it is written only in Japanese.

    http://cactus.aist-nara.ac.jp/lab/resource/resource.html

    Web translation services such as the following might help you a little
    bit.

    http://sangenjaya.arc.net.my/index-e.html

    -Masaaki

    -----
    Masaaki NAGATA, NTT Cyber Space Laboratories
    1-1 Hikarinooka Yokosuka-Shi Kanagawa 239-0847 Japan
    Email: nagata@nttnly.isl.ntt.co.jp Tel: +81-468-59-2796 Fax: +81-468-59-4758

    ----------------------------------------------------------------------
                              alpha   hira  kanji   kata    num    sym
    Aera (magazine)           0.003  0.463  0.354  0.080  0.025  0.076
    Iwanami Info. Sci. Dict.  0.020  0.401  0.372  0.130  0.005  0.072
    Magazines                 0.042  0.387  0.255  0.196  0.030  0.090
    Asahi newspaper           0.002  0.456  0.391  0.059  0.022  0.069
    Nikkei newspaper          0.001  0.512  0.369  0.051  0.000  0.067
    Heibonsha encyclopedia    0.003  0.443  0.403  0.080  0.008  0.062
    Sentence Examples         0.000  0.603  0.293  0.027  0.000  0.077

    Aera: 49589 sentences
    Weekly magazine published by Asahi Shinbun (newspaper). Similar to Newsweek.

    Iwanami Information Science Dictionary: 13578 sentences
    Computer science dictionary published by Iwanami Shoten (publisher).

    Magazines: 21199 sentences
    Miscellaneous collection of magazines.

    Asahi newspaper: 91400 sentences
    One of the most popular national newspapers in Japan.

    Nikkei newspaper: 5018 sentences
    One of the most popular economic newspapers. Similar to the Wall Street Journal.

    Heibonsha encyclopedia: 10072 sentences
    One of the largest Japanese encyclopedias. Heibonsha is the name of
    the publisher.

    Sentence Examples: 16946 sentences
    These sentences seem to be taken from dictionaries, but I don't know
    where they come from.

    ----------------------------------------------------------------------
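
    As a small illustration of the topic dependence discussed earlier in
    this mail, the table can be loaded and ranked by any one character
    type. This is only a sketch: the values are copied from the table,
    while the names SHARES and rank_by are hypothetical and chosen for
    this example.

```python
# Relative frequency of each character type per source, copied from the table.
COLUMNS = ("alpha", "hira", "kan", "kata", "num", "sym")
ROWS = {
    "Aera (magazine)":          (0.003, 0.463, 0.354, 0.080, 0.025, 0.076),
    "Iwanami Info. Sci. Dict.": (0.020, 0.401, 0.372, 0.130, 0.005, 0.072),
    "Magazines":                (0.042, 0.387, 0.255, 0.196, 0.030, 0.090),
    "Asahi newspaper":          (0.002, 0.456, 0.391, 0.059, 0.022, 0.069),
    "Nikkei newspaper":         (0.001, 0.512, 0.369, 0.051, 0.000, 0.067),
    "Heibonsha encyclopedia":   (0.003, 0.443, 0.403, 0.080, 0.008, 0.062),
    "Sentence Examples":        (0.000, 0.603, 0.293, 0.027, 0.000, 0.077),
}
SHARES = {src: dict(zip(COLUMNS, vals)) for src, vals in ROWS.items()}


def rank_by(char_type: str) -> list:
    """Return source names sorted by the given character type, highest first."""
    return sorted(SHARES, key=lambda src: SHARES[src][char_type], reverse=True)


if __name__ == "__main__":
    for column in ("kata", "hira"):
        print(f"Sources ranked by {column} share:")
        for src in rank_by(column):
            print(f"  {src:26s} {SHARES[src][column]:.3f}")
```

    Ranking by katakana puts the magazine collection and the computer
    science dictionary at the top, and ranking by hiragana puts the
    dictionary example sentences first.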


