[Corpora-List] identifying words in Japanese

From: Christoph Neumann (neumann@nova.co.jp)
Date: Wed Jun 18 2003 - 04:40:36 MET DST


    Good morning from Tokyo,

    To my mind, the variation in the writing system is not such a big
    problem for word-frequency counts. One has to combine several
    approaches, though.
    Most Japanese words have a canonical spelling, written either in
    hiragana or in kanji, or, for "kun-yomi" verbs, adjectives and their
    derived nouns, in a canonical combination of kanji and okurigana. This
    standard orthography is suggested automatically by Japanese word
    processors, which is a good way to check it if you are not sure.
    Variation in the mix of kanji/hiragana seems to occur only in a
    restricted part of the vocabulary, namely with compound verbs and their
    derived nouns (as in the hikkoshi example). Even there, there is
    normally one dominant preference (word processor!). As the general
    pattern is always Kanji-Hiragana-Kanji-Hiragana, one could think of a
    dynamic solution that identifies all variations.
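
    A minimal sketch of what such a dynamic solution could look like, in
    Python. It assumes each word is already split into (kanji, reading,
    okurigana) segments; that representation and all function names are
    hypothetical, chosen only for illustration. Each kanji may be replaced
    by its hiragana reading, and okurigana after a kanji may be dropped.
    The sketch deliberately overgenerates (it also yields spellings that
    may never occur), which is harmless if the variants are only used to
    map tokens back to one canonical form.

```python
from itertools import product

# Hypothetical segment representation: (kanji, reading of the kanji, okurigana).
# "hikkoshi" in its canonical spelling is 引っ越し.
HIKKOSHI = [("引", "ひ", "っ"), ("越", "こ", "し")]


def segment_forms(kanji, reading, okurigana):
    """Return the possible spellings of one kanji+okurigana segment."""
    forms = {kanji + okurigana, reading + okurigana}
    if okurigana:
        # Okurigana may be omitted only when the kanji itself is written.
        forms.add(kanji)
    return forms


def spelling_variants(segments):
    """Combine the per-segment spellings into every whole-word variant."""
    per_segment = [sorted(segment_forms(*seg)) for seg in segments]
    return {"".join(parts) for parts in product(*per_segment)}


def canonical_form(segments):
    """Canonical spelling: every kanji followed by its okurigana."""
    return "".join(kanji + okurigana for kanji, _, okurigana in segments)


variants = spelling_variants(HIKKOSHI)
# Lookup table mapping each variant to the canonical spelling.
lookup = {variant: canonical_form(HIKKOSHI) for variant in variants}
```

    The five spellings listed in the quoted message below are all in
    `variants`, and `lookup` maps each of them to 引っ越し, so their
    frequencies can be added up under one entry.
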
    Only (and fortunately) very frequent words seem to have real
    (unpredictable) variation, like "watashi" ("I"). As those words are
    limited in number, one can account for them beforehand by explicitly
    defining several variation sets, or simply add up their counts manually
    after looking at the top 100 or so words in the ranking.
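
    The explicit variation sets could be applied to a frequency list
    roughly as follows. This is only a sketch: the variation set, the
    function name and the counts are made-up illustrations, not data from
    a real corpus.

```python
from collections import Counter

# Hypothetical, hand-defined variation sets: canonical key -> known spellings.
VARIATION_SETS = {
    "私": {"私", "わたし"},  # "watashi" ("I")
}


def merge_variants(counts, variation_sets):
    """Add up the counts of all spellings that belong to one variation set."""
    to_canonical = {
        spelling: canonical
        for canonical, spellings in variation_sets.items()
        for spelling in spellings
    }
    merged = Counter()
    for token, n in counts.items():
        # Tokens outside every variation set are kept under their own spelling.
        merged[to_canonical.get(token, token)] += n
    return merged


# Illustrative counts only.
raw_counts = Counter({"私": 120, "わたし": 45, "東京": 30})
merged = merge_variants(raw_counts, VARIATION_SETS)
top_words = merged.most_common(100)  # the "top 100 or so" to inspect by hand
```

    Inspecting `top_words` by hand is then enough to spot further frequent
    words that need a variation set of their own.
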

    Christoph Neumann

    Brett Reynolds wrote:

    > In Japanese, often words are written in a mixture of two scripts:
    > kanji (logographs) and hiragana (syllabary). For example, where
    > upper-case letters indicate kanji, lower-case represent hiragana, and
    > a space indicates character boundaries, you might find the following
    > word:

    >
    > HI k KO shi
    >
    > Unfortunately, anything that's written in kanji can alternatively be
    > written using hiragana.
    >
    > hi k ko shi
    >
    > Further complicating the problem, sometimes hiragana occurring after a
    > kanji (okurigana) are omitted or assumed.
    >
    > HIK KOSHI
    > HI k KOSHI
    > HIK KO shi
    >
    > Thus, a word like this can be written five different ways. Given all
    > this, how would one go about doing a word-frequency count in Japanese?
    > One option is to standardize everything to hiragana (doable). The
    > problem with this is that you then end up with a high percentage of
    > homographic heteronyms (they would be heterographic, were they written
    > in kanji).
    >

    -- 
    Dr. Christoph Neumann 		neumann@crosslanguage.co.jp
    R&D MT, CrossLanguage KK
    Tokyo, Japan
    http://www.crosslanguage.co.jp/english/index.html
    


