[Corpora-List] Identifying words in Japanese

From: Brett Reynolds (brett@forsyths.ca)
Date: Tue Jun 17 2003 - 17:18:01 MET DST


    In Japanese, words are often written in a mixture of two scripts: kanji
    (logographs) and hiragana (a syllabary). For example, where upper-case
    letters indicate kanji, lower-case letters represent hiragana, and a
    space indicates a character boundary, you might find the following word:

    HI k KO shi

    Unfortunately, anything that's written in kanji can alternatively be
    written using hiragana.

    hi k ko shi

    Further complicating the problem, sometimes hiragana occurring after a
    kanji (okurigana) are omitted or assumed.

    HIK KOSHI
    HI k KOSHI
    HIK KO shi

    Thus, a word like this can be written five different ways. Given all
    this, how would one go about doing a word-frequency count in Japanese?
    One option is to standardize everything to hiragana (doable). The
    problem with this is that you then end up with a high percentage of
    homographs: distinct words that would be heterographic, were they
    written in kanji.
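
    To make the trade-off concrete, here is a minimal sketch of counting
    by hiragana reading. It assumes the text is already split into words
    and that a reading is available for each surface form. The READINGS
    table and the count_by_reading function are hypothetical, invented
    for this illustration; in practice the readings would have to come
    from a dictionary or a morphological analyser. The sample word is
    hikkoshi ("moving house") in the five spellings described above, and
    hashi ("bridge" / "chopsticks") stands in for the homograph problem.

```python
from collections import Counter

# Hypothetical toy lexicon: surface form -> hiragana reading.
# A real system would need a dictionary or morphological analyser here.
READINGS = {
    "引っ越し": "ひっこし",  # HI k KO shi
    "ひっこし": "ひっこし",  # hi k ko shi
    "引越": "ひっこし",      # HIK KOSHI
    "引っ越": "ひっこし",    # HI k KOSHI
    "引越し": "ひっこし",    # HIK KO shi
    "橋": "はし",            # "bridge"
    "箸": "はし",            # "chopsticks"
}


def count_by_reading(tokens):
    """Count tokens after mapping each surface form to its reading.

    Tokens missing from the lexicon are counted under their own spelling.
    """
    return Counter(READINGS.get(token, token) for token in tokens)


# Assumed input: text that has already been segmented into words.
tokens = ["引っ越し", "引越", "ひっこし", "引越し", "引っ越", "橋", "箸", "箸"]

surface_counts = Counter(tokens)            # one entry per spelling
reading_counts = count_by_reading(tokens)   # one entry per reading

for reading, n in reading_counts.most_common():
    print(reading, n)
```

    Counting by reading merges the five spellings of hikkoshi into one
    entry, which is the desired effect, but it also merges "bridge" and
    "chopsticks" into a single hashi entry, which is the homograph
    problem described above.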

    Any other ideas?

    And a related question: does anyone have an extensive list of Japanese
    transitive / intransitive verb pairs?

    -----------------------
    Brett Reynolds
    Ontario, Canada
    brett@forsyths.ca


