[Corpora-List] identifying words in Japanese

From: Christoph Neumann (neumann@nova.co.jp)
Date: Wed Jun 18 2003 - 04:40:36 MET DST

Next message: Tony Rose: "RE: [Corpora-List] text categorisation - newspaper"

Previous message: Timothy Baldwin: "Re: [Corpora-List] Identifying words in Japanese"

Good morning from Tokyo,

To my mind, the variation in the writing system is not such a big
problem for word frequency counts. One has to combine several
approaches, though.
Most Japanese words have a canonical spelling: either a fixed choice
between hiragana and kanji or, for "kun-yomi" verbs, adjectives and
their derived nouns, a canonical combination of kanji and okurigana.
This standard orthography is automatically suggested by Japanese word
processors, which is a good way to check it if you are not sure.
Variation in the mix of kanji/hiragana seems to occur only in a
restricted part of the vocabulary, namely with compound verbs and their
derived nouns (as in the hikkoshi example). Even there, there is
normally one dominant preference (word processor!). As the general
pattern is always Kanji-Hiragana-Kanji-Hiragana, one might think of a
dynamic solution that identifies all variations.
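
A rough sketch of what such a dynamic solution might look like, in
Python. It assumes that each word has already been split by hand into
(kanji, reading, okurigana) segments; the function name
spelling_variants and the segmentation of "hikkoshi" are illustrative
assumptions, not part of any existing tool.

```python
from itertools import product

def spelling_variants(segments):
    """Enumerate spellings of a Kanji-Hiragana-Kanji-Hiragana word.

    segments is a list of (kanji, reading, okurigana) tuples.
    This segmentation is an assumption: it has to be supplied by hand
    or taken from a dictionary.
    """
    options = []
    for kanji, reading, okurigana in segments:
        # Each segment is written in kanji or spelled out in hiragana.
        forms = {kanji + okurigana, reading + okurigana}
        if okurigana:
            # Okurigana may be omitted after a kanji.
            forms.add(kanji)
        options.append(sorted(forms))
    # Combine every choice for every segment.
    return {"".join(parts) for parts in product(*options)}

# Hand-made segmentation of "hikkoshi" (HI k KO shi).
HIKKOSHI = [("引", "ひ", "っ"), ("越", "こ", "し")]
variants = spelling_variants(HIKKOSHI)
```

For this segmentation the set contains nine strings, among them the
five spellings listed in the quoted message. Counts for all of them can
then be added to the count of the dominant form.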
Fortunately, only very frequent words seem to have real
(unpredictable) variation, like "watashi" ("I"). As those words are
limited in number, one can account for them beforehand by explicitly
defining several variation sets, or simply add up their scores manually
after having a look at the top 100 or so ranking words.
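
A minimal sketch of the variation-set approach, again in Python. It
assumes the text has already been tokenized and counted; the names
VARIATION_SETS and merge_counts, and the single example set for
"watashi", are hypothetical and would need to be filled in from the top
of the actual frequency list.

```python
from collections import Counter

# Hand-defined variation sets: canonical form -> accepted spellings.
# Only one illustrative entry; a real list is an assumption left open.
VARIATION_SETS = {
    "私": {"私", "わたし", "ワタシ"},
}

# Reverse lookup: every spelling points to its canonical form.
CANONICAL = {
    spelling: canon
    for canon, spellings in VARIATION_SETS.items()
    for spelling in spellings
}

def merge_counts(token_counts):
    """Add up the counts of all spellings in the same variation set."""
    merged = Counter()
    for token, n in token_counts.items():
        merged[CANONICAL.get(token, token)] += n
    return merged
```

Tokens outside any variation set keep their own count, so the table
only needs entries for the few frequent words that really vary.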

Christoph Neumann

Brett Reynolds wrote:

> In Japanese, often words are written in a mixture of two scripts:
> kanji (logographs) and hiragana (syllabary). For example, where
> upper-case letters indicate kanji, lower-case represent hiragana, and
> a space indicates character boundaries, you might find the following
> word:
>
> HI k KO shi
>
> Unfortunately, anything that's written in kanji can alternatively be
> written using hiragana.
>
> hi k ko shi
>
> Further complicating the problem, sometimes hiragana occurring after a
> kanji (okurigana) are omitted or assumed.
>
> HIK KOSHI
> HI k KOSHI
> HIK KO shi
>
> Thus, a word like this can be written five different ways. Given all
> this, how would one go about doing a word-frequency count in Japanese?
> One option is to standardize everything to hiragana (doable). The
> problem with this is that you then end up with a high percentage of
> homographic heteronyms (they would be heterographic, were they written
> in kanji).
>

-- 
Dr. Christoph Neumann 		neumann@crosslanguage.co.jp
R&D MT, CrossLanguage KK
Tokyo, Japan
http://www.crosslanguage.co.jp/english/index.html

This archive was generated by hypermail 2b29 : Wed Jun 18 2003 - 04:44:04 MET DST