Re: [Corpora-List] Identifying words in Japanese

From: Timothy Baldwin (tbaldwin@csli.stanford.edu)
Date: Wed Jun 18 2003 - 03:24:39 MET DST

Next message: Christoph Neumann: "[Corpora-List] identifying words in Japanese"

Previous message: cyrille: "Re: [Corpora-List] list of stopwords for french"
In reply to: Brett Reynolds: "[Corpora-List] Identifying words in Japanese"

Hi,

> In Japanese, often words are written in a mixture of two scripts: kanji
> (logographs) and hiragana (syllabary). For example, where upper-case
> letters indicate kanji, lower-case represent hiragana, and a space
> indicates character boundaries, you might find the following word:
>
> HI k KO shi
>
> Unfortunately, anything that's written in kanji can alternatively be
> written using hiragana.
>
> hi k ko shi
>
> Further complicating the problem, sometimes hiragana occurring after a
> kanji (okurigana) are omitted or assumed.
>
> HIK KOSHI
> HI k KOSHI
> HIK KO shi
>
> Thus, a word like this can be written five different ways. Given all
> this, how would one go about doing a word-frequency count in Japanese?
> One option is to standardize everything to hiragana (doable). The
> problem with this is that you then end up with a high percentage of
> homographic heteronyms (they would be heterographic, were they written
> in kanji).

In fact, the problem is even worse than you describe it. It is possible to
generate four more variants by writing only the first or only the second kanji
as kanji, and having the remainder of the string in hiragana:

HI k ko shi
HIK ko shi
hi k KO shi
hi k KOSHI
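
To see where the nine spellings come from (your five plus these four), note
that each kanji can appear with its okurigana, without it, or entirely in
hiragana, and the variants are the cross product of those choices. A minimal
sketch, assuming the word is 引っ越し (hikkoshi), with 引 for HI/HIK and 越 for
KO/KOSHI; the lists `first` and `second` are written by hand for illustration:

```python
from itertools import product

# Three renderings of each half of the word:
# kanji + okurigana, kanji alone, all hiragana.
first = ["引っ", "引", "ひっ"]    # HI k, HIK, hi k
second = ["越し", "越", "こし"]   # KO shi, KOSHI, ko shi

# Every combination of a first-half and a second-half rendering.
variants = ["".join(pair) for pair in product(first, second)]

for v in variants:
    print(v)
```

This only enumerates the possible spellings; it says nothing about how often
each one actually occurs.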

All four of these are attested on Google, if only at low frequencies. You
also get a lot of variability in transliterated words: in the length of vowels
(e.g. koNpyuuta vs. koNpyuutaa for "computer"), in orthographic-
vs. phonemic-style transliteration (e.g. meetoru vs. meetaa for "metre"),
in the source dialect (e.g. bodii vs. badii for "body") and in consonant
forms (e.g. naiibu vs. naiivu for "naive").
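
Of these, the vowel-length case is the easiest to paper over. A rough sketch,
assuming the variation is confined to a word-final long-vowel mark (ー); the
function name and the length threshold are my own choices for illustration,
and the other three kinds of variation are not handled at all:

```python
def strip_final_long_vowel(word: str) -> str:
    """Remove one trailing long-vowel mark (U+30FC) from a katakana word."""
    # Assumption: leave short words alone, since in e.g. キー ("key") the
    # mark is not optional. The cut-off of 3 characters is arbitrary.
    if len(word) > 3 and word.endswith("ー"):
        return word[:-1]
    return word

print(strip_final_long_vowel("コンピューター"))
print(strip_final_long_vowel("コンピュータ"))
```

Both spellings of "computer" then collapse to the same string, at the cost of
occasionally merging words that really are distinct.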

The problem is not unlike spelling inconsistency in English, just much more
rampant and less constrained by prescriptive norms. It's surprising how much spelling variation you
get in English corpora, even post-edited corpora such as the infamous WSJ, but
it has been ignored as a problem in English for the most part.

That aside, there are a couple of possible workarounds. First, you could use a
morphological analyser which canonicalises wordforms, and solves the problem
for you. ChaSen and Juman canonicalise only conjugating words (i.e. verbs and
adjectives) to their base forms in much the same way that English lemmatisers
canonicalise wordforms. This is only a very partial solution to the problem
you mention, as the base form is conditioned on the original wordform such
that HI k KO shi ta and HIK KO shi ta (both meaning "moved") would be
canonicalised to HI k KO su and HIK KO su, respectively. Also, non-conjugating
words such as your hikkoshi example would not be canonicalised. The only
analyser I know of which performs the extra step of base form canonicalisation
of all words (including attempting to convert transliterated words into the
wordform in the source language) is ALTJAWS, which is unfortunately
proprietary.

A more realistic, if primitive and noisy, way of canonicalising words would be
to strip them of hiragana (in the case that they contain kanji or katakana
characters). That is, HI k KO shi would have the hiragana k and shi removed to leave
HI KO (which is then equivalent to HIK KOSHI, HI KOSHI and HIK KO, based on
your original description above). This would lead to a proliferation of
homographic heteronyms, perhaps comparable to converting everything into
hiragana. Also, HIK KOSHI and HIK ko shi would end up as different lemmata.
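
The hiragana-stripping idea can be sketched in a few lines. This assumes
Unicode input, takes the hiragana, katakana and basic CJK ideograph blocks as
a rough stand-in for the three scripts, and again uses 引っ越し as the example
word; the function name is made up for illustration:

```python
import re

HIRAGANA = re.compile(r"[\u3041-\u309f]")
# Katakana block plus the basic CJK Unified Ideographs block.
KANJI_OR_KATAKANA = re.compile(r"[\u30a1-\u30ff\u4e00-\u9fff]")

def strip_hiragana(word: str) -> str:
    """Drop hiragana, but only from words that contain kanji or katakana."""
    if KANJI_OR_KATAKANA.search(word):
        return HIRAGANA.sub("", word)
    return word  # all-hiragana words are left untouched

for w in ["引っ越し", "引越", "引っ越", "引越し", "引こし", "ひっこし"]:
    print(w, "->", strip_hiragana(w))
```

The first four spellings come out as 引越, whereas 引こし is reduced to 引 and
ひっこし is left as it is, which is the mismatch noted above.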

A third alternative would perhaps be to canonicalise words according to their
non-hiragana content (as above) *and also* their hiragana-based reading
(i.e. as a bigram such as <hikkoshi,HIK KOSHI>). This would reduce the
proliferation of homographic heteronyms, but still not be able to deal with
the HIK KOSHI/HIK ko shi pair above. It would, however, be able to distinguish
non-homophonic homographs such as KATA vs. HOU.
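
As a sketch of the bigram approach, the count below is keyed on the pair
<reading, non-hiragana content>. The readings are assumed to come from a
morphological analyser; here they are simply written in by hand, and the
`tokens` list and function name are hypothetical:

```python
import re
from collections import Counter

HIRAGANA = re.compile(r"[\u3041-\u309f]")

def bigram_key(surface, reading):
    """Pair the hiragana reading with the surface form minus its hiragana."""
    # An all-hiragana surface form yields an empty second element.
    return (reading, HIRAGANA.sub("", surface))

# Hand-written (surface, reading) pairs standing in for analyser output.
tokens = [
    ("引っ越し", "ひっこし"),
    ("引越", "ひっこし"),
    ("引越し", "ひっこし"),
    ("引こし", "ひっこし"),
    ("方", "かた"),
    ("方", "ほう"),
]

counts = Counter(bigram_key(s, r) for s, r in tokens)
for key, n in counts.items():
    print(key, n)
```

The three kanji-only-differing spellings share one key, 引こし still gets a key
of its own, and the two readings of 方 (KATA and HOU) are counted separately.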

I guess one question you need to ask yourself in all this is what you consider
as a "word", and to what degree you consider canonicalisation
necessary. Certainly, I don't know of any "word counts" which attempt to do
much about homography and spelling variation in the manner you suggest.

> And a related question: does anyone have an extensive list of Japanese
> transitive / intransitive verb pairs?

Take a look at:

http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html

Tim
