Re: [Corpora-List] Identifying words in Japanese

From: Timothy Baldwin (tbaldwin@csli.stanford.edu)
Date: Wed Jun 18 2003 - 03:24:39 MET DST

    Hi,

    > In Japanese, often words are written in a mixture of two scripts: kanji
    > (logographs) and hiragana (syllabary). For example, where upper-case
    > letters indicate kanji, lower-case represent hiragana, and a space
    > indicates character boundaries, you might find the following word:
    >
    > HI k KO shi
    >
    > Unfortunately, anything that's written in kanji can alternatively be
    > written using hiragana.
    >
    > hi k ko shi
    >
    > Further complicating the problem, sometimes hiragana occurring after a
    > kanji (okurigana) are omitted or assumed.
    >
    > HIK KOSHI
    > HI k KOSHI
    > HIK KO shi
    >
    > Thus, a word like this can be written five different ways. Given all
    > this, how would one go about doing a word-frequency count in Japanese?
    > One option is to standardize everything to hiragana (doable). The
    > problem with this is that you then end up with a high percentage of
    > homographic heteronyms (they would be heterographic, were they written
    > in kanji).

    In fact, the problem is even worse than you describe. It is possible to
    generate four more variants by writing only the first or only the second
    kanji and leaving the remainder of the string in hiragana:

    HI k ko shi
    HIK ko shi
    hi k KO shi
    hi k KOSHI

    All of these are attested, if at low frequency counts, through Google. You
    also get a lot of variability in transliterated words, in the length of vowels
    (e.g. koNpyuuta vs. koNpyuutaa for "computer"), orthographic-
    vs. phonemic-style transliteration (e.g. meetoru vs. meetaa for "metre"),
    dialect dependence (e.g. bodii vs. badii for "body") and consonant
    forms (e.g. naiibu vs. naiivu for "naive").
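
    To make the kanji/hiragana combinatorics discussed above concrete, here is
    a minimal sketch in the thread's own notation (upper case = kanji, lower
    case = hiragana, space = character boundary). It assumes that each of the
    two kanji-bearing segments of "hikkoshi" can independently be written as
    kanji plus okurigana, as kanji alone, or entirely in hiragana. The names
    SEGMENT_OPTIONS and spelling_variants are illustrative only.

```python
from itertools import product

# Spelling options for each kanji-bearing segment of "hikkoshi".
# Assumption: the three options per segment combine freely.
SEGMENT_OPTIONS = [
    ["HI k", "HIK", "hi k"],        # kanji + okurigana, kanji only, hiragana
    ["KO shi", "KOSHI", "ko shi"],  # same three options for the second segment
]


def spelling_variants(segment_options):
    """Return every combination of per-segment spellings as one string."""
    return [" ".join(choice) for choice in product(*segment_options)]


variants = spelling_variants(SEGMENT_OPTIONS)
for variant in variants:
    print(variant)
print(len(variants), "variants in total")
```

    Under that assumption the sketch yields the five spellings from the
    original question plus the four listed above.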

    The problem is not unlike spelling inconsistency in English, just much more
    rampant and less prescriptive. It's surprising how much spelling variation you
    get in English corpora, even post-edited corpora such as the infamous WSJ, but
    it has been ignored as a problem in English for the most part.

    That aside, there are a couple of possible workarounds. First, you could use a
    morphological analyser which canonicalises wordforms, and solves the problem
    for you. ChaSen and Juman canonicalise only conjugating words (i.e. verbs and
    adjectives) to their base forms in much the same way that English lemmatisers
    canonicalise wordforms. This is only a very partial solution to the problem
    you mention, as the base form is conditioned on the original wordform such
    that HI k KO shi ta and HIK KO shi ta (both meaning "moved") would be
    canonicalised to HI k KO su and HIK KO su, respectively. Also, non-conjugating
    words such as your hikkoshi example would not be canonicalised. The only
    analyser I know of which performs the extra step of base form canonicalisation
    of all words (including attempting to convert transliterated words into the
    wordform in the source language) is ALTJAWS, which is unfortunately
    proprietary.

    A more realistic, if primitive and noisy, way of canonicalising words would be
    to strip them of hiragana (in the case that they contain kanji or katakana
    characters). I.e. HI k KO shi would have hiragana k and shi removed to leave
    HI KO (which is then equivalent to HIK KOSHI, HI KOSHI and HIK KO, based on
    your original description above). This would lead to a proliferation of
    homographic heteronyms, perhaps comparable to converting everything into
    hiragana. Also, HIK KOSHI and HIK ko shi would end up as different lemmata.
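
    The hiragana-stripping idea can be sketched roughly as follows. This is an
    illustration under stated assumptions, not a tested tool: the input is
    assumed to be already segmented into words, kanji are approximated by the
    basic CJK Unified Ideographs block, and the function name strip_hiragana
    and the sample word list are made up for the example.

```python
import re
from collections import Counter

# Unicode ranges (an approximation; rare kanji outside the basic block
# and the iteration mark U+3005 are not covered).
HIRAGANA = re.compile(r"[\u3041-\u309f]")
KANJI_OR_KATAKANA = re.compile(r"[\u30a0-\u30ff\u4e00-\u9fff]")


def strip_hiragana(word):
    """Drop hiragana from a word, but only if it contains kanji or katakana."""
    if KANJI_OR_KATAKANA.search(word):
        return HIRAGANA.sub("", word)
    return word  # all-hiragana words are left untouched


# Hypothetical pre-segmented tokens: spelling variants of "hikkoshi".
tokens = ["引っ越し", "引越し", "引っ越", "引越", "引っこし", "ひっこし"]
counts = Counter(strip_hiragana(token) for token in tokens)
for form, freq in counts.most_common():
    print(form, freq)
```

    The four variants that keep both kanji collapse to a single form, while
    the variant with only the first kanji and the all-hiragana variant remain
    separate entries, which is the limitation noted above.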

    A third alternative would perhaps be to canonicalise words according to their
    non-hiragana content (as above) *and also* their hiragana-based reading
    (i.e. as a pair such as <hikkoshi, HIK KOSHI>). This would reduce the
    proliferation of homographic heteronyms, but still not be able to deal with
    the HIK KOSHI/HIK ko shi pair above. It would, however, be able to distinguish
    non-homophonic homographs such as KATA vs. HOU.
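
    A rough sketch of this reading-plus-skeleton key follows. It assumes each
    token arrives with a hiragana reading, for instance from a morphological
    analyser; here the readings are simply written by hand. The function name
    canonical_key and the sample data are hypothetical.

```python
import re
from collections import Counter

HIRAGANA = re.compile(r"[\u3041-\u309f]")
KANJI_OR_KATAKANA = re.compile(r"[\u30a0-\u30ff\u4e00-\u9fff]")


def canonical_key(surface, reading):
    """Key a token by (reading, non-hiragana content of the surface form)."""
    if KANJI_OR_KATAKANA.search(surface):
        skeleton = HIRAGANA.sub("", surface)
    else:
        skeleton = surface  # all-hiragana words keep their surface form
    return (reading, skeleton)


# Hypothetical (surface, reading) pairs; readings are assumed to be given.
tokens = [
    ("引っ越し", "ひっこし"),
    ("引越し", "ひっこし"),
    ("引越", "ひっこし"),
    ("引こし", "ひっこし"),  # only the first kanji is used
    ("方", "かた"),          # KATA
    ("方", "ほう"),          # HOU
]
counts = Counter(canonical_key(surface, reading) for surface, reading in tokens)
for key, freq in counts.items():
    print(key, freq)
```

    In this sketch the two readings of the same kanji end up under different
    keys, while the variant with only the first kanji still fails to merge
    with the two-kanji variants.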

    I guess one question you need to ask yourself in all this is what you consider
    as a "word", and to what degree you consider canonicalisation
    necessary. Certainly, I don't know of any "word counts" which attempt to do
    much about homography and spelling variation in the manner you suggest.

    > And a related question: does anyone have an extensive list of Japanese
    > transitive / intransitive verb pairs?

    Take a look at:

    http://www.csse.monash.edu.au/~jwb/afaq/jitadoushi.html

    Tim


