Re: Summary #2: Tokenization of Hyphenated Words

Pascale Fung (pascale@cs.columbia.edu)
Tue, 16 Apr 1996 15:32:55 -0400 (EDT)

In Chinese and Japanese, where there is no space delimiter between words,
a tokenization step is necessary as preprocessing for many NLP
applications. We have found that this tokenization step is very dependent
on the actual application. For machine translation of lexical items, it
seems that a "normal form" of tokens of the source and target languages is
most desirable. Things like "O'Grady", "B.B.C." or even "New York-New Haven
Railway" are almost certainly tokenized as one word in their translations
in Chinese/Japanese. Many technical terms are also treated as one word in
these languages, whereas they are compound words in European languages. To
account for this difference, we perform two preprocessing steps before
trying to translate/align lexical items in the source and target
languages:

(1) Tokenize the texts independently; this results in "New York-New
Haven Railway" being tokenized as one word in Chinese/Japanese but as
multiple words in English. Depending on the method, "York-New" might
become one word. (We use tokenizers based on statistics,
morphological information and dictionary entries in Chinese/Japanese.)

(2) Group compound terms together into single tokens by using technical
term extractors. Terms incorrectly "broken up" in step (1) are regrouped,
e.g. "New York-New Haven Railway" then becomes one "term" in both
languages. (Our term extractors are based on morphological, syntactic
and statistical information in Japanese, Chinese and English.)
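
Just to make the regrouping idea in step (2) concrete, here is a minimal
sketch (not our actual system) in Python. It assumes a hard-coded term
list standing in for the output of a term extractor, and a naive
whitespace tokenizer standing in for step (1); any token sequence that
matches a known technical term is merged back into a single token.

    import re

    # Hypothetical term list; in practice this would come from a term
    # extractor using morphological, syntactic and statistical cues.
    TECHNICAL_TERMS = ["New York-New Haven Railway"]

    def tokenize_english(text):
        """Step (1): naive English tokenization by whitespace,
        so hyphenated pieces like "York-New" stay together."""
        return re.findall(r"\S+", text)

    def regroup_terms(tokens, terms):
        """Step (2): merge token sequences that match a known
        technical term back into a single token."""
        # Tokenize each term with the same tokenizer so matching
        # is consistent with step (1).
        term_tokens = [(t, tokenize_english(t)) for t in terms]
        out, i = [], 0
        while i < len(tokens):
            for term, parts in term_tokens:
                if tokens[i:i + len(parts)] == parts:
                    out.append(term)      # one token for the whole term
                    i += len(parts)
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out

    sentence = "the New York-New Haven Railway opened"
    print(regroup_terms(tokenize_english(sentence), TECHNICAL_TERMS))
    # ['the', 'New York-New Haven Railway', 'opened']

A real term extractor would of course supply the term list automatically
rather than relying on a fixed dictionary.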

We have not found it necessary to fully parse the sentences. Of course,
neither the tokenizers nor the technical term extractors are perfect, and
we do not deal with compound words other than technical terms (because
English collocations other than technical terms can sometimes correspond
to unconnected Chinese/Japanese words, and because common collocations are
less difficult to translate than technical terms). But this seems to be
working quite well as preprocessing for our bilingual technical term
lexicon compilation tasks.

I am interested to see whether others are doing the same thing when dealing
with Asian/Indo-European bilingual processing, or whether there are any
suggestions for better methods.

pascale