Tokenizing Hyphenated Words in English

Ray Liere (lierer@mail.CS.ORST.EDU)
Wed, 3 Apr 1996 18:18:17 -0800

I am working on a project that initially converts free form English
text into "tokens", where each token is supposed to correspond to a
word.

To do this, I go through and convert all punctuation symbols into a
space, and then simply break the lines into tokens at each space
(or at the end of a line).
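
A minimal sketch of the step just described, assuming Python and ASCII
punctuation only. The function name naive_tokenize is hypothetical and is
used purely for illustration; the original project's language is not stated.

```python
import string

def naive_tokenize(text):
    """Replace every ASCII punctuation character with a space, then split."""
    # Build a translation table mapping each punctuation character
    # (the hyphen included) to a single space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() with no argument breaks on any run of whitespace,
    # which also covers line ends.
    return text.translate(table).split()

print(naive_tokenize("The Arab-Israeli talks, it seems, are multi-valued."))
# ['The', 'Arab', 'Israeli', 'talks', 'it', 'seems', 'are', 'multi', 'valued']
```

Note that this version always splits at a hyphen, so "multi-valued"
comes out as two tokens.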

This works fine, at least for an initial try, except for the hyphen
in a hyphenated word. It seems that there are times when the "best"
token would be simply to leave the hyphen in (or, equivalently, to
completely remove it). For example, multi-valued --> multivalued.

There are other times when it seems that the best action would be
to replace the hyphen by a space, thus resulting in two tokens.
For example, Arab-Israeli --> Arab Israeli

I guess if I had to treat all hyphens the same, I would treat each one
as a break between two tokens.

My current plan is to use a lookup table for, say, the situations where
a hyphen is to be removed (multi-valued --> multivalued), and then
treat any remaining hyphens as spaces.
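
One way the lookup-table plan could be sketched, again assuming Python and
ASCII punctuation. The names JOIN_HYPHEN and tokenize are hypothetical, the
table holds only the single example from this message, and matching the
table case-insensitively is an assumption rather than part of the plan.

```python
import string

# Hypothetical lookup table: hyphenated forms whose hyphen should be removed.
JOIN_HYPHEN = {"multi-valued"}

# Every ASCII punctuation character except the hyphen becomes a space.
_PUNCT = string.punctuation.replace("-", "")
_TABLE = str.maketrans(_PUNCT, " " * len(_PUNCT))

def tokenize(text):
    tokens = []
    for chunk in text.translate(_TABLE).split():
        if chunk.lower() in JOIN_HYPHEN:
            # Listed in the table: drop the hyphen and keep one token.
            tokens.append(chunk.replace("-", ""))
        else:
            # Not listed: treat each hyphen as a space. Using split() here
            # also discards stray runs such as "--".
            tokens.extend(chunk.replace("-", " ").split())
    return tokens

print(tokenize("Multi-valued logic and Arab-Israeli talks."))
# ['Multivalued', 'logic', 'and', 'Arab', 'Israeli', 'talks']
```

Newly coined hyphenated words that are not in the table fall through to the
default and are split into separate tokens.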

I am sure part of the difficulty is that many sources of text are
a bit "creative" in the use of hyphens, in that many times one may
witness the creation of new hyphenated words (not in any dictionary).

I am wondering if there are any general rules that I am missing? Or if
you can suggest places to look for further information?

If you wish to email me directly, I will post a summary.

Thanks.

Ray Liere
Department of Computer Science
Oregon State University
lierer@mail.cs.orst.edu