Re: Summary: Tokenization of Hyphenated Words

Christopher D Manning (manning+@andrew.cmu.edu)
Tue, 9 Apr 1996 10:00:42 -0400 (EDT)

Ray Liere's summary quotes Pascale Fung <pascale@cs.columbia.edu> as
saying:
> (3) Overall, I am not sure you need to take the hyphen out of any word in
> English though. "Arab-Israeli" can be simply treated as a single token for
> many NLP purposes.

On the contrary, for most corpora I think it is highly unsatisfactory not
to take out a large number of hyphens, that is, not to split such words
into their component tokens. In particular, this is true for any
texts that follow the common English stylistic convention of hyphenating
multiword pre-head-noun modifiers. A few triple-hyphen examples from Dow Jones
newswire:

the 90-cent-an-hour rise
a yet-to-be-formed entity
the back-on-terra-firma toast
the five-by-eight-inch looseleaf
an aggressive three-to-five-year direct marketing plan
a final "take-it-or-leave-it" offer
a still-to-be-named model
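
To make the contrast concrete, here is a minimal sketch in Python of the two
treatments under discussion: keeping a hyphenated form as one token versus
splitting it at word-internal hyphens. The function names and the
`split_hyphens` flag are illustrative assumptions, not any standard
tokenizer's API, and the word pattern is deliberately crude.

```python
import re

# Illustrative pattern: a run of word characters with optional
# hyphen-joined continuations, or a single punctuation character.
WORD_OR_PUNCT = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

# Splits only at hyphens that sit between two word characters; the
# capturing group makes re.split keep each hyphen as its own piece.
INTERNAL_HYPHEN = re.compile(r"(?<=\w)(-)(?=\w)")


def tokenize(text, split_hyphens=True):
    """Tokenize text, optionally splitting hyphenated forms (hypothetical helper)."""
    tokens = []
    for tok in WORD_OR_PUNCT.findall(text):
        if split_hyphens:
            # Keep the hyphens as tokens so the original form is recoverable.
            tokens.extend(p for p in INTERNAL_HYPHEN.split(tok) if p)
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    phrase = "the 90-cent-an-hour rise"
    print(tokenize(phrase, split_hyphens=False))
    print(tokenize(phrase, split_hyphens=True))
```

Keeping the hyphens as separate tokens is one design choice among several; it
lets a later stage decide whether to regroup the pieces as a single modifier.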

They're really common (single- and double-hyphen examples being commoner
than triple-hyphen examples, of course). A favorite example, which I use in
class every year, is the following quadruple-hyphen one:

The idea of a child-as-required-yuppie-possession must be
motivating them

This actually illustrates a different rule, whereby words can be hyphenated
to indicate that they are somehow perceived as a unitary concept, even when
they do not form a pre-head modifier. That also happens occasionally.
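
For anyone who wants to check how common such forms are in their own corpus,
a rough sketch like the following tallies hyphenated tokens by the number of
hyphens they contain. The function name and the simple word pattern are
assumptions made for illustration; real text would need more careful handling
of dashes, line-break hyphens, and numbers.

```python
import re
from collections import Counter

# Illustrative pattern: word characters joined by one or more internal hyphens.
HYPHENATED = re.compile(r"\w+(?:-\w+)+")


def hyphen_histogram(text):
    """Count hyphenated tokens by how many hyphens each contains (hypothetical helper)."""
    counts = Counter()
    for tok in HYPHENATED.findall(text):
        counts[tok.count("-")] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "a yet-to-be-formed entity, a pre-head modifier, "
        "a child-as-required-yuppie-possession"
    )
    for n_hyphens, n_tokens in sorted(hyphen_histogram(sample).items()):
        print(n_hyphens, n_tokens)
```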

Chris Manning.