Summary #2: Tokenization of Hyphenated Words

Ray Liere (lierer@mail.CS.ORST.EDU)
Mon, 15 Apr 1996 11:02:05 -0700

On 3 April, I posted an inquiry about how to treat hyphenated words
when breaking English into tokens. I posted a summary of responses
on 8 April ... which in turn generated some more responses, which I
have included below in this "summary #2" posting.

The treatment of hyphens, and in fact punctuation in general, is definitely
more difficult than I had thought!

The responses below brought up a couple of points which I should comment on.

Robert Amsler brings up the issue of a hyphen at the end of a line possibly
indicating continuation of the word on the next line. This is definitely
an issue in general -- I did not mention it in my original posting because
this does not happen to occur in the corpus that I am using. But certainly
the points he raises regarding the myriad uses of hyphens are of great
interest and relevance (and unfortunately do make the tokenizing task much
more difficult).

James Monaghan asked why I want to do the tokenization in the first place.
I am attempting to transform text into a form that will more easily allow
a computer to learn the topic/s covered by the text. In machine learning,
a common method of representing the text in a document is to represent
it as an attribute vector, where each attribute is a token and each
attribute's value is (typically) either
1) 0 or 1, depending on whether or not the token is present in that
document; this is a boolean representation
-or-
2) the number of occurrences of the token in that document; this is
a frequency representation

So, for example, if the entire corpus (all documents) contains the words:
able am computer hard i in of science the working zebra
Then the document
I am working hard in the hard science of computer science
would be represented as
1) 0 1 1 1 1 1 1 1 1 1 0
-or-
2) 0 1 1 2 1 1 1 2 1 1 0
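
The two representations can be sketched in a few lines of Python. This is
only an illustration, not the code I am actually using: it assumes
lowercased, whitespace-delimited tokens and a fixed vocabulary listed in
alphabetical order, and the function names are made up for this example.

```python
from collections import Counter


def boolean_vector(document, vocabulary):
    """Return 1 for each vocabulary token present in the document, else 0."""
    present = set(document.lower().split())  # assumption: whitespace tokens
    return [1 if token in present else 0 for token in vocabulary]


def frequency_vector(document, vocabulary):
    """Return the number of occurrences of each vocabulary token."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]  # missing tokens count as 0


# The corpus vocabulary and the document from the example above.
vocabulary = "able am computer hard i in of science the working zebra".split()
document = "I am working hard in the hard science of computer science"

print(boolean_vector(document, vocabulary))
print(frequency_vector(document, vocabulary))
```

The printed lists correspond to representations 1) and 2) above. How a
hyphenated string is tokenized changes both the vocabulary and these vectors.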

This representation is admittedly very poor at representing the full
"meaning" of the document. I am using it as a baseline representation,
with the intention of comparing its results with richer representations.

This type of representation is, however, used in a great many information
retrieval and text categorization systems. Surprisingly (to me), it often
obtains quite good results. The development of better text representations
is an area of current research in information retrieval.

Below are summaries of the responses received from Christopher D Manning,
Robert Amsler, James Monaghan, and Mitch Marcus.

My thanks again to all who responded to my inquiry.

Ray Liere
lierer@mail.cs.orst.edu

Responses:
==========
>From: Christopher D Manning <manning+@andrew.cmu.edu>
> Ray Liere's summary quotes Pascale Fung <pascale@cs.columbia.edu> as
> saying:
> > (3) Overall, I am not sure you need to take the hyphen out of any word in
> > English though. "Arab-Israeli" can be simply treated as a single token for
> > many NLP purposes.
>
> On the contrary, for most corpora I think it is highly unsatisfactory to
> not take out a large number of hyphens. In particular this is true for any
> texts that follow the common English stylistic convention of hyphenating
> multiword pre-head-noun modifiers. A few triple-hyphen examples from Dow Jones
> newswire:
>
> the 90-cent-an-hour rise
> a yet-to-be-formed entity
> the back-on-terra-firma toast
> the five-by-eight-inch looseleaf
> an aggressive three-to-five-year direct marketing plan
> a final "take-it-or-leave-it" offer
> a still-to-be-named model
>
> They're really common (single and double hyphen examples being commoner
> than triple hyphen examples, of course). A favorite example, that I use in
> class every year, is the following quadruple-hyphen:
>
> The idea of a child-as-required-yuppie-possession must be
> motivating them
>
> This actually illustrates a different rule where things can be hyphenated
> to indicate that they are somehow perceived as a unitary concept, even when
> it's not a pre-head modifier. That also happens occasionally.
>
> Chris Manning.

>From: "Amsler, Robert" <amsler@dyncorp.com>
> Your postings seem to miss the general rules in English.
>
> First, there are three types of hyphens. The first are those dictated by
> the typography, as in the splitting of whole words into parts to perform
> justification of text. These hyphens are very difficult to deal with
> because English folds
> true-hyphens into end-of-line hyphens and telling whether an end-of-line
> hyphen is to be preserved or removed when the lines are put back together is
> computationally much more sophisticated than single text processing can
> handle. There are then hyphens which are in the lexicon and finally those
> dictated by the sentence/grammar of text. Lexical hyphens are largely a
> form of transition in language between open and closed word forms.
> Occasionally, dictionaries will make a point about distinctions in meaning,
> as between "data base" and "data-base" and "database", but I tend to think
> since most of the population doesn't understand their distinctions that
> you'll encounter all forms being used in all situations and disagreements
> between different dictionaries, both by publishers and by one publisher
> over relatively short time intervals.
>
> Third, sententially determined hyphenation is a mechanism to prevent
> incorrect parsing of the phrase in which the words appear. It's a part of
> the what-modifies-what rules. The most important thing about this type of
> hyphenation is that it is NOT lexical in nature, i.e., it is determined by
> sentential grammar and the hyphenation forms created have no longer term
> impact on the lexicon. They are not going to become entries in
> dictionaries; they are happenstance creations to meet immediate sentential
> grammar requirements.
>
> There are several types of hyphenation in this class. One is created when
> sentential fragments are used to modify nouns, as in the "x-ed" forms of
> "case-based", "computer-linked", "hand-delivered", etc. used to form
> adjectives which modify a head noun in a noun phrase.
>
> Another case involves a whole proverb, title or other expression used to
> modify a single noun, as in a "What-You-See-Is-What-You-Get computer
> interface", a "Bomb-them-back-to-the-stone-age military type", or a
> "Have-A-Nice-Day greeting".
> The general rule seems to be that you're trying to embed a multi-word open
> lexical item into a position in a sentence where only a single lexical word
> can reside; necessitating making the multi-word item into a single item.
>
> Then of course, there are the prefix and less commonly suffix hyphenation
> forms, as with the prefixes co-, pre-, meta-, multi-, etc. There is also
> the very specialized form of a conjunction of prefixes, as in an expression
> such as "pre- and post-processing".
> I also believe there are some hyphens added to avoid awkward morphology, as
> particularly seems appropriate with the co- prefix on words which begin with
> letters that seemingly would complete the prefix to form an initial
> compound. I always see "co-workers" as being preferable to "coworkers"
> because of the likelihood of "cow" being seen in the preceding and find
> "co-occurrence" easier to read than "cooccurrence". I suppose trying to
> describe something like "co-" applied to a word beginning with two o's would
> also dictate such formations, as in describing two people oogling someone
> as "co-oogling" or a creature which pre-dates an "eel" as a "pre-eel"
> rather than a "preeel".

>From: James Monaghan <J.Monaghan@herts.ac.uk>
> Dear Ray,
>
> First of all, as far as I know there are no universally accepted rules
> for hyphenisation (or even spelling hyphenization) in English.
>
> Secondly, and to me more important, there are even more problems with your
> underlying task. It seems to me that there are too many tokens in English
> orthography already. It seems to me that the following are already single
> lexical items and you muddy the waters if you divide them up.
>
> give up
> be ticked off
> rain cats and dogs
> monkey business
> dead cat bounce
> morning glory
> stars and stripes
>
> What is the point of tokenizing from a linguistic point of view?
>
> Just asking ;-)
>
> James
>
>
> ==============================================================================
> * Dr James Monaghan $
> * Speech and Language Technology $ Phone: (0)1707 285698
> * University of Hertfordshire $ Fax: (0)1707 285616
> * Watford Campus, Aldenham $ email: j.monaghan@herts.ac.uk
> * Watford, WD2 8AT, UK $
> ===============================================================================

>From: Mitch Marcus <mitch@linc.cis.upenn.edu>
> Note the problem in treating hyphenated strings as single words in
> "the New York-New Haven railroad".
>
> Mitch Marcus
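
To make the trade-off raised in these responses concrete, here is a rough
Python sketch of the two simplest policies: keeping every hyphenated string
as one token, and splitting at every hyphen. Both are deliberately naive
assumptions for illustration (the function names are hypothetical), and
neither handles end-of-line hyphens, prefixes, or quotation marks.

```python
import re


def tokenize_keep_hyphens(text):
    """Split on whitespace only, so a hyphenated string stays one token."""
    return text.split()


def tokenize_split_hyphens(text):
    """Split on whitespace and on hyphens, dropping empty pieces."""
    return [piece for piece in re.split(r"[\s-]+", text) if piece]


examples = [
    "the New York-New Haven railroad",  # Mitch Marcus's example
    "a yet-to-be-formed entity",        # one of Chris Manning's examples
]

for phrase in examples:
    print(tokenize_keep_hyphens(phrase))
    print(tokenize_split_hyphens(phrase))
```

Keeping hyphens yields the token "York-New" for the first phrase, which is
the problem Mitch Marcus points out. Splitting at every hyphen avoids that,
but breaks "yet-to-be-formed" into four separate tokens.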
- - - end - - -