Summary: Tokenization of Hyphenated Words

Ray Liere (lierer@mail.CS.ORST.EDU)
Mon, 8 Apr 1996 07:00:48 -0700

On 3 April, I posted an inquiry about how to treat hyphenated words when
breaking English text into tokens.

My thanks to Gregory Grefenstette, Colin Matheson, and Pascale Fung
for the very helpful responses.

This posting provides a summary of their responses.

My original posting:
> I am working on a project that initially converts free form English
> text into "tokens", where each token is supposed to correspond to a
> word.
>
> To do this, I go through and convert all punctuation symbols into a
> space, and then simply break the lines into tokens at each space
> (or at the end of a line).
>
> This works fine, at least for an initial try, except for the hyphen
> in a hyphenated word. It seems that there are times when the "best"
> token would be simply to leave the hyphen in (or, equivalently, to
> completely remove it). For example, multi-valued --> multivalued.
>
> There are other times when it seems that the best action would be
> to replace the hyphen by a space, thus resulting in two tokens.
> For example, Arab-Israeli --> Arab Israeli
>
> I guess if I had to treat all hyphens the same, I would use it to
> break a word into 2 tokens.
>
> My current plan is to use a lookup table for, say, the situations where
> a hyphen is to be removed (multi-valued --> multivalued), and then
> treat any remaining hyphens as spaces.
>
> I am sure part of the difficulty is that many sources of text are
> a bit "creative" in the use of hyphens, in that many times one may
> witness the creation of new hyphenated words (not in any dictionary).
>
> I am wondering if there are any general rules that I am missing? Or if
> you can suggest places to look for further information?
>
> If you wish to email to me, I will post a summary.
>
> Thanks.
>
> Ray Liere
> Department of Computer Science
> Oregon State University
> lierer@mail.cs.orst.edu
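
The plan described in the original posting can be sketched roughly as
follows. This is only an illustration, not the code used in the project:
the names JOIN_TABLE and tokenize, and the contents of the table, are
hypothetical. Hyphenated words found in the table have the hyphen removed,
every other hyphen and all other punctuation become a space, and the text
is then broken at whitespace.

```python
import re

# Hypothetical lookup table: hyphenated forms whose hyphen should be removed.
JOIN_TABLE = {"multi-valued"}

# A run of word characters joined by single hyphens, e.g. "multi-valued".
HYPHENATED = re.compile(r"\w+(?:-\w+)+")
# Anything that is neither a word character nor whitespace is punctuation.
PUNCTUATION = re.compile(r"[^\w\s]")


def tokenize(text):
    """Split text into tokens, joining hyphenated words found in JOIN_TABLE."""
    def fix_hyphens(match):
        word = match.group(0)
        if word.lower() in JOIN_TABLE:
            return word.replace("-", "")   # multi-valued -> multivalued
        return word.replace("-", " ")      # Arab-Israeli -> Arab Israeli

    text = HYPHENATED.sub(fix_hyphens, text)
    text = PUNCTUATION.sub(" ", text)      # remaining punctuation -> space
    return text.split()                    # break at spaces and line ends


print(tokenize("A multi-valued attribute; the Arab-Israeli talks."))
# ['A', 'multivalued', 'attribute', 'the', 'Arab', 'Israeli', 'talks']
```

Handling the hyphens before the rest of the punctuation keeps the table
lookup simple, since the hyphenated form is still intact when it is checked.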

Responses:
==========
>From: Gregory Grefenstette <Gregory.Grefenstette@Grenoble.RXRC.Xerox.com>
> We describe a small experiment in dehyphenating lines
> in our paper:
>
> "What is a Word, What is a Sentence? Problems of Tokenization", in the
> proceedings of the Conference on Computational Lexicography (COMPLEX'94),
> pages 79-87. ISBN 963 8461 78 0, Research Institute for Linguistics,
> Hungarian Academy of Sciences, Budapest, 1994.
>
> a copy of which is available on our WWW server at:
>
> http://www.xerox.fr/grenoble/mltt/articles/home.html
>
>
>
> ____________________________________________________________________________
>                              |
> Gregory Grefenstette         | E-Mail : grefen@xerox.fr OR
> Multilingual Theory          |   Gregory.Grefenstette@grenoble.rxrc.xerox.com
> and Technology               | Phone  : (33) 76 61 50 82
> Rank Xerox Research Centre   | fax    : (33) 76 61 50 99
> _____________________________|______________________________________________

>From: Colin Matheson <colin@cogsci.ed.ac.uk>
> The BibEdit project that I worked on included a tokenisation stage.
> I've put a PostScript version of one of the relevant deliverables in
> our ftp directory. It's in old LaTeX, unfortunately, so I did a quick
> hack, but that means that a couple of graphics are missing. Anyway, to
> get it, the ftp address is <ftp.cogsci.ed.ac.uk>, and the file is in
> </pub/colin/citeread.ps>.
>
> Briefly, tokenisation is an extremely difficult operation to
> generalise - hyphens aren't the only things you get in "words" -
> particularly when you include proper names, dates, and so on.
> Similarly, many space-bounded "words" actually represent multiple bits
> of information - things like "pp23ff", "Colin.Matheson@ed", and
> zillions more.
>
> Colin
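
To illustrate Colin's point that one space-bounded "word" can carry several
pieces of information, here is a small sketch for the two examples he gives.
The patterns PAGE_REF and EMAIL and the function analyse are hypothetical
and cover only these two shapes; they are not taken from the BibEdit
deliverable.

```python
import re

# Hypothetical patterns for the two token shapes mentioned above.
PAGE_REF = re.compile(r"(pp?)(\d+)(ff?)?")    # e.g. "pp23ff"
EMAIL = re.compile(r"([\w.]+)@([\w.]+)")      # e.g. "Colin.Matheson@ed"


def analyse(token):
    """Return the pieces of information packed into one space-bounded token."""
    m = PAGE_REF.fullmatch(token)
    if m:
        return {"type": "pages",
                "start": int(m.group(2)),
                "and_following": m.group(3) is not None}
    m = EMAIL.fullmatch(token)
    if m:
        return {"type": "email", "user": m.group(1), "host": m.group(2)}
    return {"type": "word", "text": token}


for tok in ["pp23ff", "Colin.Matheson@ed", "multi-valued"]:
    print(tok, analyse(tok))
```

Converting all punctuation to spaces first would break the second example
into three unrelated tokens, so checks of this kind would have to run before
that step.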

>From: Pascale Fung <pascale@cs.columbia.edu>
> I have an idea. How about:
>
> (1) Use dictionary lookup: any part which is a single word in the
> dictionary is treated as a single token, e.g. "Arab-Israeli" ->
> "Arab Israeli". But "multi-valued" -> "multivalued" because
> "multi" is not a word.
>
> (2) Use statistics. This takes care of the cases where a hyphen is inserted
> "creatively", i.e. if you see a word with a hyphen sometimes,
> e.g. "multi-valued", and without a hyphen sometimes in the same text,
> "multivalued", just default to the case of a single token without a hyphen.
> If you see part of a word without a hyphen, by itself, sometimes in the text,
> e.g. "The *Arab* countries", treat that as a single token as in "*Arab*-Israeli".
>
> (3) Overall, I am not sure you need to take the hyphen out of any word in
> English though. "Arab-Israeli" can be simply treated as a single token for
> many NLP purposes.
>
> regards
>
> pascale fung
> CS department
> Columbia University
> NY NY 10027
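
Suggestions (1) and (2) above could be combined along the following lines.
This is one possible reading, not Pascale's implementation: DICTIONARY is a
tiny hypothetical stand-in for a real dictionary, resolve_hyphens is an
invented name, and the order in which the rules are tried (statistics on the
joined form first, then the dictionary) is an assumption.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real dictionary; contents are illustrative only.
DICTIONARY = {"arab", "israeli", "countries", "valued", "the"}

# A word, optionally containing internal hyphens.
WORD = re.compile(r"\w+(?:-\w+)*")


def resolve_hyphens(text, dictionary=DICTIONARY):
    """Tokenize text, resolving each hyphenated word by rules (1) and (2)."""
    raw = WORD.findall(text)
    # Counts of the unhyphenated words seen in this text (the "statistics").
    seen = Counter(w.lower() for w in raw if "-" not in w)

    tokens = []
    for word in raw:
        if "-" not in word:
            tokens.append(word)
            continue
        parts = word.split("-")
        joined = "".join(parts)
        if seen[joined.lower()]:
            # (2) The same word also occurs without a hyphen: one token.
            tokens.append(joined)
        elif all(p.lower() in dictionary or seen[p.lower()] for p in parts):
            # (1)/(2) Every part is a word by itself: separate tokens.
            tokens.extend(parts)
        else:
            # (1) Some part (e.g. "multi") is not a word: join into one token.
            tokens.append(joined)
    return tokens


print(resolve_hyphens("The Arab countries joined the Arab-Israeli talks "
                      "on multi-valued logic."))
# ['The', 'Arab', 'countries', 'joined', 'the', 'Arab', 'Israeli',
#  'talks', 'on', 'multivalued', 'logic']
```

Under suggestion (3), the hyphenated branch would instead keep the word
unchanged as a single token.
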
- - - end - - -