My thanks to Gregory Grefenstette, Colin Matheson, and Pascale Fung
for the very helpful responses.
This posting provides a summary of their responses.
My original posting:
> I am working on a project that initially converts free form English
> text into "tokens", where each token is supposed to correspond to a
> word.
>
> To do this, I go through and convert all punctuation symbols into a
> space, and then simply break the lines into tokens at each space
> (or at the end of a line).
>
> This works fine, at least for an initial try, except for the hyphen
> in a hyphenated word. It seems that there are times when the "best"
> token would be simply to leave the hyphen in (or, equivalently, to
> completely remove it). For example, multi-valued --> multivalued.
>
> There are other times when it seems that the best action would be
> to replace the hyphen by a space, thus resulting in two tokens.
> For example, Arab-Israeli --> Arab Israeli
>
> I guess if I had to treat all hyphens the same, I would use it to
> break a word into 2 tokens.
>
> My current plan is to use a lookup table for, say, the situations where
> a hyphen is to be removed (multi-valued --> multivalued), and then
> treat any remaining hyphens as spaces.
>
> I am sure part of the difficulty is that many sources of text are
> a bit "creative" in the use of hyphens, in that many times one may
> witness the creation of new hyphenated words (not in any dictionary).
>
> I am wondering if there are any general rules that I am missing? Or if
> you can suggest places to look for further information?
>
> If you wish to email to me, I will post a summary.
>
> Thanks.
>
> Ray Liere
> Department of Computer Science
> Oregon State University
> lierer@mail.cs.orst.edu
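
The plan in the posting above (a lookup table for hyphens to remove, all
other hyphens treated as spaces) might be sketched roughly as follows.
This is only an illustration: the JOIN_HYPHEN table, its entries, and the
tokenize() name are made up here, and every other punctuation symbol is
simply treated as a separator, as in the original description.

```python
import re

# Hypothetical lookup table: hyphenated forms whose hyphen is simply removed.
JOIN_HYPHEN = {"multi-valued", "non-linear"}

# A candidate is a run of letters/digits, possibly with internal hyphens.
CANDIDATE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text, join_hyphen=JOIN_HYPHEN):
    """Split text into word tokens, resolving hyphens via a lookup table."""
    tokens = []
    for candidate in CANDIDATE.findall(text):
        if "-" not in candidate:
            tokens.append(candidate)
        elif candidate.lower() in join_hyphen:
            # Listed in the table: drop the hyphen, keep one token.
            tokens.append(candidate.replace("-", ""))
        else:
            # Not listed: treat each hyphen as a space.
            tokens.extend(candidate.split("-"))
    return tokens


print(tokenize("A multi-valued, Arab-Israeli issue."))
```

With the table above, "multi-valued" becomes the single token
"multivalued" and "Arab-Israeli" becomes the two tokens "Arab" and
"Israeli".
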
Responses:
==========
>From: Gregory Grefenstette <Gregory.Grefenstette@Grenoble.RXRC.Xerox.com>
> We describe a small experiment in dehyphenating lines
> in our paper:
>
> "What is a Word, What is a Sentence? Problems of Tokenization", in the
> proceedings of the conference on Computational Lexicography (COMPLEX'94),
> pages 79-87. ISBN 963 8461 78 0, Research Institute for Linguistics,
> Hungarian Academy of Sciences, Budapest, 1994.
>
> a copy of which is available on our WWW server at:
>
> http://www.xerox.fr/grenoble/mltt/articles/home.html
>
>
>
> ____________________________________________________________________________
>                              |
> Gregory Grefenstette         | E-Mail : grefen@xerox.fr OR
> Multilingual Theory          |   Gregory.Grefenstette@grenoble.rxrc.xerox.com
> and Technology               | Phone  : (33) 76 61 50 82
> Rank Xerox Research Centre   | Fax    : (33) 76 61 50 99
> _____________________________|______________________________________________
>From: Colin Matheson <colin@cogsci.ed.ac.uk>
> The BibEdit project that I worked on included a tokenisation stage.
> I've put a PostScript version of one of the relevant deliverables in
> our ftp directory. It's in old LaTeX, unfortunately, so I did a quick
> hack, but that means that a couple of graphics are missing. Anyway, to
> get it, the ftp address is <ftp.cogsci.ed.ac.uk>, and the file is in
> </pub/colin/citeread.ps>.
>
> Briefly, tokenisation is an extremely difficult operation to
> generalise - hyphens aren't the only things you get in "words" -
> particularly when you include proper names, dates, and so on.
> Similarly, many space-bounded "words" actually represent multiple bits
> of information - things like "pp23ff", "Colin.Matheson@ed", and
> zillions more.
>
> Colin
>From: Pascale Fung <pascale@cs.columbia.edu>
> I have an idea. How about:
>
> (1) Use dictionary lookup: any part that is a single word in the
> dictionary is treated as a single token, e.g. "Arab-Israeli" ->
> "Arab Israeli". But "multi-valued" -> "multivalued" because
> "multi" is not a word.
>
> (2) Use statistics. This takes care of the cases where a hyphen is inserted
> "creatively", i.e. if you see a word with a hyphen sometimes,
> e.g. "multi-valued", and without a hyphen at other times in the same text,
> "multivalued", just default to the case of a single token without a hyphen.
> If you see part of a word without a hyphen, by itself, elsewhere in the text,
> e.g. "The *Arab* countries", treat that as a single token as in "*Arab*-Israeli".
>
> (3) Overall, I am not sure you need to take the hyphen out of any word in
> English though. "Arab-Israeli" can be simply treated as a single token for
> many NLP purposes.
>
> regards
>
> pascale fung
> CS department
> Columbia University
> NY NY 10027
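
One possible reading of suggestions (1) and (2) above, as a rough sketch
only. The function name resolve_hyphens(), the dictionary argument (a set
of lowercase words), and the order in which the checks are applied are
all assumptions made for illustration; they are not taken from the
response itself.

```python
import re
from collections import Counter

# Words made of letters, optionally with internal hyphens.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
# Only the forms that contain at least one hyphen.
HYPHENATED = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")


def resolve_hyphens(text, dictionary):
    """Map each hyphenated form in text to the token(s) it should become."""
    # Counts of whole words as they occur; parts of a hyphenated form are
    # NOT counted here, only free-standing occurrences.
    counts = Counter(w.lower() for w in WORD.findall(text))
    decisions = {}
    for form in set(HYPHENATED.findall(text)):
        parts = form.split("-")
        joined = "".join(parts)
        if counts[joined.lower()] > 0:
            # (2) the unhyphenated spelling also occurs: one token, no hyphen.
            decisions[form] = [joined]
        elif all(p.lower() in dictionary or counts[p.lower()] > 0
                 for p in parts):
            # (1)/(2) every part is a dictionary word, or is seen by itself
            # elsewhere in the text: one token per part.
            decisions[form] = parts
        else:
            # (1) some part is not a word on its own: one token, no hyphen.
            decisions[form] = [joined]
    return decisions


sample = ("The Arab countries discussed the Arab-Israeli talks. "
          "A multi-valued map is multivalued.")
print(resolve_hyphens(sample, {"israeli", "valued"}))
```

Under suggestion (3), none of this would be needed: the hyphenated form
would simply be kept as a single token.
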
- - - end - - -