Re: Tokenizing Hyphenated Words in English

Colin Matheson (colin@cogsci.ed.ac.uk)
Thu, 04 Apr 96 14:10:18 +0100

> I am working on a project that initially converts free form English
> text into "tokens", where each token is supposed to correspond to a
> word.

> I am wondering if there are any general rules that I am missing? Or if
> you can suggest places to look for further information?

The BibEdit project that I worked on included a tokenisation stage.
I've put a PostScript version of one of the relevant deliverables in
our ftp directory. It's in old LaTeX, unfortunately, so I did a quick
hack, but that means that a couple of graphics are missing. Anyway, to
get it, the ftp address is <ftp.cogsci.ed.ac.uk>, and the file is in
</pub/colin/citeread.ps>.

Briefly, tokenisation is an extremely difficult operation to
generalise - hyphens aren't the only things you get in "words" -
particularly when you include proper names, dates, and so on.
Similarly, many space-bounded "words" actually represent multiple bits
of information - things like "pp23ff", "Colin.Matheson@ed", and
zillions more.
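
To make that concrete, here is a rough Python sketch of the kind of
special-casing involved. It is only an illustration, not the BibEdit
tokeniser: the patterns EMAIL, PAGES and WORD and the function
tokenize are names made up for this example, and they cover just the
cases mentioned above (hyphenated words, page references like
"pp23ff", and address-like strings).

```python
import re

# Hypothetical patterns for illustration only; real text needs many more.
EMAIL = re.compile(r"^[\w.+-]+@[\w.-]+$")       # e.g. Colin.Matheson@ed
PAGES = re.compile(r"^(pp?)\.?(\d+)(ff?)?$")    # e.g. pp23ff, p.5
WORD = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")  # keeps well-known, don't


def tokenize(text):
    """Split text on whitespace, then treat each chunk case by case."""
    tokens = []
    for chunk in text.split():
        # Drop punctuation that merely surrounds the chunk.
        core = chunk.strip(".,;:!?\"'()")
        if not core:
            continue
        if EMAIL.match(core):
            # An address-like string stays together as one token.
            tokens.append(core)
            continue
        pages = PAGES.match(core)
        if pages:
            # One space-bounded "word", several pieces of information.
            tokens.extend(part for part in pages.groups() if part)
            continue
        # Default: runs of letters and digits, joined by internal
        # hyphens or apostrophes.
        tokens.extend(WORD.findall(core))
    return tokens


if __name__ == "__main__":
    print(tokenize("See pp23ff, or mail Colin.Matheson@ed about well-known cases."))
```

Even this toy version needs a separate rule for each kind of "word",
and the order of the checks matters, which is why the general problem
is so hard to pin down.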

Colin