Re: Summary #2: Tokenization of Hyphenated Words

Colin Matheson (colin@cogsci.ed.ac.uk)
Tue, 16 Apr 96 10:31:05 +0100

> From: Mitch Marcus <mitch@linc.cis.upenn.edu>
> > Note the problem in treating hyphenated strings as single words in
> > "the New York-New Haven railroad".

Just let me mention once again the tokenisation study undertaken for
the BibEdit project, and for the Editor's Assistant project, in
Edinburgh. Almost all the problems mentioned were looked at, and we
ended up using a fairly full chart parser to do the processing. The
basic discussion is in the (PostScript) paper at <ftp.cogsci.ed.ac.uk>
in </pub/colin/citeread.ps>. There's some stuff in the final report
too, so if anyone's interested, please let me know and I'll dig it
out.

Just a couple of quick points: hyphens are not the only word-internal
punctuation ("O'Grady", "Mel'cuk", "5:5:52", "B.B.C.", and suchlike),
and there is a complication of Marcus' example above in cases such as
"the inter- and intra-mural activities". The latter are surprisingly
common, at least in the data we were dealing with. We also assumed
that things like "5th May 1952" are tokens in the same sense as
"5-5-52", but clearly what you call these objects is pretty arbitrary.

The last point leads to Monaghan's question about the need for
tokenisation, and it does seem to me to come down to what you call
the stage at which you turn streams of characters into chunks.
Something clearly has to be done at the character level to handle
"23rd", "34.12mm", and so on, and whether you call the sub-parts "23"
and "rd", or their combination, `lexical items' or `tokens' is a
matter of choice, as far as I can see. Note that the choice of
numbers here is deliberate: it effectively means that a word list
would be infinitely large. From a `linguistic' point of view, I don't
know what the question is.
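
A similarly rough sketch of the character-level point, assuming
nothing about our actual rule set: "23rd" and "34.12mm" have to be
recognised by pattern rather than by lookup, since no finite word
list covers them, and the pattern naturally yields the sub-parts.

    import re

    # Number plus alphabetic suffix ("23rd", "34.12mm"): recognised by
    # pattern, with the sub-parts available if you prefer to call
    # those the tokens.
    NUM_SUFFIX = re.compile(r"(\d+(?:\.\d+)?)([A-Za-z]+)")

    def chunk(word):
        m = NUM_SUFFIX.fullmatch(word)
        return m.groups() if m else (word,)

    print(chunk("23rd"))     # ('23', 'rd')
    print(chunk("34.12mm"))  # ('34.12', 'mm')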

We certainly found it useful to treat separately the problem of
characterising the objects which a `normal' sentence grammar would
use, although in terms of the processing techniques and knowledge
representation involved, only the character-chunking stage is
different, in that you can easily use an FSM to do the work.
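
For what it's worth, the sort of FSM I have in mind for the character
chunking can be sketched in a few lines (again, illustrative rather
than the Edinburgh implementation): the states are just character
classes, and a chunk ends whenever the class changes.

    def chunk_chars(text):
        # States are character classes; emit a chunk on each change.
        # (A real chunker would need richer states, e.g. to keep
        # "34.12" together as a single number.)
        chunks, current, state = [], "", None
        for ch in text:
            new_state = ("digit" if ch.isdigit()
                         else "alpha" if ch.isalpha()
                         else "other")
            if state is not None and new_state != state:
                chunks.append((state, current))
                current = ""
            state, current = new_state, current + ch
        if current:
            chunks.append((state, current))
        return chunks

    print(chunk_chars("34.12mm"))
    # [('digit', '34'), ('other', '.'), ('digit', '12'), ('alpha', 'mm')]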

Colin