Re: Corpora: Sentence splitting

Ted E. Dunning (ted@aptex.com)
Fri, 16 Oct 1998 09:25:30 -0700

>> Tony Rose asks about sentence splitting.

> Michael Barlow answers with a heuristic.

The standard set of heuristics includes (but isn't limited to) the
following thoughts:

0) Sentence boundaries occur at one of "." (periods), "?" or "!".

a) In mixed case text, periods followed by whitespace followed
by a lower case letter are not sentence boundaries.

b) Periods followed by a digit with no intervening whitespace
are not sentence boundaries.

c) Periods followed by whitespace and then an upper case
letter, but preceded by any of a short list of titles are not sentence
boundaries. Sample titles include Mr., Mrs., Sr., HMS., Amb. and so
on.

d) Periods internal to a sequence of letters with no adjacent
whitespace are not sentence boundaries (for example, www.aptex.com, or
the first period of e.g.).

e) Periods before double line-ends are probably sentence
boundaries.

f) Periods followed by certain kinds of punctuation (notably comma
and more periods) are probably not sentence boundaries.

These heuristics will get you pretty far. There are still some
serious problems such as the fact that Inc. can end a sentence, but it
often precedes a stock ticker symbol which is generally in upper
case, so the upper case letter that follows does not tell you which
situation you are in. Also, some punctuation such as parentheses
should be treated in a few cases as whitespace.
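
As a rough illustration, heuristics 0) through f) might be sketched in
Python as follows. The function names, the tiny TITLES set and the
order in which the rules are tried are all assumptions of this sketch,
not a description of any particular system, and it makes no attempt at
the Inc. or parenthesis problems.

```python
import re

# Hypothetical, deliberately short title list; a real one would be longer.
TITLES = {"Mr.", "Mrs.", "Sr.", "HMS.", "Amb."}


def is_boundary(text, i):
    """Guess whether the character at text[i] ends a sentence."""
    ch = text[i]
    if ch not in ".?!":                      # 0) only . ? ! are candidates
        return False
    if ch != ".":                            # the exceptions below concern periods
        return True
    rest = text[i + 1:]
    prev = text[i - 1] if i > 0 else ""
    nxt = rest[:1]
    if re.match(r"[ \t]*\n[ \t]*\n", rest):  # e) period before a double line-end
        return True
    if nxt.isdigit():                        # b) digit follows immediately
        return False
    if prev.isalpha() and nxt.isalpha():     # d) internal to a run of letters
        return False
    if nxt in {",", "."}:                    # f) comma or another period follows
        return False
    m = re.match(r"\s+(\S)", rest)           # first non-blank after whitespace
    if m and m.group(1).islower():           # a) lower case letter follows
        return False
    if m and m.group(1).isupper():           # c) title, then an upper case letter
        token = text[:i + 1].split()[-1]
        if token in TITLES:
            return False
    return True


def split_sentences(text):
    """Cut text after every position that is_boundary accepts."""
    sentences, start = [], 0
    for i in range(len(text)):
        if is_boundary(text, i):
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Mr. Smith paid 3.50 for it. See www.aptex.com, e.g. today. Is that all? Yes!"
for s in split_sentences(sample):
    print(s)
```

Putting rule e) ahead of the others is a judgment call in this sketch:
it lets a paragraph break win even when the next paragraph starts in
lower case.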

It should be noted that sentence boundary finders can have radically
different performance on different kinds of text. I have seen a
boundary finder which had 99+% accuracy on a variety of texts degrade
to 80% or less on a new, idiosyncratic kind of text.

If you care about these issues, it is important that you actually test
your boundary finder. I recommend building a small GUI which shows
you the sentence boundaries that your system finds and allows you to
manipulate these boundaries easily. With a system such as this, you
can also add a feature that highlights new errors and changes in
behavior whenever you change the boundary detection algorithm.
With a good annotation tool, you can very quickly develop a system
which achieves better than 99% accuracy on *your* text. Better still,
the side-effect of having such a tool is that you can quickly collect
a regression suite so that the performance of your system is
monotonically increasing over time.
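
The regression suite idea can be sketched without any GUI. What
follows is only a minimal illustration in Python: naive_split is a
stand-in splitter, and run_suite, behaviour_changes and the two SUITE
entries are hypothetical names and data invented for the example.

```python
import re


def naive_split(text):
    """Stand-in splitter for the demo: break after . ? ! plus whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def run_suite(splitter, suite):
    """Return the cases where the splitter disagrees with the checked answer."""
    failures = []
    for text, expected in suite:
        got = splitter(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures


def behaviour_changes(old, new, texts):
    """List the texts that two versions of the splitter segment differently."""
    return [(t, old(t), new(t)) for t in texts if old(t) != new(t)]


# Hand-checked segmentations, collected as errors are found and corrected.
SUITE = [
    ("It works. Really!", ["It works.", "Really!"]),
    ("Mr. Smith left.", ["Mr. Smith left."]),
]

for text, expected, got in run_suite(naive_split, SUITE):
    print("FAIL:", repr(text), "expected", expected, "got", got)
```

Each boundary you correct by hand becomes another SUITE entry, and
behaviour_changes run over unlabelled text shows what a change to the
algorithm actually did before you accept it.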