Corpora: Sentence boundary code.

Chris Brew (Chris.Brew@edinburgh.ac.uk)
Fri, 16 Oct 1998 15:55:16 +0100 (BST)

Tony Rose wrote:
_______________________________________________________________________

Does anyone have any experience of developing simple algorithms or
regular expressions for detecting sentence boundaries in English text?
The naive solution would be simply to look for full stops (periods)
followed by whitespace, but this fails on strings such as "Dr. Smith".
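
To make the failure concrete, here is a minimal Python sketch of that
naive rule. The function name naive_split is hypothetical and is not
taken from any library or from the original message.

```python
import re


def naive_split(text):
    """Split after every full stop that is followed by whitespace."""
    # The lookbehind keeps the full stop attached to the preceding sentence.
    return re.split(r"(?<=\.)\s+", text.strip())


if __name__ == "__main__":
    for sentence in naive_split("Dr. Smith arrived. He sat down."):
        print(repr(sentence))
```

On this input the title "Dr." is split off as a sentence of its own,
which is exactly the failure described above.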

Indeed, the problem is common to so many NLP applications that it may
be reasonable to suggest that someone out there must have worked on
this and packaged up the result as a code 'module', to save others
the trouble. Yet if you examine the code of a great many NLP
applications, you find that people typically develop their
own solution each time.

So, to start the ball rolling, here's a Perl regular expression
for detecting sentences, suggested by one of my colleagues:

/
  (
    .+?         # match (non-greedy) anything ...
    [.!?]       # ... followed by any one of !?.
    [")]?       # ... and optionally " or )
  )
  (?=           # with lookahead that it is followed by ...
    (?:         # either ...
      \s+       # some whitespace ...
      ["(]?     # maybe a " or ( ...
      [A-Z]     # and capital letter
    |           # or ...
      \s*$      # optional whitespace, followed by end of string
    )
  )
/gx
;
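
For readers who want to try the pattern outside Perl, the following is
a rough Python transcription, offered as a sketch only. It assumes that
re.VERBOSE stands in for /x and that iterating with finditer stands in
for /g; the names SENTENCE_RE and split_sentences are hypothetical.

```python
import re

# Transcription of the Perl pattern above; re.VERBOSE plays the role of /x.
SENTENCE_RE = re.compile(
    r"""
    (
        .+?        # match (non-greedy) anything ...
        [.!?]      # ... followed by any one of !?.
        [")]?      # ... and optionally " or )
    )
    (?=            # with lookahead that it is followed by ...
        (?:        # either ...
            \s+    # some whitespace ...
            ["(]?  # maybe a " or ( ...
            [A-Z]  # and capital letter
        |          # or ...
            \s*$   # optional whitespace, followed by end of string
        )
    )
    """,
    re.VERBOSE,
)


def split_sentences(text):
    """Return each captured sentence, trimmed of surrounding whitespace."""
    return [m.group(1).strip() for m in SENTENCE_RE.finditer(text)]


if __name__ == "__main__":
    for sample in (
        "See fig. 3 for details. Next one.",
        "Dr. Smith arrived. He sat down.",
    ):
        print(split_sentences(sample))
```

Because the lookahead requires a capital letter, "fig. 3" is not
treated as a boundary in the first sample. In the second sample "Dr."
is still split off, since "Smith" begins with a capital, so an
abbreviation followed by a proper name defeats this pattern as well.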

Can anyone suggest a better algorithm/solution? It doesn't have to be
in Perl or any other particular language: pseudocode will do fine.
Also, does anyone know of any established test sets for evaluating
such algorithms? If people want to reply directly to me then I'll
summarise to the list.

(NB - I plan also to submit this question to a Perl mailing list, but
right now the experiences of the corpora community are of greater
interest to me.)

_______________________________________________________________________

This is a substantial task if taken seriously. My colleague Andrei
Mikheev has a solution which he used as an example in

A. Mikheev "Feature Lattices for Maximum Entropy Modelling" COLING-ACL '98
pp 848-854

He uses a set of 27294 test sentences, which he obtained from David Palmer
(this work is in Computational Linguistics 23 (2), pp 241-267) and
which was also used by Reynar and Ratnaparkhi (ANLP 97). Andrei gets
99.2477% correct.
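
For anyone wanting to score their own splitter on such a test set, here
is one possible way to compute a "percent correct" figure, sketched in
Python. It assumes that every '.', '!' or '?' counts as a candidate and
that a decision is correct when it agrees with the gold annotation; the
message does not say how the figure above was computed, and the function
name boundary_accuracy is hypothetical.

```python
def boundary_accuracy(text, gold_ends, predicted_ends):
    """Fraction of candidate marks (. ! ?) labelled the same way in both sets.

    gold_ends and predicted_ends hold character offsets of the punctuation
    marks judged to end a sentence.
    """
    candidates = [i for i, ch in enumerate(text) if ch in ".!?"]
    if not candidates:
        raise ValueError("text contains no candidate punctuation")
    gold, predicted = set(gold_ends), set(predicted_ends)
    # A candidate is correct when both sets agree on whether it is a boundary.
    agree = sum((i in gold) == (i in predicted) for i in candidates)
    return agree / len(candidates)


if __name__ == "__main__":
    text = "Dr. Smith arrived. He sat down."
    gold = [17, 30]          # the stops after "arrived" and "down"
    predicted = [2, 17, 30]  # a splitter that also breaks after "Dr"
    print(boundary_accuracy(text, gold, predicted))
```

Here two of the three full stops are labelled correctly, so the score
is 2/3. Counting precision and recall over boundaries would be an
equally reasonable choice.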

Of course, for your applications, you may be happy with a simpler solution
such as the one you provided in the original message.

Chris

Email: Chris.Brew@edinburgh.ac.uk
Address: Language Technology Group, HCRC,
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
Telephone: +44 131 650 4632 Fax: +44 131 650 4587