Re: Corpora: Sentence splitting

Michael Barlow (barlow@ruf.rice.edu)
Fri, 16 Oct 1998 09:37:53 -0500 (CDT)

Tony Rose asks about sentence splitting.

I recently had to confront the problem of identifying sentence boundaries
because I wanted to allow a sentence display as well as a KWIC display in
developing a new version of a concordancer (MonoConc).

I adopted the usual conventions and exceptions, but worried about
abbreviations such as Dr. in languages other than English. I needed a
general default for those who do not want to list their own exceptional
cases. Since abbreviations are by their nature short, I decided that a
general rule for exceptions would be any string of 3 letters occurring
before a full stop in which at least one letter is upper case.
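
As a rough illustration of the default rule just described, here is a
minimal Python sketch. It is not the MonoConc code. The names
looks_like_abbreviation, split_sentences and MAX_ABBREV_LEN are made up
for the example, and it assumes that "string of 3 letters" means a word
of at most three letters (so that "Dr" is covered) and that a candidate
boundary is a full stop followed by whitespace or the end of the text.

```python
import re

# Assumption: "string of 3 letters" is read as "at most three letters",
# so that two-letter forms such as "Dr" count as exceptions.
MAX_ABBREV_LEN = 3


def looks_like_abbreviation(word):
    """True if the word before a full stop matches the default exception rule."""
    return (
        0 < len(word) <= MAX_ABBREV_LEN
        and word.isalpha()
        and any(ch.isupper() for ch in word)
    )


def split_sentences(text):
    """Split text at full stops, skipping those that follow a likely abbreviation."""
    sentences = []
    start = 0
    # Candidate boundary: a full stop followed by whitespace or end of text.
    for match in re.finditer(r"\.(?=\s|$)", text):
        # Run of letters (any script) immediately before the full stop.
        word = re.search(r"[^\W\d_]+$", text[start:match.start()])
        if word and looks_like_abbreviation(word.group()):
            continue  # treated as an abbreviation, not a sentence end
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no closing full stop.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this sketch, split_sentences("Dr. Smith arrived late. He went home.")
keeps "Dr. Smith arrived late." together as one sentence, because "Dr" is
short and contains an upper-case letter.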

If adopted for English (American), this will fail on sentences such as "He
lives in Tucson, AZ." However, it is better for my purposes to have two
sentences appear rather than a fragment. I have only just implemented this
feature and haven't yet tested it on a variety of languages, so I don't
really know how successful it is.

Michael
----------------------------------------------------------------------
Michael Barlow, Department of Linguistics, Rice University
barlow@rice.edu www.ruf.rice.edu/~barlow
Athelstan barlow@athel.com www.athel.com (U.S.) www.athelstan.com (UK)