Re: Corpora: Sentence splitting

Heui Seok Lim (limhs@nlp.korea.ac.kr)
Sat, 17 Oct 1998 10:12:09 +0900

Ted E. Dunning wrote:

> >> Tony Rose asks about sentence splitting.
>
> > Michael Barlow answers with a heuristic.
>
> The standard set of heuristics includes (but isn't limited to) the
> following thoughts:
>

I am surprised that Ted E. Dunning's heuristics are so similar to mine, even
though we do not know each other and have never discussed the problem.
Maybe this is because people think alike in many things.
In my experiments, all of these heuristics were very effective and accurate.
But I'd like to add some more heuristics.

> 0) Sentence boundaries occur at one of "." (periods), "?" or "!".
>

0-1) Sentence boundaries also occur at """, "'", "-" and "--", e.g. He said
"where are you headed?"

> a) In mixed case text, periods followed by whitespace followed
> by a lower case letter are not sentence boundaries.
>
> b) Periods followed by a digit with no intervening whitespace
> are not sentence boundaries.
>
> c) Periods followed by whitespace and then an upper case
> letter, but preceded by any of a short list of titles are not sentence
> boundaries. Sample titles include Mr., Mrs., Sr., HMS., Amb. and so
> on.
>

c) To get sample titles (abbreviations), it is better to extract them from
corpora.
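
One possible way to extract such abbreviations is sketched below in Python.
This is only an illustration and not the method I actually used: it assumes
that a short word which (almost) always appears with a trailing period is
an abbreviation candidate. The function name abbreviation_candidates and the
thresholds min_count, min_ratio and max_len are hypothetical and would need
tuning for a real corpus.

```python
import re
from collections import Counter

# A word made of letters, optionally followed by one period.
TOKEN = re.compile(r"[A-Za-z]+\.?")

def abbreviation_candidates(corpus, min_count=2, min_ratio=0.9, max_len=4):
    """Return short words that almost always occur with a trailing period.

    All thresholds are illustrative assumptions, not tuned values.
    """
    with_period = Counter()  # occurrences written as "word."
    total = Counter()        # all occurrences of the word
    for tok in TOKEN.findall(corpus):
        word = tok.rstrip(".")
        total[word] += 1
        if tok.endswith("."):
            with_period[word] += 1
    return {
        word + "."
        for word, n in with_period.items()
        if n >= min_count
        and len(word) <= max_len
        and n / total[word] >= min_ratio
    }

corpus = ("Mr. Kim met Mrs. Lee. Mr. Park saw Mrs. Choi. "
          "The dog ran. A dog sat.")
candidates = abbreviation_candidates(corpus)
```

On a small corpus, rare words that happen to occur only at the end of a
sentence will also pass the ratio test, so min_count should grow with the
corpus size and the resulting list should be checked by hand.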

As Ted E. Dunning said, the heuristics may not be accurate for your
particular texts.
For real text that includes ill-formed and crazy sentences, you may need to
modify the heuristics a little.
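
To make the discussion concrete, here is a minimal Python sketch of
heuristics 0), 0-1), a), b) and c). It is an assumption-laden illustration,
not anyone's actual implementation: the name split_sentences and the TITLES
set are hypothetical, heuristic 0-1) is interpreted only as "closing quotes
after the terminator belong to the sentence", and the dash cases are not
handled at all.

```python
import re

# Sample titles from heuristic c); a real list should come from a corpus.
TITLES = {"Mr.", "Mrs.", "Sr.", "HMS.", "Amb."}

# Heuristics 0) and 0-1): a terminator, any closing quotes, then whitespace
# or end of text. Requiring whitespace also covers heuristic b), because a
# period directly followed by a digit (as in "3.50") never matches.
BOUNDARY = re.compile(r"[.?!][\"']*(?=\s|$)")

def split_sentences(text, titles=TITLES):
    sentences = []
    start = 0
    for m in BOUNDARY.finditer(text):
        end = m.end()
        rest = text[end:].lstrip()
        # Heuristic a): period + whitespace + lower case letter.
        if m.group().startswith(".") and rest[:1].islower():
            continue
        # Heuristic c): the period belongs to a known title.
        last_word = text[start:end].split()[-1].lstrip("\"'(")
        if last_word in titles:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:  # keep any text after the last accepted boundary
        sentences.append(tail)
    return sentences

text = ('Mr. Smith paid 3.50 dollars. He said "where are you headed?" '
        'Then he left, i.e. went home.')
result = split_sentences(text)
# result == ['Mr. Smith paid 3.50 dollars.',
#            'He said "where are you headed?"',
#            'Then he left, i.e. went home.']
```

Each rule is a separate test inside the loop, so a rule can be relaxed or
replaced for a particular kind of text without touching the others.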

--
=========================================================================
Name     : Heui Seok Lim (Ph.D)
E-Mail   : limhs@nlp.korea.ac.kr
HomePage : http://nlp.korea.ac.kr/~limhs
Address  : Human & Computer Interaction Lab.,
             Samsung Advanced Institute of Technology, P.O.
             Box  111, Suwon 400-600, Korea.
Position : Member of Research Staff.
Phone    : (Lab) +82-331-280-8164 (FAX) +82-331-280-9208

``````````````` Love Everyone with my heart `````````````````````
=========================================================================