Re: Corpora: Sentence splitter

Keith J. Miller (keith@mitre.org)
Tue, 05 Oct 1999 12:42:37 -0400

Also available is Reynar and Ratnaparkhi's MXTERMINATOR. Their own
description reads:

MXTERMINATOR is a Java (JDK 1.1) implementation of the sentence boundary
detector described in:
Jeffrey C. Reynar and Adwait Ratnaparkhi. A Maximum Entropy Approach to
Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on
Applied Natural Language Processing, March 31-April 3, 1997. Washington,
D.C.

The distribution also includes support for training new models on your
own data.

More information can be found at
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/MXTERMINATOR.html

The referenced paper can be found at:
http://xxx.lanl.gov/ps/cmp-lg/9704002
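
To make the task concrete, here is a rough sketch in Python of the kind
of processing involved: treat every token ending in ".", "!" or "?" as a
candidate boundary, decide whether it really ends a sentence, and wrap
each sentence in tags for an aligner. This is NOT MXTERMINATOR's code or
its interface. The hand-written rule in is_boundary() is only a stand-in
for a trained model, and the abbreviation list, the function names and
the <s> tag are all assumptions made for illustration.

```python
# Hypothetical abbreviation list; a trained system would learn such cues
# from data instead of relying on a fixed list.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc.", "e.g.", "i.e."}


def is_boundary(token, next_token):
    """Hand-written stand-in for a trained boundary classifier."""
    # Ignore closing quotes and brackets when looking for the punctuation.
    core = token.rstrip("\"')]")
    if core[-1:] not in (".", "!", "?"):
        return False
    if core.lower() in ABBREVIATIONS:
        return False
    # A candidate on the final token always closes the text.
    if next_token is None:
        return True
    # Otherwise require the next token to look like a sentence start:
    # an uppercase letter or a digit, after any opening punctuation.
    first = next_token.lstrip("\"'([¿¡")[:1]
    return first.isupper() or first.isdigit()


def split_sentences(text):
    """Split whitespace-separated text into a list of sentences."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text without closing punctuation is kept as a sentence.
        sentences.append(" ".join(current))
    return sentences


def tag_sentences(text):
    """Wrap each sentence in <s>...</s>, one sentence per line."""
    return "\n".join(f"<s>{s}</s>" for s in split_sentences(text))


if __name__ == "__main__":
    print(tag_sentences("Dr. Smith arrived at 5 p.m. yesterday. He left early!"))
```

A fixed rule like this will fail on abbreviations it has not seen and on
capitalization conventions that differ between languages (German nouns,
for example), which is the sort of case a model trained on your own data
is meant to handle.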

----- Keith J. Miller
millerk@gusun.georgetown.edu
keith@mitre.org

Martin Wynne wrote:

> Dear Corporans,
>
> I have plain-text parallel corpora in French, German, Spanish and
> English which I would like to align automatically. However, all of the
> alignment programs that I have access to require sentence tags in the
> texts. Can anyone recommend a good sentence splitter, either for plain
> running text files or for files with minimal SGML markup (we have to
> split those too)? Ideally it would be free, easy to install and run
> under Unix (although DOS/Windows programs could be used), and work for
> all of these languages.
>
> Many thanks for any suggestions,
>
> Martin
>
> **********************************************************************
> Martin Wynne                 Multilinguale Forschung
> Visiting Research Fellow     Abteilung LEXIK
> wynne@ids-mannheim.de        Institut fuer deutsche Sprache
> Tel: +49 621 1581 427        R5, 6-13
> Fax: +49 621 1581 415        D-68161 Mannheim
> +49 621 1581 200
> **********************************************************************