Re: [Corpora-List] summary: free sentencizers ; test different sentencizers with cgi script

From: Shlomo Yona (shlomo@cs.haifa.ac.il)
Date: Mon Mar 10 2003 - 10:27:13 MET

Next message: Rayson, Paul: "RE: [Corpora-List] Lancaster Anaphoric Treebank"

Previous message: Joerg Schuster: "Re: [Corpora-List] summary: free sentencizers ; test different sentencizers with cgi script"
In reply to: Joerg Schuster: "Re: [Corpora-List] summary: free sentencizers ; test different sentencizers with cgi script"

On Mon, 10 Mar 2003, Joerg Schuster wrote:

> I think one of the disadvantages of your program is that it stores
> all data in main memory. You have to say something like
>
> my $sentences=get_sentences($in);
>
> Though this is very convenient when dealing with small files, I would
> rather say something like
>
> while(<>) {
> print_sentences;
> }
>
> Then huge files could easily be sentencized, too.

The thing is that some of the decisions are made globally.
Of course the program does not need more than a reasonable
window of text to make good decisions, but the size of that
window is something the user should worry about (according
to the data available).

Given a huge file, you can first chop it into smaller chunks
(and you have the freedom to decide how to do that) and then
feed the chunks to the Lingua::EN::Sentence module one at a time.
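
A minimal sketch of that chunking approach, assuming the input separates
paragraphs with blank lines. The helpers next_chunk and sentencize_stream
are hypothetical names, not part of Lingua::EN::Sentence. The splitter is
passed in as a code reference, so get_sentences (which takes a string and
returns a reference to an array of sentences) can be plugged in.

```perl
use strict;
use warnings;

# Hypothetical helper: read up to $max_paragraphs blank-line-separated
# paragraphs from $fh and return them as one string (undef at end of input).
sub next_chunk {
    my ($fh, $max_paragraphs) = @_;
    local $/ = "";    # paragraph mode: one record per blank-line-separated block
    my @paragraphs;
    while (@paragraphs < $max_paragraphs) {
        my $paragraph = <$fh>;
        last unless defined $paragraph;
        push @paragraphs, $paragraph;
    }
    return @paragraphs ? join("", @paragraphs) : undef;
}

# Hypothetical driver: split each chunk with $splitter and hand every
# sentence to $handler, so only one chunk is held in memory at a time.
sub sentencize_stream {
    my ($fh, $max_paragraphs, $splitter, $handler) = @_;
    while (defined(my $chunk = next_chunk($fh, $max_paragraphs))) {
        my $sentences = $splitter->($chunk) or next;    # skip chunks with no sentences
        $handler->($_) for @$sentences;
    }
    return;
}

# Possible usage with the module discussed here (assumes it is installed):
#   use Lingua::EN::Sentence qw(get_sentences);
#   sentencize_stream(\*STDIN, 50, \&get_sentences, sub { print "$_[0]\n" });
```

Cutting on paragraph boundaries is only one possible choice. It keeps a
sentence from being split across two chunks, as long as no sentence spans
a blank line in the data at hand.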

Taking input one line at a time will, in most cases, defeat the
effort to determine the proper locations of sentence boundaries.

-- 
Shlomo Yona
shlomo@cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/

