Re: [Corpora-List] summary: free sentencizers; test different sentencizers with cgi script

From: Shlomo Yona (shlomo@cs.haifa.ac.il)
Date: Mon Mar 10 2003 - 10:27:13 MET


    On Mon, 10 Mar 2003, Joerg Schuster wrote:

    > I think one of the disadvantages of your program is that it stores
    > all data in main memory. You have to say something like
    >
    > my $sentences=get_sentences($in);
    >
    > Though this is very comfortable when dealing with small files, I would
    > rather say something like
    >
    > while(<>) {
    > print_sentences;
    > }
    >
    > Then huge files could easily be sentencized, too.

    The thing is that some of the decisions are made globally.
    Of course, the program needs no more than a reasonable
    window of text to make good decisions, but the size of that
    window is something the user should decide (according
    to the data available).

    Given a huge file, you can first chop it into smaller chunks
    (and you have the freedom to decide how to do that) and then
    feed the chunks to the Lingua::EN::Sentence module one at a time.
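
    As a rough sketch of that chunking idea: the helper below reads
    blank-line-separated paragraphs and hands them on in groups, so only
    one chunk is held in memory at a time. The name for_each_chunk, the
    callback interface, and the limit of 50 paragraphs are illustrative
    assumptions, not part of Lingua::EN::Sentence. It also assumes that
    no sentence crosses a paragraph boundary, which may not hold for
    every corpus.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Sentence): read paragraphs
# from $fh and pass them to $callback in chunks of at most $max_paragraphs.
sub for_each_chunk {
    my ($fh, $max_paragraphs, $callback) = @_;
    local $/ = "";    # paragraph mode: each read returns one blank-line-delimited block
    my @buffer;
    while (my $paragraph = <$fh>) {
        push @buffer, $paragraph;
        if (@buffer >= $max_paragraphs) {
            $callback->(join '', @buffer);    # hand over one full chunk
            @buffer = ();
        }
    }
    $callback->(join '', @buffer) if @buffer; # flush the final partial chunk
    return;
}

# Possible wiring, assuming Lingua::EN::Sentence is installed:
#   use Lingua::EN::Sentence qw(get_sentences);
#   for_each_chunk(\*STDIN, 50, sub {
#       my $sentences = get_sentences($_[0]);       # reference to a list of sentences
#       print "$_\n" for @{ $sentences || [] };     # guard in case nothing comes back
#   });
```

    The chunk size is the "window" knob: larger chunks give the module
    more context per call, smaller ones use less memory.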

    Taking input one line at a time will, in most cases, defeat the
    effort to determine the proper locations of sentence boundaries,
    because a sentence often spans more than one line.

    -- 
    Shlomo Yona
    shlomo@cs.haifa.ac.il
    http://cs.haifa.ac.il/~shlomo/
    


