Re: [Corpora-List] Corpus-building for minority languages

From: Kevin Patrick Scannell (scannell@slu.edu)
Date: Fri Mar 19 2004 - 20:20:20 MET

    > Can you give a rough comparison (on the mailing list) of how this compares
    > with CorpusBuilder, from Carnegie-Mellon University?

    It is very similar. I didn't discover CorpusBuilder until my own
    crawler was almost complete; otherwise I probably would have
    just used theirs! (NLP is not my main research area.)

    An end user wouldn't see or care about most of the differences.
    For instance, my method of "query generation" is quite different:
    my goal is the broadest possible coverage, at the risk of having
    to throw away a lot of documents that are not in the desired language.

    I also do some real "crawling": that is, following internal links in the
    documents recursively (and giving up on branches when they stop
    yielding new documents in the target language).
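
The crawling strategy described above can be sketched roughly as follows.
This is only an illustration, not the actual crawler: the names `crawl`,
`fetch`, `is_target` and `patience` are hypothetical, and "giving up on a
branch" is modelled here, as an assumption, by counting consecutive
off-language pages along a link path. A queue is used instead of literal
recursion, and fetching is passed in as a function so the sketch runs
without network access.

```python
from collections import deque


def crawl(seeds, fetch, is_target, patience=2):
    """Follow links from the seed URLs, keeping target-language pages.

    fetch(url) returns (text, links) or None if the page is unavailable.
    is_target(text) returns True if the text is in the target language.
    patience is the number of consecutive off-language pages tolerated
    along one link path before that branch is abandoned (an assumption).
    """
    seen = set(seeds)
    kept = []
    # Each queue entry carries the count of consecutive misses on its path.
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, misses = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        if is_target(text):
            kept.append(url)
            misses = 0  # the branch is productive again
        else:
            misses += 1
            if misses >= patience:
                continue  # give up on this branch: do not follow its links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, misses))
    return kept
```

With a larger `patience` the crawl reaches target-language pages hidden
behind a few off-language ones, at the cost of fetching more documents
that end up being thrown away.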

    Another difference seems to be that my software tries to
    build the language filter on the fly. It seems, though,
    that they've tried this too (here's the relevant chunk from their paper):

    "Our approach performs well at collecting documents in a minority
    language starting from a few words or documents but it does require
    a language filter for that minority language. There are filters
    available for quite a few language[s] but this is potentially
    a limitation of our approach. In earlier work, we experimented
    with constructing a filter on-the-fly... with... encouraging results."
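
To make "building the language filter on the fly" concrete, here is a
minimal sketch under stated assumptions. Neither this post nor the quoted
passage says how either filter works, so the character-trigram profile,
the class name `OnTheFlyFilter`, and the `threshold` value are all
hypothetical choices made for illustration. The filter starts from a
little seed text and grows its profile from every document it accepts.

```python
from collections import Counter


def trigrams(text):
    """Return the character trigrams of text, lowercased and space-padded."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class OnTheFlyFilter:
    """Hypothetical language filter that learns from accepted documents."""

    def __init__(self, seed_text, threshold=0.5):
        self.profile = Counter(trigrams(seed_text))
        self.threshold = threshold  # assumed cut-off, would need tuning

    def score(self, text):
        """Fraction of the text's trigrams already present in the profile."""
        grams = trigrams(text)
        if not grams:
            return 0.0
        known = sum(1 for g in grams if g in self.profile)
        return known / len(grams)

    def accept(self, text):
        """Accept text scoring above the threshold and learn from it."""
        ok = self.score(text) >= self.threshold
        if ok:
            self.profile.update(trigrams(text))
        return ok
```

One design risk with any filter of this kind is drift: a wrongly accepted
document also feeds the profile, so a real system would probably want a
stricter threshold early on or some manual check of the first accepted
documents.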

    -Kevin

    >
    >
    > http://www-2.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/corpusbuilder/
    >
    > Mike Maxwell
    > Linguistic Data Consortium
    > maxwell@ldc.upenn.edu


