Re: [Corpora-List] Corpus-building for minority languages

From: Kevin Patrick Scannell (scannell@slu.edu)
Date: Fri Mar 19 2004 - 20:20:20 MET

Next message: bond@cslab.kecl.ntt.co.jp: "[Corpora-List] CFP: ACL-2004 Workshop on Multiword Expressions: Integrating Processing"

Previous message: Kevin Patrick Scannell: "[Corpora-List] Corpus-building for minority languages"
Next in thread: P bI K O B___ B.B. (MOCKBA): "[Corpora-List] Russian Corpora at Russian Congress"
Reply: P bI K O B___ B.B. (MOCKBA): "[Corpora-List] Russian Corpora at Russian Congress"

> Can you give a rough comparison (on the mailing list) of how this compares
> with CorpusBuilder, from Carnegie-Mellon University?

It is very similar. I didn't discover CorpusBuilder until my crawler
was almost complete; otherwise I probably would have just used
theirs! (NLP is not my main research area.)

Most of the differences are ones an end user wouldn't see or care about.
For instance, my method of "query generation" is quite different:
my goal is the broadest possible coverage, at the risk of having
to throw away a lot of documents not in the desired languages.

I also do some real "crawling": that is, following internal links in the
documents recursively (and giving up on branches when they stop
yielding new documents in the target language).
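
As a rough illustration only (this is not the actual crawler code), the
"give up on a branch" idea might be sketched as below. The names crawl,
fetch, in_target_language and max_misses are hypothetical, and the "web"
is a small in-memory dictionary rather than real HTTP, so the pruning
logic can be read in isolation.

```python
from typing import Callable, Dict, List, Set, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def crawl(start: str,
          fetch: Callable[[str], Page],
          in_target_language: Callable[[str], bool],
          max_misses: int = 1) -> List[str]:
    """Follow links depth-first and return the URLs judged to be in the
    target language. A branch is abandoned once more than `max_misses`
    consecutive pages along it fail the language test."""
    seen: Set[str] = set()   # every URL is visited at most once
    kept: List[str] = []

    def visit(url: str, misses: int) -> None:
        if url in seen:
            return
        seen.add(url)
        text, links = fetch(url)
        if in_target_language(text):
            kept.append(url)
            misses = 0           # the branch is productive again
        else:
            misses += 1
            if misses > max_misses:
                return           # branch has stopped yielding; give up
        for link in links:
            visit(link, misses)

    visit(start, 0)
    return kept


# Toy data (an assumption for the demo): pages "b" and "d" are off-language.
WEB: Dict[str, Page] = {
    "a": ("dia duit", ["b", "c"]),
    "b": ("hello", ["d"]),
    "d": ("world", ["e"]),
    "e": ("dia duit aris", []),
    "c": ("dia is muire duit", []),
}


def toy_filter(text: str) -> bool:
    # Stand-in for a real language filter.
    return "dia" in text


strict = crawl("a", WEB.__getitem__, toy_filter, max_misses=1)
lenient = crawl("a", WEB.__getitem__, toy_filter, max_misses=2)
```

With max_misses=1 the branch through "b" and "d" is dropped before "e"
is reached; a more tolerant setting finds "e" at the cost of fetching
more off-language pages. A real crawler would likely use an explicit
queue rather than recursion.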

Another difference seems to be that my software tries to
build the language filter on the fly. It seems, though,
that they've tried this too (here's the relevant chunk from their paper):

"Our approach performs well at collecting documents in a minority
language starting from a few words or documents but it does require
a language filter for that minority language. There are filters
available for quite a few language[s] but this is potentially
a limitation of our approach. In earlier work, we experimented
with constructing a filter on-the-fly... with... encouraging results."
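
Neither this message nor the quoted passage says how such a filter is
built, so the following is only one plausible sketch, assuming a
character-trigram profile that starts from a little seed text and grows
as documents are accepted. The class name OnTheFlyFilter, the 0.5
threshold and the sample strings are all hypothetical.

```python
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count character trigrams after lowercasing and squeezing whitespace."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


class OnTheFlyFilter:
    """Language filter whose profile is extended by each accepted document."""

    def __init__(self, seed_text: str, threshold: float = 0.5) -> None:
        self.profile = trigrams(seed_text)
        self.threshold = threshold

    def score(self, text: str) -> float:
        """Fraction of the text's trigram occurrences already in the profile."""
        grams = trigrams(text)
        total = sum(grams.values())
        if total == 0:
            return 0.0
        known = sum(n for g, n in grams.items() if g in self.profile)
        return known / total

    def accept(self, text: str) -> bool:
        """Accept the text if it scores high enough, then learn from it."""
        ok = self.score(text) >= self.threshold
        if ok:
            self.profile.update(trigrams(text))
        return ok


flt = OnTheFlyFilter("dia duit a chara")
```

Learning only from accepted documents keeps the profile from drifting
toward other languages, though a low threshold could still let noise
accumulate over a long crawl.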

-Kevin

>
>
> http://www-2.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/corpusbuilder/
>
> Mike Maxwell
> Linguistic Data Consortium
> maxwell@ldc.upenn.edu


This archive was generated by hypermail 2b29 : Fri Mar 19 2004 - 20:25:58 MET