Re: stop words list

asmeaton@CompApp.DCU.IE
Fri, 13 Sep 1996 09:21:24 +0100

Regarding stop word lists for languages other than English, Ted Dunning
says:

>
> you effectively already have them.
>
> if you have enough text in these languages to make indexing
> interesting, then it is literally just two hours work to work through
> the thousand most common words to select however many stop words you
> want to use.
>
True. The TREC benchmarking exercise for information retrieval systems has
a track which looks at evaluating IR for Spanish texts, which continues this
year (and which also looks at evaluating IR on Chinese texts). Most groups
who took part did one of two things: either they hand-translated a stop word
list from another language (usually English) into Spanish and then checked
the top few hundred most frequently occurring words in the corpus to see if
they had missed any, or they just scrolled through the top few hundred most
frequently occurring words, selecting stop words.
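
For anyone who wants to try either approach, here is a minimal sketch in
Python. It is not what any TREC group actually ran; the function names
(top_words, missed_candidates), the crude letters-only tokenizer and the
cut-off of 300 words are all assumptions made for illustration. It only
produces the candidate lists; choosing the stop words is still done by hand.

```python
import re
from collections import Counter


def top_words(text, n=300):
    """Return the n most frequent lowercased word tokens, most frequent first.

    Hypothetical helper: tokens are runs of Unicode letters, which is a
    simplification and not suitable for unsegmented text such as Chinese.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def missed_candidates(text, translated_stop_words, n=300):
    """Return frequent corpus words that a hand-translated list lacks.

    The result is a list for manual review, not a finished stop word list.
    """
    known = {w.lower() for w in translated_stop_words}
    return [w for w in top_words(text, n) if w not in known]
```

Calling top_words(corpus_text) gives the list to scroll through, and
missed_candidates(corpus_text, translated_list) gives only the frequent
words that the translated list does not already cover.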

Ted also says:

> since it is still pretty much early days for seriously multilingual
> IR, if you then distribute these lists, you will have effectively set
> the standard.
>
True again, in that there are no standard stop word lists for any language
and it is early days for multilingual IR, but that is not to say that there
is no progress. There was a workshop on MLIR after the SIGIR conference last
month (August), and some of the papers presented showed approaches to
multilingual and cross-lingual IR which were not just novel and interesting
but working, albeit not as effectively as monolingual IR. MLIR is a "hot"
topic; watch this space.

An attempt is being made to publish the proceedings of this workshop; if
it happens I'll keep this list posted.

- Alan Smeaton