RE : Stop words list

Lluís Padró (padro@lsi.upc.es)
Fri, 13 Sep 1996 11:06:59 UTC+0100

>> So, do you know where i could find these kind of list in French, German,
>> Italian, Spanish, Danish...but not in English.
>
>you effectively already have them.
>
>if you have enough text in these languages to make indexing
>interesting, then it is literally just two hours work to work through
>the thousand most common words to select however many stop words you
>want to use.
>
>since it is still pretty much early days for seriously multilingual
>IR, if you then distribute these lists, you will have effectively set
>the standard.
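
The procedure quoted above (take the most frequent words of your own
corpus, then pick the stop words by hand) could be started with something
like the following. It is only a rough sketch: the function name top_words,
the crude definition of a word (a run of letters) and the sample sentence
are all made up for the illustration.

```python
import re
from collections import Counter

def top_words(text, n=1000):
    """Return the n most frequent words in text, most frequent first."""
    # Crude tokenisation (an assumption): lowercase, runs of letters only.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _count in Counter(words).most_common(n)]

# Toy input; in practice this would be the text of your whole collection.
sample = "el perro y el gato y el ratón"
candidates = top_words(sample, n=2)
print(candidates)
```

The resulting list is only a set of candidates; choosing which of them
really are stop words is still the manual part.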

At least for Spanish, you can save yourself the work by connecting via anonymous ftp to

ftp-lsi.upc.es

and look in directory

pub/lluisp

I have put two files there:

empty.spanish
empty.english

which contain stop words in those languages.
(The Spanish file also contains POS tags for each word and lexical
probabilities for each tag, but you can just ignore them if you are
not interested.)
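
If you only want the words, a few lines are enough to drop the extra
fields. This sketch assumes that each line starts with the word, followed
by whitespace-separated tags and probabilities; that layout, the function
name load_stop_words and the sample lines are assumptions for the example,
so check them against the actual file.

```python
def load_stop_words(lines):
    """Collect the first whitespace-separated field of each non-empty line."""
    stop_words = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            stop_words.add(fields[0].lower())
    return stop_words

# Made-up sample lines; the real file layout and tag names may differ.
sample = ["de PREP 1.0", "la DET 0.9 PRON 0.1", "", "que"]
stop_words = load_stop_words(sample)
print(sorted(stop_words))
```

With a real file you would pass the open file object instead of the
sample list, since iterating over it also yields one line at a time.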

To recover the files you have to uudecode and uncompress them.
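
If you have no uudecode program at hand, the first step can be sketched
in Python with the standard binascii module. The function name uudecode
and the tiny sample below are made up for the illustration. The second
step is not covered: as far as I know the Python standard library has no
decoder for the Unix compress format, so the uncompress program is still
needed for that.

```python
import binascii

def uudecode(text):
    """Decode uuencoded text and return (file name, decoded bytes)."""
    lines = iter(text.splitlines())
    # Skip anything (e.g. mail headers) before the "begin <mode> <name>" line.
    for line in lines:
        if line.startswith("begin "):
            name = line.split(" ", 2)[2]
            break
    else:
        raise ValueError("no 'begin' line found")
    data = bytearray()
    for line in lines:
        if line.strip() == "end":
            break
        if not line.strip():
            continue  # ignore blank lines rather than decoding them
        data += binascii.a2b_uu(line)
    return name, bytes(data)

# Made-up sample: the three bytes "Cat" uuencoded under the name cat.txt.
sample = "begin 644 cat.txt\n#0V%T\n`\nend\n"
name, data = uudecode(sample)
print(name, data)
```

The decoded bytes would then be written to a file under the returned name
and passed to uncompress.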

good luck

Lluis

--------------------------------------------------------------------------------
Lluís Padró i Cirera
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
padro@lsi.upc.es
http://www-lsi.upc.es/~lluisp
--------------------------------------------------------------------------------
Edifici U                               | EUPVG
c/ Pau Gargallo 5         34-3-4017988  | 34-3-8967751  Av. Victor Balaguer s/n
08028 Barcelona           Fax 4017014   | Fax 8967700   08800 Vilanova i la Geltrú
Catalonia                               | Catalonia
--------------------------------------------------------------------------------