Corpora: stop-word lists

Bob Krovetz (krovetz@research.nj.nec.com)
Mon, 20 Oct 1997 17:33:20 -0400

Dan Melamed writes:

>This is a clear example for the original query-poster of how
>task-specific stoplists must be. For IR, you'd want all highly
>frequent words in your stoplists, regardless of their part of speech
>(e.g. "web" is stoplisted on most Web search engines.) On the other
>hand, for author ID, you'd want to look *only* at closed-class words,
>as demonstrated in _The Federalist Papers_.

That depends. Even a word like "web" on the internet, or "computer"
in a Computer Science collection, doesn't necessarily have to be on
a stop-word list. It depends on how much it would cost in terms of
space, and how it would affect the statistics. Those words aren't
as common as "the", "of", or "with". Also, we might have a closed
class word in a query which is relatively infrequent (such as "amongst"),
and if it isn't on a stop-word list, it would be given a high weight.
That isn't necessarily what we want (though it's possible that "amongst"
*could* be a good clue to relevant documents). Also, we do *not* want
all frequent words in the stoplist, regardless of their part of speech.
A word such as "down" should not be in the stop-word list when it
is used as a noun.
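
To make the weighting and part-of-speech points concrete, here is a
minimal sketch in Python. It assumes the common log(N/df) form of
inverse document frequency, which may not be the weighting a given
system uses. The toy collection, the names idf and is_stop, and the
tags in the stop list are all made up for illustration; in practice
the tags would come from a tagger.

```python
import math

# Toy collection (hypothetical); each document is a list of lowercase tokens.
docs = [
    "the history of the web".split(),
    "the web of trust".split(),
    "search engines index the web".split(),
    "a feud amongst the authors of the papers".split(),
    "the pillow is filled with goose down".split(),
    "the server went down with the web site".split(),
]

def idf(term, collection):
    """Inverse document frequency, log(N / df); 0.0 if the term never occurs."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

# Stop list keyed on (word, part-of-speech tag) instead of the word alone,
# so "down" is dropped as a preposition or adverb but kept as a noun.
# The tag names are illustrative assumptions.
STOP = {
    ("the", "DET"), ("of", "PREP"), ("with", "PREP"),
    ("down", "PREP"), ("down", "ADV"),
}

def is_stop(word, tag):
    """True if this word, used with this part of speech, is on the stop list."""
    return (word, tag) in STOP

for term in ("the", "web", "amongst"):
    print(f"{term:8s} idf = {idf(term, docs):.3f}")

print("down/ADV  stopped:", is_stop("down", "ADV"))
print("down/NOUN stopped:", is_stop("down", "NOUN"))
```

In this toy collection "the" occurs in every document and gets a weight
of zero, "web" is frequent but still gets a small positive weight, and
"amongst" gets the highest weight of the three because it occurs in
only one document. Whether that high weight is helpful is exactly the
open question.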

Bob