Re: Corpora: Stop-list etc.

Dan Melamed (melamed@unagi.cis.upenn.edu)
Mon, 20 Oct 1997 13:48:26 -0400 (EDT)

> frequency of the term. any term which appears in every document is
> given zero or near zero weight.
>
> given this fact, it is an obvious economy to not store the information
> about the occurrence of these words. this is very similar to other
>
> >>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:
>
> ak> I don't think there is any theoretical justification for stop
> ak> lists. The implicit assumption in much IR is that content can
> ak> be assessed in isolation from form. ...
>
>

The disagreement between Ted and Adam seems to arise from different
assumptions about what IR stoplists contain. Adam appears to assume
that stoplists contain (mostly?) closed-class words, which is indeed
the case in some systems, whereas Ted assumes that stoplists contain
frequent words. Obviously, not all closed-class words are frequent,
and not all frequent words are from the closed classes.
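
To make the distinction concrete, here is a minimal sketch in Python.
The toy documents, the tiny closed-class sample, and the 100% document
threshold are all assumptions made up for illustration; they are not
taken from any real system.

```python
from collections import Counter

# Toy corpus (hypothetical); each string is one document.
DOCS = [
    "the web page describes the web server",
    "a web crawler visits the web page despite the load",
    "the web index is stored on the server",
]

# Tiny hand-picked sample of closed-class words (hypothetical, far from complete).
CLOSED_CLASS = {"the", "a", "is", "on", "despite", "whereas"}


def frequent_words(docs, min_doc_fraction):
    """Return the words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document, however often it repeats there.
        doc_freq.update(set(doc.split()))
    return {w for w, n in doc_freq.items() if n / len(docs) >= min_doc_fraction}


# Frequency-based stoplist: words found in every document of the toy corpus.
FREQUENT = frequent_words(DOCS, 1.0)

# Frequent but open-class, and closed-class but not frequent.
FREQUENT_NOT_CLOSED = FREQUENT - CLOSED_CLASS
CLOSED_NOT_FREQUENT = CLOSED_CLASS - FREQUENT
```

In this toy corpus "web" ends up on the frequency-based list without
being a closed-class word, while "despite" and "whereas" are
closed-class words that the frequency criterion would never select.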

For the original poster, this is a clear example of how task-specific
stoplists must be. For IR, you'd want all highly frequent words in
your stoplists, regardless of their part of speech
(e.g., "web" is stoplisted on most Web search engines). On the other
hand, for author ID, you'd want to look *only* at closed-class words,
as demonstrated in authorship studies of _The Federalist Papers_.
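
The two uses can be sketched side by side. This is only an
illustration under stated assumptions: idf() uses the plain
log(N / df) form (real systems often smooth it), and both function
names and the function-word set passed in are hypothetical choices.

```python
import math
from collections import Counter


def idf(term, docs):
    """Plain inverse document frequency, log(N / df).

    The result is 0.0 when the term occurs in every document, which is
    why such a term contributes nothing to IR ranking.
    """
    doc_freq = sum(1 for doc in docs if term in doc.split())
    if doc_freq == 0:
        raise ValueError("term does not occur in the collection")
    return math.log(len(docs) / doc_freq)


def function_word_profile(text, function_words):
    """Relative frequency (per token) of each chosen function word.

    For author ID the closed-class words are the features to keep,
    so everything else is ignored here.
    """
    tokens = text.lower().split()
    if not tokens:
        return {w: 0.0 for w in sorted(function_words)}
    counts = Counter(t for t in tokens if t in function_words)
    return {w: counts[w] / len(tokens) for w in sorted(function_words)}
```

For IR, a term whose idf() is zero or close to it is a natural
stoplist candidate. For author ID, the same words are kept, and
profiles of texts by different candidate authors are compared.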

I. Dan Melamed melamed@linc.cis.upenn.edu
University of Pennsylvania http://www.cis.upenn.edu/~melamed/