Re: Corpora: Stop-list etc.

Ted E. Dunning (ted@aptex.com)
Mon, 20 Oct 1997 10:12:25 -0700

actually, adam is missing a very important fact about IR systems which
does give a principled reason for using stop lists.

in virtually all of the leading retrieval systems which support ranked
retrieval (there are some oddballs in this mix, but only a few), the
weight assigned to a retrieval term decreases with the number of
documents in which the term appears. any term which appears in every
document is given zero or near zero weight.
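
as a rough illustration of this kind of weighting, here is a minimal
sketch which assumes the common log(N/df) form of inverse document
frequency. that formula is only one of several variants and is not
claimed to be what any particular system uses; the function name
idf_weights and the toy documents are made up for the example.

```python
import math


def idf_weights(docs):
    """Return {term: log(N / df)} for a list of tokenized documents.

    N is the number of documents and df is the number of documents
    containing the term. The log(N / df) form is an assumption made
    for illustration; real systems use many variants.
    """
    n_docs = len(docs)
    df = {}
    for doc in docs:
        # Count each term at most once per document.
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Toy collection: "the" appears in every document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = idf_weights(docs)
```

with this toy collection, "the" gets a weight of exactly zero because
it appears in all three documents, while "dog", which appears in only
one, gets the largest weight.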

given this fact, it is an obvious economy to not store the information
about the occurrence of these words. this is very similar to other
sparse matrix techniques which avoid storing information about zero
elements. since most IR systems are at their hearts simply very large
matrix transpose and multiply systems, it is hardly surprising that
sparse matrix implementation techniques are used as much as possible.
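
the economy described above can be sketched as a small inverted index
which simply does not store postings for terms whose weight is zero or
nearly so. this is an illustration under assumptions, not a
description of any real system: the log(N/df) weight, the names
build_index and score, and the min_weight threshold are all
hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def build_index(docs, min_weight=1e-9):
    """Build a sparse term -> (weight, postings) index.

    Terms whose log(N / df) weight is at or below min_weight are not
    stored at all, which acts as an implicit stop list. Both the
    weighting formula and the threshold are illustrative assumptions.
    """
    n_docs = len(docs)
    postings = defaultdict(dict)  # term -> {doc_id: count in that doc}
    for doc_id, doc in enumerate(docs):
        for term in doc:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

    index = {}
    for term, plist in postings.items():
        weight = math.log(n_docs / len(plist))
        if weight > min_weight:
            index[term] = (weight, plist)
    return index


def score(index, query):
    """Score documents as a sparse matrix-vector product.

    Only the stored (nonzero) entries are touched; query terms that
    were dropped from the index contribute nothing.
    """
    scores = defaultdict(float)
    for term in query:
        if term in index:
            weight, plist = index[term]
            for doc_id, count in plist.items():
                scores[doc_id] += weight * count
    return dict(scores)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
index = build_index(docs)
with_stop_word = score(index, ["the", "cat"])
without_stop_word = score(index, ["cat"])
```

in this sketch "the" never makes it into the index, and the two
queries rank the documents identically, so nothing is lost by not
storing its postings.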

>>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:

ak> Einat Amitray wrote:

>> I'm not looking for the "right" list of words, but for the
>> reason behind using stop-lists at all. Is there an article
>> about the "'why's & 'why-not's?

ak> ... So the obvious
ak> hack is to exclude them.

ak> I don't think there is any theoretical justification for stop
ak> lists. The implicit assumption in much IR is that content can
ak> be assessed in isolation from form. ...