Corpora: stop-word lists

Bob Krovetz (krovetz@research.nj.nec.com)
Mon, 20 Oct 1997 15:05:40 -0400

Adam Kilgarriff writes:

>I don't think there is any theoretical justification for stop lists.
>The implicit assumption in much IR is that content can be assessed in
>isolation from form. This looks highly dubious, particularly now that
>the 'corpus' of greatest interest is not some tidy set of research
>abstracts, but the anything-goes mess of the web.

I don't agree that there is such an assumption. My own work has been
very concerned with the relationship between content and form in IR,
and I think there *is* interest in the IR community in the way form
can be used to characterize content. The concern there, though, is not
with content assessment per se, but with the way it can be used to
improve retrieval performance.

The corpora used in IR are not just a tidy set of research abstracts.
The Tipster collection contains several years' worth of Wall Street
Journal articles, Associated Press articles, Federal Register documents
(pretty awful, but I wouldn't describe them as research abstracts),
and other sources. It is comparable in size to the British National
Corpus.

Even if an IR collection were to consist of a "tidy set of research
abstracts", what does that have to do with the relationship between
content and form?

Part of the reason for stop-lists is their interaction with statistics,
and part is due to space efficiency (indexing on all the words in the
common stop-word lists would double the size of the index).
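
To make the space argument concrete, here is a minimal sketch in Python
that builds a toy positional index with and without a stop list and
counts the postings in each. The stop list, the two sample documents,
and the function names are all made up for illustration; they are not
taken from any particular IR system, and the ratio you get on a real
collection will depend on the collection and on the list used.

```python
from collections import defaultdict

# Hypothetical stop list, for illustration only; real lists are longer.
STOP_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "on"})

# Made-up sample documents.
DOCS = [
    "the size of the index is a function of the words in the text",
    "a stop list is a list of the common words to drop",
]


def build_index(docs, stop_words=frozenset()):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            # Positions are counted before filtering, so they still
            # refer to offsets in the original text.
            if token not in stop_words:
                index[token].append((doc_id, pos))
    return index


def posting_count(index):
    """Total number of postings, a rough proxy for index size."""
    return sum(len(postings) for postings in index.values())


full = build_index(DOCS)
pruned = build_index(DOCS, STOP_WORDS)

print("postings without stop list:", posting_count(full))
print("postings with stop list:   ", posting_count(pruned))
```

Counting postings is only a proxy for index size, since real systems
compress posting lists, but it shows where the saving comes from: a
handful of very frequent words account for a large share of the entries.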

Bob

krovetz@research.nj.nec.com