Re: Stop Word Lists

Amsler, Robert (amsler@usmd1.dyniet.com)
Fri, 13 Sep 96 16:05:00 EST

It should be clear why there are no standard stop word lists.

Stop words are an application-dependent concept. They apply to the
particular database one is searching. If one is searching computer science
literature, then the word "computer" is usually put on the stop-word list by
the database creators.
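
To make the application dependence concrete, here is a minimal sketch that derives a stop-word list from the collection itself. The function name choose_stopwords, the max_doc_fraction threshold and the toy collections are all my own assumptions for illustration, not anything a particular database vendor does. The same rule gives a different list for each collection.

```python
from collections import Counter

def choose_stopwords(docs, max_doc_fraction=0.5):
    """Return words found in more than max_doc_fraction of the documents.

    docs is a list of token lists. The 0.5 default is an arbitrary,
    hypothetical tolerance; a real application would tune it.
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each word once per document (document frequency).
        doc_freq.update({t.lower() for t in tokens})
    limit = max_doc_fraction * len(docs)
    return {word for word, df in doc_freq.items() if df > limit}

# Toy collections, invented for illustration only.
cs_docs = [
    ["the", "computer", "compiles", "code"],
    ["a", "computer", "network"],
    ["the", "computer", "memory"],
]
news_docs = [
    ["the", "team", "won"],
    ["the", "computer", "crashed"],
    ["the", "down", "jacket"],
]

cs_stopwords = choose_stopwords(cs_docs)
news_stopwords = choose_stopwords(news_docs)
```

With these toy inputs "computer" lands on the list for the computer science collection but not for the news collection.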

It is commonly assumed that words which are not members of the
noun-verb-adjective classes should be on stopword lists. The difficulty here
is that most high frequency words have multiple parts of speech, er...
syntactic categories.
The word "down" is an adverb, preposition, transitive verb, adjective, and
has 3 distinct homographic noun senses with meanings in sports, geography
and clothing.
(I wonder whether there are whole classes of stories about "down" that
nobody has ever seen because IR systems keep excluding them from access as
function words.)

There are generally about 100 words that appear in the high frequency list
of a large corpus before the first noun that does not also have a function
word use (which is usually "time", incidentally).
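
If you want to check this on your own corpus, something like the following rough sketch would do. It assumes you already have a lexicon telling you which words can be nouns and which have a function-word use. The two small sets below are hypothetical stand-ins for such a lexicon, and rank_of_first_pure_noun is a name I made up.

```python
from collections import Counter

# Hypothetical toy lexicon; a real check would use a full dictionary.
FUNCTION_USE = {"the", "of", "a", "to", "in", "and", "is", "down"}
NOUNS = {"time", "down", "computer", "team"}

def rank_of_first_pure_noun(tokens, nouns=NOUNS, function_use=FUNCTION_USE):
    """Return (rank, word) for the most frequent noun that has no
    function-word use, or None if the corpus contains no such word."""
    counts = Counter(t.lower() for t in tokens)
    # most_common() lists words from highest to lowest frequency.
    for rank, (word, _count) in enumerate(counts.most_common(), start=1):
        if word in nouns and word not in function_use:
            return rank, word
    return None

# Toy input, invented for illustration only.
sample = "the time of the down the time of a".split()
result = rank_of_first_pure_noun(sample)  # (2, 'time') for this toy input
```

On a large corpus the rank returned is the number to compare against the "about 100" figure above.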

It all comes down to a matter of convenience and not science. One puts words
on stopword lists because the number of hits one gets for these words exceeds
one's level of tolerance for false hits in a given application. This
judgment may be idiosyncratic to the censor making the decision and should
probably be made solely on system performance criteria, rather than by adding
stopwords which are deemed useless merely because of their function word
status. Consider the following queries...

"Find all quotations of 'To be or not to be' in a corpus."

"Find news stories about a former US TV show called 'The A Team'."

"How much 'down' is used in the fashion industry per year?"

If one has the option to allow ALL words to be accessible if the user asks
for them within quotes or via some other alternative means, then that would
probably be preferable to not indexing them at all. One could even apply the
reverse heuristic that some words are assumed to be present in all articles
unless negatively indexed as NOT being in those items. Then "stopwords"
would merely be words whose index entries give the entries in which they DO
NOT appear, because that list is smaller than the list of entries in which
they do appear.
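
A minimal sketch of that reverse heuristic, under my own assumptions: the class name ComplementIndex, its methods and the toy documents are invented for illustration, and a real system would store the lists far more compactly. Each word keeps whichever list is shorter, the items it appears in or the items it does NOT appear in, plus a flag saying which one was stored.

```python
class ComplementIndex:
    """Toy inverted index that stores the shorter of two lists per word."""

    def __init__(self, docs):
        # docs: dict mapping a document id to its list of tokens.
        self.all_ids = set(docs)
        present = {}
        for doc_id, tokens in docs.items():
            for word in {t.lower() for t in tokens}:
                present.setdefault(word, set()).add(doc_id)
        # word -> (negated, ids); negated=True means ids are the
        # documents in which the word does NOT appear.
        self.postings = {}
        for word, ids in present.items():
            absent = self.all_ids - ids
            if len(absent) < len(ids):
                self.postings[word] = (True, absent)
            else:
                self.postings[word] = (False, ids)

    def is_negatively_indexed(self, word):
        """True if the word is stored as a list of non-occurrences."""
        return self.postings.get(word.lower(), (False, set()))[0]

    def lookup(self, word):
        """Return the ids of the documents that contain the word."""
        negated, ids = self.postings.get(word.lower(), (False, set()))
        return self.all_ids - ids if negated else set(ids)

# Toy documents, invented for illustration only.
docs = {
    1: ["the", "a", "team"],
    2: ["the", "down", "jacket"],
    3: ["the", "computer"],
    4: ["to", "be"],
}
index = ComplementIndex(docs)
```

Here "the" is stored negatively (only document 4 lacks it), while "down" keeps an ordinary list, and both remain fully searchable.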