Re: Corpora: Stop-list etc.

Paul Holmes-Higgin (paul@resumix.co.uk)
Wed, 22 Oct 1997 09:55:48 +0100

I have to agree with Ken and Adam with respect to stop-lists - I've
always viewed them as a solution to a computational time/space
problem. For our CV (resume) processing system, high-frequency
words are essential for correctly identifying the content "terms"
for subsequent retrieval - when you're trying to match tens of
thousands of candidates against a job, you can't afford to have too
much scope for incorrect hits (particularly when your customers make
up a good part of Fortune 100 companies!).

I realise that a CV (resume) is a highly structured document and
relatively short (4-10 pages in the UK), and that we are placing some
interpretation on what is needed for retrieval. The classic example
for us is the difference between:

I work as the manager of an IT division
and
I work for the manager of an IT division
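
To make the example concrete, here is a minimal sketch of what a stop-list does to these two sentences. The stop-list and the content_terms helper are hypothetical, chosen only for illustration; this is not the Resumix system.

```python
# Hypothetical stop-list for illustration only; real lists are much longer.
STOP_WORDS = {"i", "a", "an", "as", "for", "of", "the"}

def content_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

as_manager = "I work as the manager of an IT division"
for_manager = "I work for the manager of an IT division"

# With the stop-list applied, the two sentences become indistinguishable.
print(content_terms(as_manager))
print(content_terms(for_manager))
print(content_terms(as_manager) == content_terms(for_manager))
```

Under these assumptions both sentences reduce to the same term list, so a retrieval system built on it cannot tell the manager from the person who reports to one.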

I always wondered why full text search systems were called full text
when they're missing 25% of a text - now I'm in the commercial world,
I know.

Regards
Paul

---
Dr Paul Holmes-Higgin
Director of Technology
Resumix Limited

-----Original Message-----
From: Ken Church <kwc@research.att.com>
To: ted@aptex.com <ted@aptex.com>; tdunning@aptex.com <tdunning@aptex.com>; corpora-request@lists.uib.no <corpora-request@lists.uib.no>; kwc@research.att.com <kwc@research.att.com>
Date: 22 October 1997 00:56
Subject: Re: Corpora: Stop-list etc.

>
>While there is much truth to what Ted is saying, one can argue that
>there may be more to the story than just statistical considerations.
>
>It is interesting to compare and contrast Information Retrieval and
>Author Identification. Both fields use basically the same methods,
>except for the weighting strategy. Content words are good
>discriminators for Information Retrieval whereas stylistic words are
>good discriminators for Author Identification. I can see how the
>standard statistical considerations would discover much of this
>weighting strategy, especially for high frequency words, but I don't
>see how standard statistical considerations would capture the relevant
>distinctions for low frequency words. I've argued elsewhere that the
>weighting scheme needs at least 2 variables (term frequency + ???) in
>order to capture the 4 possibilities:
>
>           | STYLISTIC     CONTENT
>-----------------------------------------------------
>HIGH FREQ  | the           government
>LOW FREQ   | whereas       aardvark
>
>One variable (e.g., term frequency) can only make a two-way
>distinction (e.g., 'the' vs. 'aardvark'), which isn't enough. There
>are lots of different ways to think about the second variable:
>burstiness, variance over documents, IDF, semantic content, etc. My
>hunch is that these are all basically equally good, but I can't defend
>this hunch right now.
>
>At any rate, I think Adam's question is really quite deep and deserves
>a lot of thought.
>
>Ken Church
>
>   Date: Mon, 20 Oct 1997 10:12:25 -0700
>   Reply-To: "Ted E. Dunning" <ted@aptex.com>
>   From: "Ted E. Dunning" <ted@aptex.com>
>   To: Adam.Kilgarriff@itri.brighton.ac.uk
>   CC: einat@cogsci.ed.ac.uk, corpora@hd.uib.no, korin@cstr.ed.ac.uk
>   In-reply-to: <199710201059.LAA00435@cabral.itri.brighton.ac.uk> (Adam.Kilgarriff@itri.brighton.ac.uk)
>   Subject: Re: Corpora: Stop-list etc.
>   Reply-to: tdunning@aptex.com
>   Sender: owner-corpora@lists.uib.no
>   Precedence: bulk
>   Resent-Date: Mon, 20 Oct 1997 19:14:54 +0200
>   Content-Type: text
>   Content-Length: 1384
>
>
>   actually, adam is missing a very important fact about IR systems which
>   does give a principled reason for using stop lists.
>
>   in virtually all of the leading retrieval systems which support ranked
>   retrieval (there are some oddballs in this mix, but only a few), the
>   weight assigned to a retrieval term is inversely proportional to the
>   frequency of the term. any term which appears in every document is
>   given zero or near zero weight.
>
>   given this fact, it is an obvious economy to not store the information
>   about the occurrence of these words. this is very similar to other
>   sparse matrix techniques which avoid storing information about zero
>   elements. since most IR systems are at their hearts simply very large
>   matrix transpose and multiply systems, it is hardly surprising that
>   sparse matrix implementation techniques are used as much as possible.
>
>
>   >>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:
>
>   ak> Einat Amitray wrote:
>
>   >> I'm not looking for the "right" list of words, but for the
>   >> reason behind using stop-lists at all. Is there an article
>   >> about the "'why's & 'why-not's?
>
>   ak> ... So the obvious
>   ak> hack is to exclude them.
>
>   ak> I don't think there is any theoretical justification for stop
>   ak> lists. The implicit assumption in much IR is that content can
>   ak> be assessed in isolation from form. ...
>
>
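
For readers following the quoted thread, Ted's argument (inverse-frequency weights, and not storing terms whose weight is zero) can be sketched roughly as below. The formula log(N/df) is one common form of IDF and is an assumption here, since the message does not name a specific weighting. The toy documents and the idf_weights and build_index helpers are likewise hypothetical.

```python
import math
from collections import defaultdict

# Toy collection, invented for illustration.
docs = {
    "d1": "the government of the country",
    "d2": "the aardvark and the government",
    "d3": "the aardvark sleeps",
}

def idf_weights(docs):
    """Weight each term by log(N / df); a term in every document gets 0."""
    n = len(docs)
    df = defaultdict(int)
    for text in docs.values():
        for term in set(text.split()):
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}

def build_index(docs, weights, min_weight=1e-9):
    """Sparse term -> {doc: score} index that skips (near-)zero weights."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            w = weights[term]
            if w > min_weight:
                index[term][doc_id] = index[term].get(doc_id, 0.0) + w
    return dict(index)

weights = idf_weights(docs)
index = build_index(docs, weights)
print(weights["the"])       # weight of a term found in every document
print(sorted(index))        # terms that are actually stored
```

In this sketch "the" occurs in every document, receives weight zero, and is never stored, which is the sparse-matrix economy Ted describes. It also shows the cost Paul points out: the function words are simply absent from the index, so nothing downstream can use them.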