Re: Corpora: Stop-list etc.

Paul Holmes-Higgin (paul@resumix.co.uk)
Wed, 22 Oct 1997 09:55:48 +0100

Next message: Chris Tribble: "Re: Corpora: spellcheckers"
Previous message: Lingua Systems: "Re: Corpora: spellcheckers"
Maybe in reply to: Einat Amitay: "Corpora: Stop-list etc."

I have to agree with Ken and Adam with respect to stop-lists - I've
always viewed them as a solution for a computation time/space
problem. For our CV (resume) processing system, high-frequency
words are essential for correctly identifying the content "terms"
for subsequent retrieval - when you're trying to match tens of
thousands of candidates against a job, you can't afford to have too
much scope for incorrect hits (particularly when your customers make
up a good part of Fortune 100 companies!).

I realise that a CV (resume) is a highly structured document and
relatively short (4-10 pages in the UK), and that we are placing some
interpretation on what is needed for retrieval. The classic example
for us is the difference between:

I work as the manager of an IT division
and
I work for the manager of an IT division

I always wondered why full text search systems were called full text
when they're missing 25% of a text - now I'm in the commercial world,
I know.
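
To make the example above concrete, here is a minimal sketch, assuming a toy stop-list and a naive whitespace tokenizer (both are hypothetical illustrations, not what our system actually uses). It shows that once "as" and "for" are stop-listed, the two sentences reduce to the same list of terms:

```python
# Hypothetical toy stop-list; real stop-lists are longer and vary by system.
STOP_WORDS = {"i", "as", "for", "the", "of", "an", "a"}


def content_terms(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop-listed tokens."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


as_manager = "I work as the manager of an IT division"
for_manager = "I work for the manager of an IT division"

# With the stop-list applied, the distinguishing word is gone.
same_after_filtering = content_terms(as_manager) == content_terms(for_manager)

# With every token kept, the two sentences remain distinct.
same_before_filtering = as_manager.lower().split() == for_manager.lower().split()

print(content_terms(as_manager))
print(same_after_filtering, same_before_filtering)
```

Under these assumptions a query for "manager" matches both sentences equally well, which is exactly the kind of incorrect hit described above.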

Regards
Paul

---
Dr Paul Holmes-Higgin
Director of Technology
Resumix Limited

-----Original Message-----
From: Ken Church <kwc@research.att.com>
To: ted@aptex.com <ted@aptex.com>; tdunning@aptex.com <tdunning@aptex.com>;
corpora-request@lists.uib.no <corpora-request@lists.uib.no>;
kwc@research.att.com <kwc@research.att.com>
Date: 22 October 1997 00:56
Subject: Re: Corpora: Stop-list etc.


>
>While there is much truth to what Ted is saying, one can argue that
>there may be more to the story than just statistical considerations.
>
>It is interesting to compare and contrast Information Retrieval and
>Author Identification.  Both fields use basically the same methods,
>except for the weighting strategy.  Content words are good
>discriminators for Information Retrieval whereas stylistic words are
>good discriminators for Author Identification.  I can see how the
>standard statistical considerations would discover much of this
>weighting strategy, especially for high frequency words, but I don't
>see how standard statistical considerations would capture the relevant
>distinctions for low frequency words.  I've argued elsewhere that the
>weighting scheme needs at least 2 variables (term frequency + ???) in
>order to capture the 4 possibilities:
>
>           | STYLISTIC   CONTENT
>-----------------------------------------------------
>HIGH FREQ  | the         government
>LOW FREQ   | whereas     aardvark
>
>One variable (e.g., term frequency) can only make a two-way
>distinction (e.g., 'the' vs. 'aardvark'), which isn't enough.  There
>are lots of different ways to think about the second variable:
>burstiness, variance over documents, IDF, semantic content, etc.  My
>hunch is that these are all basically equally good, but I can't defend
>this hunch right now.
>
>At any rate, I think Adam's question is really quite deep and deserves
>a lot of thought.
>
>Ken Church
>
>    Date: Mon, 20 Oct 1997 10:12:25 -0700
>    Reply-To: "Ted E. Dunning" <ted@aptex.com>
>    From: "Ted E. Dunning" <ted@aptex.com>
>    To: Adam.Kilgarriff@itri.brighton.ac.uk
>    CC: einat@cogsci.ed.ac.uk, corpora@hd.uib.no, korin@cstr.ed.ac.uk
>    In-reply-to: <199710201059.LAA00435@cabral.itri.brighton.ac.uk> (Adam.Kilgarriff@itri.brighton.ac.uk)
>    Subject: Re: Corpora: Stop-list etc.
>    Reply-to: tdunning@aptex.com
>    Sender: owner-corpora@lists.uib.no
>    Precedence: bulk
>    Resent-Date: Mon, 20 Oct 1997 19:14:54 +0200
>    Content-Type: text
>    Content-Length: 1384
>
>
>
>    actually, adam is missing a very important fact about IR systems which
>    does give a principled reason for using stop lists.
>
>    in virtually all of the leading retrieval systems which support ranked
>    retrieval (there are some oddballs in this mix, but only a few), the
>    weight assigned to a retrieval term is inversely proportional to the
>    frequency of the term.  any term which appears in every document is
>    given zero or near zero weight.
>
>    given this fact, it is an obvious economy to not store the information
>    about the occurrence of these words.  this is very similar to other
>    sparse matrix techniques which avoid storing information about zero
>    elements.  since most IR systems are at their hearts simply very large
>    matrix transpose and multiply systems, it is hardly surprising that
>    sparse matrix implementation techniques are used as much as possible.
>
>
>    >>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:
>
> ak> Einat Amitay wrote:
>
> >> I'm not looking for the "right" list of words, but for the
> >> reason behind using stop-lists at all. Is there an article
> >> about the 'why's & 'why-not's?
>
> ak> ... So the obvious
> ak> hack is to exclude them.
>
> ak> I don't think there is any theoretical justification for stop
> ak> lists.  The implicit assumption in much IR is that content can
> ak> be assessed in isolation from form. ...
>
>
>
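
Ted's point about inverse-frequency weighting and sparse storage in the quoted thread can be sketched roughly as follows. This is only an illustration: the four-document corpus is made up, and log(N / df) is assumed as the weighting formula because it is one common form of IDF, not necessarily what any particular retrieval system uses.

```python
import math

# Hypothetical four-document toy corpus, for illustration only.
DOCS = [
    "the government raised the tax",
    "the aardvark ate the ants",
    "the government fell whereas the market rose",
    "the market rose",
]


def term_frequency(term, docs):
    """Total number of occurrences of the term across all documents."""
    return sum(d.split().count(term) for d in docs)


def idf(term, docs):
    """Assumed weighting log(N / df): zero when the term is in every document."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0


def build_index(docs):
    """Inverted index that skips zero-weight terms, as a sparse matrix would."""
    index = {}
    for doc_id, d in enumerate(docs):
        for term in set(d.split()):
            if idf(term, docs) > 0.0:
                index.setdefault(term, []).append(doc_id)
    return index


index = build_index(DOCS)
for term in ["the", "government", "whereas", "aardvark"]:
    print(term, term_frequency(term, DOCS), round(idf(term, DOCS), 3), term in index)
```

In this toy corpus "the" gets weight zero and is never stored, which is the economy Ted describes. Note also that "whereas" and "aardvark" come out with identical frequency and weight here, which is consistent with Ken's remark that frequency statistics alone may not separate low-frequency stylistic words from low-frequency content words.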
