Re: Corpora: Stop-list etc.

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Mon, 20 Oct 1997 11:59:46 +0100

Einat Amitay wrote:

> I'm not looking for the "right" list of words, but for the reason behind
> using stop-lists at all. Is there an article about the "'why's &
> 'why-not's?

It all depends on what you are doing (of course). Owing to the Zipfian
nature of language, corpus statistics readily become dominated by the
high-frequency, closed-class words of the language. If you are
interested in `content' rather than `form', this is not what you want,
as the closed-class words tell you nothing (directly) about content.
So the obvious hack is to exclude them.
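
To make the effect concrete, here is a minimal sketch in Python. The
toy text, the seven-word stop list, and the helper names (tokenize,
top_words) are all made up for illustration; no standard stop list
or corpus is implied.

```python
from collections import Counter
import re

# Toy text and toy stop list: both are illustrative assumptions,
# not a standard resource.
TEXT = (
    "the cat sat on the mat and the dog sat on the rug . "
    "the cat saw the dog and the dog saw the cat ."
)
STOP_LIST = frozenset({"the", "on", "and", "a", "of", "in", "to"})


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(tokens, n, stop_list=frozenset()):
    """Return the n most frequent tokens that are not in stop_list."""
    counts = Counter(t for t in tokens if t not in stop_list)
    return counts.most_common(n)


tokens = tokenize(TEXT)

# Share of all tokens taken up by the stop-listed (closed-class) words.
stop_share = sum(t in STOP_LIST for t in tokens) / len(tokens)

print("raw top words:     ", top_words(tokens, 3))
print("filtered top words:", top_words(tokens, 3, STOP_LIST))
print("stop-word share:   ", stop_share)
```

Even in a text this small the raw ranking is headed by "the", and the
stop-listed words account for a large share of the tokens; once they
are filtered out, the content words move to the top.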

I don't think there is any theoretical justification for stop lists.
The implicit assumption in much IR is that content can be assessed in
isolation from form. This looks highly dubious, particularly now that
the `corpus' of greatest interest is not some tidy set of research
abstracts, but the anything-goes mess of the web.

If you are interested in text typology, or author identification, then
you can take the stats as they come, as the closed-class-word info is
clearly of value.
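
As a rough sketch of the opposite case, the Python below keeps only
the closed-class words and compares two texts by their relative
frequencies. The word list, the helper names (profile, distance) and
the choice of a summed absolute difference are assumptions made for
illustration, not an established method.

```python
from collections import Counter
import re

# Hypothetical list of closed-class words to profile; a real study
# would choose its own list.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "upon", "while"]


def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


text_a = "the end of the road leads to the sea"
text_b = "upon reflection and while waiting in line"
print(distance(profile(text_a), profile(text_b)))
```

Two texts with the same profile come out at distance zero; the more
their use of the closed-class words differs, the larger the figure.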

Adam Kilgarriff

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow tel: (44) 1273 642919
Information Technology Research Institute (44) 1273 642900
University of Brighton fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ email: Adam.Kilgarriff@itri.bton.ac.uk
UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%