Re: Corpora: MWUs and frequency; try Relative Frequency

Rosie Jones (rosie@NL.CS.CMU.EDU)
Fri, 9 Oct 1998 13:46:00 -0400 (ADT)

I think the problem here is that plain frequency lists won't give you
the information you are looking for, and stripping stop-words is an
approximation, but not the best one.

Instead, you should compare how often the words occur together with how
often each word occurs on its own. Thus if you rank multi-word units by

freq(word1 next to word2)
-------------------------
freq(word1) * freq(word2)

you will get a ranking that puts "hot dog" above "in the" without any
need for stop-lists: "in" and "the" are each so frequent on their own
that the denominator swamps their co-occurrence count. You can extend
this to arbitrary numbers of adjacent words.
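
As a minimal sketch of this ranking (not a tuned implementation), the
Python below counts adjacent n-grams in a token list and divides each
count by the product of the individual word frequencies. The function
name rank_mwus, the toy corpus, and the min_count cut-off are my own
illustrative assumptions; the cut-off is there because the raw ratio
also rewards pairs of words that each occur only once.

```python
from collections import Counter
from math import prod  # Python 3.8+


def rank_mwus(tokens, n=2, min_count=2):
    """Rank adjacent n-word units by freq(unit) / product of freq(word)."""
    unigrams = Counter(tokens)
    # zip over n shifted copies of the token list yields adjacent n-grams
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    scored = [
        (gram, count / prod(unigrams[w] for w in gram))
        for gram, count in ngrams.items()
        if count >= min_count  # assumption: ignore units seen fewer times
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy corpus, invented purely for illustration
tokens = (
    "the hot dog is in the bun and the hot dog is in the box "
    "and the cat is in the house"
).split()

ranked = rank_mwus(tokens, n=2, min_count=2)
for gram, score in ranked:
    print(" ".join(gram), round(score, 3))
```

On this toy corpus "hot dog" scores 2 / (2 * 2) = 0.5, while "in the"
scores 3 / (3 * 6), about 0.17, so "hot dog" ranks first even though
"in the" is the more frequent pair. Passing n=3 scores three-word units
the same way.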

For more involved techniques, see for example the paper by
Chengxiang Zhai,
"Exploiting Context to Identify Lexical Atoms -- A Statistical View of Linguistic Context"
http://xxx.lanl.gov/abs/cmp-lg/9701001

Rosie Jones rosie@cs.cmu.edu
PhD student, Language Technology Institute,
207 Cyert Hall, Carnegie Mellon University
5000 Forbes Ave Pittsburgh, PA, 15213-3702, USA
http://www.cs.cmu.edu/~rosie/

> We are interested in frequency lists of multi-word units as a resource for
> the automatic indexing of texts. It is sometimes useful to consider MWUs
> such as 'liquid crystal display' and 'hot dog' instead of the individual
> words 'display', 'hot', 'liquid', 'dog' and 'crystal'. For this purpose,
> MWUs such as 'the project' and 'experience of' are obviously irrelevant,
> whereas 'British Council' and 'environmental education' are potentially
> good candidates. In order to get a list which consists mainly of potentially
> good candidates, it is important to disallow (a rather large number of)
> stop words at either end of the expression. Note that stop words should be
> allowed inside the expression so that MWUs such as 'table of contents' and
> 'Member of Parliament' won't be excluded.
>
> As far as I know, WordSmith Tools does not currently provide a facility to
> strip stop words from the ends of MWUs.
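
For comparison, the filter described in the quoted message (no stop
words at either end of the expression, stop words allowed inside) can
be sketched in a few lines of Python. The tiny STOP_WORDS set and the
name is_candidate are hypothetical and for illustration only; a real
stop list would be much larger, and this has nothing to do with how
WordSmith Tools works.

```python
# Illustrative stop list only; a real one would be far larger
STOP_WORDS = {"the", "of", "a", "an", "in", "to", "and"}


def is_candidate(mwu, stop_words=STOP_WORDS):
    """Keep an MWU only if neither its first nor its last word is a stop word."""
    words = mwu.lower().split()
    if len(words) < 2:
        return False  # a single word is not a multi-word unit
    # Stop words inside the expression are deliberately not checked
    return words[0] not in stop_words and words[-1] not in stop_words


# Examples taken from the quoted message
examples = [
    "table of contents",
    "Member of Parliament",
    "the project",
    "experience of",
    "British Council",
]
kept = [m for m in examples if is_candidate(m)]
print(kept)
```

With this stop list, 'the project' and 'experience of' are dropped,
while 'table of contents', 'Member of Parliament' and 'British Council'
are kept.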