Similarity and distance measures for document clustering

Bruce Lambert (bruce@ludwig.pmad.uic.edu)
Thu, 7 Dec 95 16:12:19 -0600

Hi folks,

I've been doing some document clustering experiments. Basically, I represent
documents as vectors of idf term weights, then I do a within-groups average
linkage hierarchical clustering in SPSS using cosine as my vector similarity
measure.
Lately, though, I've been using SAS to do the clustering because SAS provides
useful cluster stopping criteria (e.g., the cubic clustering criterion,
pseudo-F and pseudo-T^2). However, SAS requires a distance matrix to do
clustering, rather than a similarity matrix.

So, my question is this:

If cosine is my similarity measure, what would be the corresponding distance
measure? I think there are two possibilities.

1. Use 1-cosine.
2. Use sine (i.e., positive square-root of 1-cos^2).

Is one of these the "correct" distance measure? Comments appreciated.
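
To make the two options concrete, here is a minimal Python sketch that
computes both candidates and builds the kind of distance matrix a clustering
package would take as input. The function names and the toy weight vectors
are hypothetical, chosen only for illustration; this is not the SPSS or SAS
procedure itself, and it does not settle which candidate is "correct".

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_minus_cosine(u, v):
    """Candidate 1: distance = 1 - cos."""
    return 1.0 - cosine(u, v)


def sine_distance(u, v):
    """Candidate 2: distance = sin = +sqrt(1 - cos^2)."""
    c = cosine(u, v)
    # Clamp at zero so floating-point rounding cannot produce a negative
    # argument to sqrt when the vectors are (nearly) parallel.
    return math.sqrt(max(0.0, 1.0 - c * c))


def distance_matrix(docs, dist):
    """Full symmetric matrix of pairwise distances under `dist`."""
    return [[dist(u, v) for v in docs] for u in docs]


# Toy document vectors (made-up weights, three terms each).
docs = [
    [1.0, 0.0, 2.0],  # doc 0
    [0.0, 3.0, 0.0],  # doc 1: shares no terms with doc 0
    [2.0, 0.0, 4.0],  # doc 2: same direction as doc 0
]

d_one_minus_cos = distance_matrix(docs, one_minus_cosine)
d_sine = distance_matrix(docs, sine_distance)
```

Both candidates agree at the extremes: they give 0 (up to rounding) for
vectors pointing in the same direction and 1 for vectors that share no terms.
They differ in between, so the two matrices are not interchangeable as
clustering input.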

Bruce Lambert, Ph.D.
University of Illinois at Chicago