Re: Similarity and distance measures for document clustering

Ted Dunning (ted@crl.nmsu.edu)
Thu, 7 Dec 1995 15:52:34 -0700 (MST)

If cosine is my similarity measure, what would be the corresponding
distance measure? I think there are two possibilities.

1. Use 1-cosine.
2. Use sine (i.e., positive square-root of 1-cos^2).

Is one of these the "correct" distance measure? Comments appreciated.

there probably isn't any "correct" distance measure.

1-cosine is historically a bad thing to use for small angles (which is
why versines and haversines were invented). of course, in umpty umpty
thousand dimensional space, small angles are hard to come by.
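
a minimal sketch of the two candidate distances from the question, plus
the small-angle problem with 1-cosine. the function names and the test
angle are illustrative assumptions, not anything from the original
question. the stable form uses the identity 1 - cos(t) = 2 sin^2(t/2),
which is the idea behind haversine-style formulas.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def one_minus_cosine(u, v):
    """Candidate 1: distance = 1 - cos."""
    return 1.0 - cosine(u, v)

def sine_distance(u, v):
    """Candidate 2: distance = sin = +sqrt(1 - cos^2)."""
    c = cosine(u, v)
    # Clamp at zero in case rounding pushes c*c slightly above 1.
    return math.sqrt(max(0.0, 1.0 - c * c))

def versine_naive(theta):
    """1 - cos(theta), computed directly; cancels badly for small theta."""
    return 1.0 - math.cos(theta)

def versine_stable(theta):
    """Same quantity via 2*sin^2(theta/2), which avoids the cancellation."""
    s = math.sin(theta / 2.0)
    return 2.0 * s * s
```

for a tiny angle such as 1e-9 radians, versine_naive loses essentially
all of its significant digits in double precision, while versine_stable
stays close to theta^2/2. for angles that are not small the two agree.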

you could also try using something like any of the many statistical
tests for difference. everybody here must know by now that one of my
favorites is G^2.
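
for reference, a sketch of G^2 (the log-likelihood ratio statistic) on a
2x2 table of counts. how the table gets built from a pair of documents
is not spelled out here, so the argument names are only an assumption:
for example, k11 and k12 could be counts of a term and of all other
terms in one document, with k21 and k22 the same counts in another.

```python
import math

def g2(k11, k12, k21, k22):
    """G^2 = 2 * sum(observed * ln(observed / expected)) for a 2x2 table.

    Expected counts come from the row and column totals under the
    assumption that rows and columns are independent.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 -> 0)
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total
```

a table whose rows have the same proportions gives a value of zero, and
the value grows as the rows diverge, so larger means more different.
turning that into something a particular clustering algorithm accepts
as a distance is left open.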

depending on your clustering algorithm, just using a rank score may
work as well as anything else. most clustering algorithms are *not*
going to like this, though.
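
one possible reading of "rank score", sketched under the assumption
that it means replacing each raw similarity or distance value by its
position in sorted order. the helper name is hypothetical.

```python
def rank_scores(values):
    """Replace each value by its 1-based rank in ascending order.

    Ties are broken by original position, since sorted() is stable.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks
```

ranks keep only the ordering of the values and throw away their
magnitudes, so any algorithm that averages or otherwise does arithmetic
on distances will see something quite different from the raw scores.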