factor analysis

Ted Dunning (ted@crl.nmsu.edu)
Sun, 6 Aug 1995 22:16:47 -0600

Has anyone used, or does anyone know of someone who has used
factor analysis as a way of generalizing syntactic
or collocational structure?

there is an inherent problem with doing this. by using factor
analysis, you are implicitly using a least squares ideal for fitting.
this is essentially equivalent to the assumption of the normal
distribution and is pretty seriously unjustified. you will get good
results in some cases (with common words, or manufactured examples)
and pretty seriously strange results (such as negative predicted word
counts) in others. this is much less of a problem when you have large
units of text as in information retrieval (see the Latent Semantic
Indexing papers) or with sentence level or larger cooccurrence units
(see Hinrich Schuetze's ongoing saga). even in these cases, however,
the proponents of factor analysis have to rationalize pretty heavily
to deal with the problems posed by the basic least squares approach.
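the negative-count problem is easy to see directly: truncated SVD (the factorization behind LSI-style analyses) produces the least squares best rank-k fit to a count matrix, and nothing stops that fit from dipping below zero in cells where the true count is zero. a small sketch — the 3x3 word-by-context count matrix here is invented purely for illustration:

```python
import numpy as np

# a tiny made-up word-by-context count matrix (all counts nonnegative)
counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 2.0, 1.0],
                   [0.0, 1.0, 2.0]])

# rank-2 truncated SVD: the least squares best rank-2 approximation
u, s, vt = np.linalg.svd(counts)
approx = u[:, :2] @ np.diag(s[:2]) @ vt[:2, :]

# the fitted value for the (word 0, context 2) cell, whose true count
# is 0, comes out negative -- a "predicted count" that no probability
# model for counts would ever allow
print(approx[0, 2])  # about -0.146
```

with real corpus matrices (much sparser than this toy) the effect is pervasive: most zero cells get small nonzero fitted values, and roughly half of those are negative.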

my own strong preference is the use of likelihood ratio tests based
either on binomial or multinomial models. these tests handle the
problem of small counts very well and can analyse situations where
common and rare words are mixed together. an early bit of work i did
on this topic was described in Computational Linguistics volume 19,
number 1. as stated in that article, i can supply software which
implements the statistical methods described. i would also be happy to
assist (to a reasonable degree) people who are interested in doing
collocational analyses using this software.
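for readers who want the flavor of the binomial version before fetching the software: for a bigram it reduces to the G^2 (two-sided log likelihood ratio) statistic on a 2x2 table of counts. a minimal sketch along the lines of the CL 19(1) article — the "strong tea" counts in the usage lines are invented:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """two-sided log likelihood ratio (-2 log lambda, i.e. G^2) for a
    2x2 table of bigram counts:
        k11 = count(A then B)       k12 = count(A then not-B)
        k21 = count(not-A then B)   k22 = count(not-A then not-B)
    large values indicate the words are not independent; the statistic
    behaves well even when some counts are very small."""
    n = k11 + k12 + k21 + k22
    g2 = 0.0
    for k, row, col in ((k11, k11 + k12, k11 + k21),
                        (k12, k11 + k12, k12 + k22),
                        (k21, k21 + k22, k11 + k21),
                        (k22, k21 + k22, k12 + k22)):
        if k > 0:  # a zero cell contributes nothing (k log k -> 0)
            g2 += k * math.log(k * n / (row * col))
    return 2.0 * g2

# invented counts: "strong tea" against the rest of a tiny corpus
print(llr_2x2(10, 0, 0, 10))  # strongly associated: about 27.73
print(llr_2x2(5, 5, 5, 5))    # perfectly independent: 0.0
```

note that the result is never negative, unlike the fitted values a least squares factorization can produce.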

it should be noted that there are relatively simple extensions to
these methods which handle longer phrases or the cooccurrence of more
than two words with a specified context.
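one way to sketch such an extension (the function name and the example counts are my own, not from the article) is the same G^2 statistic applied to a general r x c contingency table: one row per candidate word or phrase, one column per context, so several words can be tested against a specified context at once under a multinomial model:

```python
import math

def llr_table(table):
    """G^2 likelihood ratio statistic for an r x c table of counts,
    the multinomial generalization of the 2x2 bigram test."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, k in enumerate(row):
            if k > 0:  # zero cells contribute nothing
                g2 += k * math.log(k * n / (row_sums[i] * col_sums[j]))
    return 2.0 * g2

# three words cross-tabulated against in-context / out-of-context counts
print(llr_table([[9, 1], [1, 9], [5, 5]]))  # context-dependent usage
print(llr_table([[3, 3], [7, 7], [2, 2]]))  # independent: 0.0
```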