Re: factor analysis

Steve Finch (steve@cogsci.ed.ac.uk)
Mon, 07 Aug 95 13:26:34 +0100

>Vis-a-vis least squares problems for factor analysis, this is
>out of my field, but least squares is just one way to calculate
>factors, isn't it? I understand it as just one means to an end, and
>incidental to the main point of factoring out correlations to represent
>data in the smallest number of consistent dimensions. If you could find
>ways other than least squares of matching your data points to a model
>distribution then there would not be a problem, would there?

And therein lies the problem. This form of data reduction is a form
of (lossy) data compression. There are strong equivalences between
being able to model your data adequately in statistical terms and
being able to compress it: if you can compress well you can model
well, and vice versa. What strikes me about PCA, SVD (LSI) and FA is
that they are a form of "toolism" (common in neural network papers):
throwing existing tools at data WITHOUT bothering to do the modelling.
As Ted Dunning points out, the results are consequently mixed. The PCA
examples also provide proof positive that the little demonstrations we
did in class, showing that uncorrelated variables are not necessarily
independent, are the usual case and not the exception.
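
The classroom demonstration mentioned above can be reproduced in a few
lines. This is a minimal sketch, assuming NumPy; the variables `x` and
`y` are illustrative and are not taken from any of the PCA examples
under discussion. `y` is completely determined by `x`, yet the two are
uncorrelated:

```python
import numpy as np

# x is symmetric about zero; y is a deterministic function of x.
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

# Pearson correlation: zero up to rounding, because x**3 sums to zero
# over a grid that is symmetric about the origin.
corr = np.corrcoef(x, y)[0, 1]

# Dependence: conditioning on |x| > 0.5 shifts the mean of y.
mean_y_all = y.mean()
mean_y_tail = y[np.abs(x) > 0.5].mean()

print(f"correlation      = {corr:.2e}")
print(f"E[y]             = {mean_y_all:.3f}")
print(f"E[y | |x| > 0.5] = {mean_y_tail:.3f}")
```

A method that only removes correlation (as PCA does) would leave this
pair untouched, even though one variable carries all the information
about the other.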

Of course, no one currently knows the distribution of the data or the
most pertinent statistical regularities, but that is the crux of the
problem, and the underlying model of FA (mixing distributions with
residuals) requires justification.

Technically, in FA, least squares arises from the assumption that the
residuals are normal and independent. While it is true that you can
change this assumption, most reasonable alternative choices lead to
intractable solutions. Violations of the independence assumption blow
the problem out of the water, yet they almost certainly occur in
reasonable interpretations of the FA model applied to many NLP
problems.
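
To spell out the step from normality to least squares, here is a
sketch under stated assumptions: the model is written as
x_i = \Lambda f_i + e_i for observations i = 1..n of dimension p, and,
as a simplification, every residual component has the same variance
\sigma^2 (allowing a separate variance per variable gives a weighted
version of the same sum). The symbols are chosen here for illustration
only.

```latex
\begin{align*}
x_i &= \Lambda f_i + e_i, \qquad
  e_i \sim \mathcal{N}(0, \sigma^2 I_p) \text{ independently for } i = 1, \dots, n, \\
\log L(\Lambda, f_1, \dots, f_n)
  &= \sum_{i=1}^{n} \log \mathcal{N}(x_i - \Lambda f_i \mid 0, \sigma^2 I_p) \\
  &= -\frac{np}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \lVert x_i - \Lambda f_i \rVert^2 .
\end{align*}
```

For fixed \sigma^2, maximising this likelihood is the same as
minimising the sum of squared residuals. The sum over i comes from
independence and the squared norm from normality; drop either one and
the objective is no longer a least squares one.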

Recall also that you have to find the factors as well as the mixing
parameters, and least squares makes this easy through matrix algebra.
Changing this would usually entail a gradient optimisation procedure
and the attendant problems of long training times and local minima.
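
As an illustration of "easy by matrix algebra", here is a sketch for
the truncated SVD case used in PCA and LSI; it is not a full FA fit.
It assumes NumPy, and the data matrix `X`, its size, and the choice
`k = 3` are all made up for the example. One decomposition yields both
the per-observation coordinates and the directions, with no iteration
and no local minima:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # hypothetical data: 50 observations, 8 variables
X = X - X.mean(axis=0)         # centre each variable
k = 3                          # number of dimensions to keep (arbitrary)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

scores = U[:, :k] * s[:k]      # coordinates of each observation, shape (50, k)
loadings = Vt[:k]              # directions in variable space, shape (k, 8)
X_hat = scores @ loadings      # best rank-k approximation in the least squares sense

# Eckart-Young: the squared error equals the sum of the discarded
# squared singular values.
sq_error = np.sum((X - X_hat) ** 2)
discarded = np.sum(s[k:] ** 2)
print(sq_error, discarded)
```

The two printed numbers agree up to rounding, which is the sense in
which the least squares problem is solved exactly by the decomposition.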

Consequently, it is unlikely that simply switching from least squares
to another tractable fitting function would help matters. However, it
would be interesting to find out whether gradient or simulated
annealing techniques with a multinomial residual perform better or
worse than standard FA, or can optimise a standard FA solution for a
particular task. I would imagine training times would be horrible,
though.
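
For concreteness, here is a rough sketch of one possible reading of a
"multinomial residual" fitted by gradient methods. Everything in it is
an assumption made for illustration: each row of a count matrix `X` is
treated as multinomial, with probabilities given by a softmax over a
low-rank product `F @ L`; the data are random; and the optimiser is
plain gradient descent rather than simulated annealing. It says
nothing about how such a model would compare with standard FA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(20, 8)).astype(float)  # hypothetical count data
n, m = X.shape
k = 2                                             # number of factors (arbitrary)
total = X.sum()
row_totals = X.sum(axis=1, keepdims=True)

F = 0.1 * rng.normal(size=(n, k))                 # factor scores
L = 0.1 * rng.normal(size=(k, m))                 # mixing parameters


def row_softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # stabilise the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


def neg_log_lik(F, L):
    # Multinomial negative log-likelihood per observed count,
    # dropping the constant multinomial coefficient.
    P = row_softmax(F @ L)
    return -np.sum(X * np.log(P)) / total


initial_loss = neg_log_lik(F, L)
step = 0.5
for _ in range(500):
    P = row_softmax(F @ L)
    G = (row_totals * P - X) / total              # gradient w.r.t. the logits F @ L
    # Simultaneous update: both right-hand sides use the old F and L.
    F, L = F - step * (G @ L.T), L - step * (F.T @ G)
final_loss = neg_log_lik(F, L)
print(initial_loss, final_loss)
```

Even on this toy problem the fit needs hundreds of passes over the
data where the least squares case needs a single decomposition, and
the objective is not convex in `F` and `L` jointly, so the answer can
depend on the starting point.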

Cheers,

Steve.