entropy optimization is closely related to maximum likelihood methods.
this can be seen very clearly in the case of a sequence of independent
trials where the estimated probability of the entire sequence is
q(x_1 ... x_n) = prod_i q(x_i)
where i am using q rather than p to emphasize that this is *our*
*estimate* of the likelihood, not the true probability.
taking the log (log is monotonic, and sums are easier to deal with
than products),
log q(x_1 ... x_n) = sum_i log q(x_i)
if the law of large numbers applies (let's just pretend we have
examined all the conditions), then this sum will be well approximated
by n times the expected value of log q, or
log q(x_1 ... x_n) \approx n sum_x p(x) log q(x)
where the sum is now over all possible events and we have introduced
p(x), which is of course TRUTH. it isn't hard (but there is a trick)
to show that this last sum is maximized exactly where p(x) = q(x),
i.e. where we have found the true probability of all events.
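a quick numerical check of both claims (the three-symbol distribution
and the candidate q's below are toy values of my own, not anything from
this thread): the empirical average of log q tracks sum_x p(x) log q(x),
and that expectation is largest when q = p.

```python
import math
import random

random.seed(0)

# a made-up "true" distribution p over three symbols
symbols = ["a", "b", "c"]
p = [0.5, 0.3, 0.2]

# draw n independent trials from p
n = 100_000
sample = random.choices(symbols, weights=p, k=n)
idx = {s: j for j, s in enumerate(symbols)}

def avg_log_q(q):
    # (1/n) * sum_i log q(x_i): the empirical average log likelihood
    return sum(math.log(q[idx[x]]) for x in sample) / n

def expected_log_q(q):
    # sum_x p(x) log q(x): the large-n limit promised by the LLN
    return sum(pi * math.log(qi) for pi, qi in zip(p, q))

# the empirical average sits close to the expectation ...
q_guess = [0.4, 0.4, 0.2]
print(avg_log_q(q_guess), expected_log_q(q_guess))

# ... and among candidate models, the expectation is largest at q = p
candidates = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2],
              [1 / 3, 1 / 3, 1 / 3], [0.6, 0.2, 0.2]]
best = max(candidates, key=expected_log_q)
```

swapping in other candidate q's never beats q = p, which is the point
of the trick mentioned above.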
but this last expression is just -n times the cross entropy
between p and q. thus, cross entropy minimization is the same as
likelihood maximization. it is also the same as optimizing file
compression for the same reasons (as steve finch mentioned).
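to make the compression connection concrete (the toy alphabet is
invented for illustration): an ideal code built from a model q spends
about -log2 q(x) bits on symbol x, so the expected bits per symbol is
exactly the cross entropy in base 2, and a wrong q always costs extra
bits.

```python
import math

# invented toy alphabet with "true" symbol probabilities p
p = {"a": 0.5, "b": 0.25, "c": 0.25}

def cross_entropy_bits(p, q):
    # H(p, q) = -sum_x p(x) log2 q(x): expected bits per symbol when
    # the code is optimized for q but the data actually follow p
    return -sum(p[x] * math.log2(q[x]) for x in p)

# coding with the true model achieves the entropy of p (1.5 bits here) ...
h = cross_entropy_bits(p, p)
# ... while any other model costs extra bits (the KL divergence)
q = {"a": 0.25, "b": 0.5, "c": 0.25}
print(h, cross_entropy_bits(p, q))
```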
> I wonder how they fit into the picture.
i hope that this helps.
the problem with applying these techniques to something like mixture
distributions is that you have to use some sort of iterative
optimization method to solve for q. for simple things like a normal
distribution, or the multinomial, or Markov models, or the Poisson,
the solution for the parameters is straightforward. but with a mixture
distribution, the solution isn't so simple and may, in fact, be
intractable for one reason or another.
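here is the contrast in miniature (all numbers invented): the poisson
MLE is a one-liner, while even a two-coin binomial mixture already
needs an iterative EM loop. this is only a sketch of EM, not anybody's
production code.

```python
import math
import random

random.seed(1)

# closed-form case: the poisson MLE is just the sample mean (toy data)
data = [2, 4, 3, 5, 1, 3]
lam_hat = sum(data) / len(data)   # no iteration needed

# mixture case: each trial is 10 flips of one of two biased coins
# chosen at random; recovering the biases takes an iterative EM loop
true_ps, flips = (0.8, 0.3), 10
trials = []
for _ in range(500):
    coin = random.choice(true_ps)          # pick a coin once per trial
    trials.append(sum(random.random() < coin for _ in range(flips)))

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# EM: alternate soft assignments (E-step) and weighted MLEs (M-step)
p1, p2, w = 0.6, 0.4, 0.5
for _ in range(200):
    # E-step: responsibility of coin 1 for each trial
    r = []
    for k in trials:
        a = w * binom_pmf(k, flips, p1)
        b = (1 - w) * binom_pmf(k, flips, p2)
        r.append(a / (a + b))
    # M-step: reweighted closed-form updates
    w = sum(r) / len(r)
    p1 = sum(ri * k for ri, k in zip(r, trials)) / (flips * sum(r))
    p2 = sum((1 - ri) * k for ri, k in zip(r, trials)) / (
        flips * (len(r) - sum(r)))

print(lam_hat, sorted([p1, p2]))
```

note that each M-step reuses the easy closed-form estimators, just
reweighted by the soft assignments -- that is all EM is.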
the key thing about factor analysis is that if you assume that the
distribution in question is a mixture of normals, then you can
use some astoundingly clever linear algebraic algorithms (thank you
mr. lanczos) to find enough of the solution in a short time to be very
useful. it is quite possible to do something similar where there are
just a few factors and just a few symbols being generated by the
mixture of multinomials (my early tests used two factors and 20
symbols), but when you expand this to dozens or hundreds of factors
and tens of thousands of symbols with radically different
probabilities, the algorithmics become difficult.
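to give the flavor of the iterative linear algebra (the tiny
term-document count matrix below is invented, and power iteration is
only the simplest cousin of the lanczos methods a real truncated-SVD
code would use):

```python
import math
import random

random.seed(2)

# invented toy term-document count matrix (rows = terms, cols = docs)
A = [
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 1, 0],
    [0, 3, 0, 2],
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# power iteration on A^T A: repeatedly multiply and renormalize; the
# iterate converges to the leading right singular vector of A
At = transpose(A)
v = [random.random() for _ in A[0]]
for _ in range(100):
    v = matvec(At, matvec(A, v))
    nv = norm(v)
    v = [x / nv for x in v]

sigma1 = norm(matvec(A, v))   # leading singular value of A
print(sigma1)
```

lanczos methods get many singular triples at once from the same
matrix-vector products, which is why they scale to the big sparse
matrices text analysis produces.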
> I'm thinking in particular of two papers:
> ...
i wish i could actually look at these right away, but i am far from a
real library right now. my guess is that they are dealing with a
relatively small problem compared to text analysis, but guessing in a
vacuum is clearly a pretty dangerous occupation. i would welcome
corrections if anybody were to post a summary or abstract (or even
direct me to softcopy of the articles in question).