In this chapter, we discuss the extension of the EM (Expectation Maximization) algorithm for joint density estimation to conditional density estimation and derive the resulting CEM (Conditional Expectation Maximization) algorithm. The machine learning system that is central to the Action-Reaction Learning paradigm requires conditional densities to model interaction. Thus, the optimization techniques described in the preceding chapter will be called upon to formally derive the necessary algorithm. The use of CEM is demonstrated for a specific probabilistic model, a conditioned mixture of Gaussians, and implementation and machine learning issues are discussed.
In joint density estimation, the tried and true EM (Expectation Maximization) algorithm proceeds by maximizing likelihood over a training set. By introducing the concept of missing data, the algorithm breaks a complex density optimization into a two-step iteration. The missing data components are estimated in the E-step, and then a far simpler maximization over the complete data, the M-step, is performed. This is in fact identical to the iteration described earlier for General Bound Maximization: the E-step computes a lower bound on the likelihood (via Jensen's inequality) and the M-step maximizes that bound. Iterating this process is guaranteed to converge to a local maximum of likelihood.
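Although the thesis ultimately works with a conditioned mixture of Gaussians, the two-step structure of EM can be illustrated with a minimal sketch for an ordinary one-dimensional Gaussian mixture. This is only an illustration under assumed notation (the function name em_gmm and its parameters are hypothetical and not part of the original text):

```python
import numpy as np

def em_gmm(x, n_components=2, n_iters=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture: the E-step forms the Jensen
    bound via posterior responsibilities, the M-step maximizes it in closed form."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guesses for mixing weights, means, and variances.
    pi = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, n_components, replace=False)
    var = np.full(n_components, np.var(x))

    for _ in range(n_iters):
        # E-step: responsibilities r[i, m] = p(m | x_i, current model).
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form maximization of the bound (complete-data likelihood).
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    return pi, mu, var

# Example: data drawn from two Gaussians; EM recovers approximate parameters.
data = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm(data))
```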
Recall the maximum likelihood joint density estimation (ML) case in Equation 5.2. Imagine we are faced with such a joint density optimization task and need to compute an optimal model $\Theta^*$ over sets $\mathcal{X}$ and $\mathcal{Y}$ of $N$ points as in Equation 7.1. The EM algorithm is an iterative optimization which begins seeded at an initial starting point in the model space (i.e. $\Theta^0$). Instead of maximizing the expression in Equation 7.1 directly, one can maximize its iterative form in Equation 7.2, where $\Theta^t$ is the algorithm's estimate of a better model given the previous estimate $\Theta^{t-1}$.
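Equations 7.1 and 7.2 are not reproduced in this excerpt; in the standard maximum likelihood and EM setting they take forms along the following lines (the exact notation here is an assumption based on the surrounding discussion):
\[
\Theta^* \;=\; \arg\max_{\Theta}\; l(\Theta)
\;=\; \arg\max_{\Theta}\; \sum_{i=1}^{N} \log p(\mathbf{x}_i, \mathbf{y}_i \,|\, \Theta),
\]
and its iterative (incremental) form,
\[
\Delta l \;=\; l(\Theta^{t}) - l(\Theta^{t-1})
\;=\; \sum_{i=1}^{N} \log \frac{p(\mathbf{x}_i, \mathbf{y}_i \,|\, \Theta^{t})}{p(\mathbf{x}_i, \mathbf{y}_i \,|\, \Theta^{t-1})},
\]
which is maximized over $\Theta^{t}$ at each iteration.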
The density $p(\mathbf{x}, \mathbf{y} \,|\, \Theta)$ is in general not simple and is often better described by a discrete or continuous summation of simpler models. This is always true when data is missing. In other words, it is a mixture model as in Equation 7.3. The summation is performed over what are typically called the 'missing components' ($\mathcal{M}$ or $m$), which are individually more tractable. Thus, the original, rather complex pdf can be considered a linear integration or summation of simpler pdf models.
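Although Equation 7.3 itself is not shown here, the mixture decomposition it describes is the standard one (again, the notation is assumed rather than quoted):
\[
p(\mathbf{x}, \mathbf{y} \,|\, \Theta) \;=\; \sum_{m} p(m, \mathbf{x}, \mathbf{y} \,|\, \Theta)
\qquad \text{or} \qquad
p(\mathbf{x}, \mathbf{y} \,|\, \Theta) \;=\; \int p(m, \mathbf{x}, \mathbf{y} \,|\, \Theta)\, dm,
\]
where each component $p(m, \mathbf{x}, \mathbf{y} \,|\, \Theta)$ is individually simple (for example, a single Gaussian).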