An important insight can be gleaned from the deterministically annealed joint density estimation approaches and their recent success. Although the annealing approach hasn't quite become a standard mechanism in the learning community, it has been used more and more in very straightforward estimation problems (i.e. Gaussian mixture models mostly). Simply put, annealing in EM-type algorithms involves placing a distribution over the contact point (i.e. current operating point) x*. Thus, the algorithm starts with a relatively flat distribution (corresponding to high temperature) indicating that it does not know where to initialize its local maximum search. Eventually, as the optimization iterates, this distribution gets 'tightened' and narrows towards a more localized x* as the temperature is cooled and the algorithm behaves as a local optimizer. Thus, the optimization is less sensitive to the initialization point and the local maximum basin it happens to start in.
In the spirit of the above annealing process, we can imagine a
different type of metaphor. Consider the function we are optimizing
(i.e.
)
which is speckled with many local maxima. Now
consider a heavily smoothed version of
which we shall
call
(where t is proportional to the amount of
smoothing). By smoothing, imagine flattening out the function and its
many little local maxima by convolving it with, say, a Gaussian
kernel. Evidently, after enough convolutions or smoothings,
will start looking like a single maximum function and more global
maximization is possible. Thus, we can run a local optimization scheme
on
and get to its single global maximum. As we reduce
the degree of smoothness,
starts to look more and more
like the original function
and we will begin optimizing
the true function locally. However, the current operating point has
now been placed in the basin of a more global maximum. It is important
to note that here the term 'global maximum' is somewhat of a misnomer
since a small 'peaky' global maximum might get ignored in favor of a
wider maximum. This is due to the width of the effective basin around
the sub-optimal maximum which has an effect on how dominant it remains
after smoothing. The higher a maximum is and the more dominant and
smooth the basin that surrounds it, the more likely it will remain
after heavy smoothing.
Actually, it could be argued that this peculiarity of the 'smoothing based annealing' here is an advantage rather than a disadvantage. It is well known that robustness (of a maximum) is often a critical feature of optimization and estimation. By robustness we adopt the following definition: a robust maximum is a high maximum which is somewhat insensitive to perturbations in the locus or the parameters. This is similar to the Bayes' risk argument. Thus, the annealing process described here should move the optimization algorithm into a wider basin which is typically more robust. A sharp, peaky true global maximum might be the result of some artifact in the function being optimized (i.e. like overfitting in estimation problems). This kind of abnormality might not be typical of the landscape of the function and therefore should be ignored in favor of a wider more robust maximum.
Of course in practice, explicit smoothing of the function
is a cumbersome task either numerically or analytically. So, what we
shall do is smooth the function implicitly. This is done by
artificially widening the parabolas that are used to bound. In
essence, we are paying more attention to wider parabolas in the
optimization. To accomplish this, every w term or W matrix is
taken to a power less than one (for higher temperatures, the power
factor is closer to zero). This is shown more explicitly in
Equation 6.26.
Of course, T is a temperature parameter which has to be varied using a pre-determined schedule. We will use the traditional exponential decay schedule which starts the temperature off near infinity and drops it towards T=1. At infinity, the parabolas are effectively flat lines. Thus, the hypothetical function they bound is also flat and has in fact 0 local minima. At T=1 the system reverts to the typical local maximization performance described earlier.
In Figure 6.11 some examples of the annealing are shown on combinations of sinusoids. Almost all the initialization points lead to convergence to the same global maximum. The exponential decay coefficient used was 0.4 and global convergence was obtained in about 5 iterations every time.
Effectively, the annealing technique considers local maxima as basins with a volume (in 1D this is just a height and width). The larger (wider and taller) basins will dominate as we smooth the function and force convergence towards a more global solution. Also note that deterministic annealing is quite different from simulated annealing. Only a few steps were required to achieve global convergence. In simulated annealing, random steps are taken to 'skip' out of local maxima in hopes of landing in a better basin. This is typically a less efficient search technique than a deterministic optimization approach.