Once again, the contact point for this parabola (i.e. the
)
is the value for the previous model
or
the mean of the gate before any iteration. Let us denote the previous
parameters of the gate as
and
.
To
facilitate the derivation, we can perform what is commonly called a
whitening operation on the coordinate system for each gate. For
each gate, we can consider that its previous parameters where
zero-mean and identity covariance. This is done by applying an affine
transformation to the
-space (i.e. shifting the data points
appropriately). The data is translated by the mean and
scaled by the inverse of the square root of the covariance
matrix. In theory, nothing changes except that each gate observes
affinely displaced data instead and it then seems to have a zero mean
and identity covariance as its previous operating point. This
whitening operation does not need to be done explicitly
(i.e. numerically) but helps simplify the following derivation for the
bound.
Thus, without loss of generality, we consider that each data point has
been appropriately whitened for each gate. The result is that
and
.
Using
this assumption, we can compactly solve for the parameters of the
bounding parabola (
). This is done by
meeting the three requirements (contact, tangentiality and
tightness). The results is shown in
Equation 7.23.
As shown earlier, if a globally valid bound is desired, the value of
wim must always be greater than the expression in
Equation 7.23 no matter what the value of
.
Thus we can find the minimum wim (i.e. wminim)
for which this is true and form a parabolic bound for the Q function
that is as 'tight' as possible. We shall now determine an efficient
way to compute wminim without searching the high dimensional
space of
each iteration.
Note in the expression for wim above that
is used
only in an inner product with itself (
)
or in
an inner product with the whitened data point
.
Thus, we can represent these two quantities
regardless of the dimensionality of the vector space using at most two
degrees of freedom: the magnitude of
and the
cosine of the angle between the vectors
and
.
Let us now denote the new variables as follows:
is
the magnitude of
,
is the magnitude of
and
is the cosine of the angle between the two
vectors. We then compactly rewrite the bound as in
Equation 7.24.
The wim is now lower bounded by the right hand side which is a
function of two free scalars
and
.
We wish to find
the maximum of the lower bound at some value of
and
.
By taking derivatives of the right hand side with respect to
and setting to 0, we find an extremum of the function at
.
However, we also note that
and so
.
If we set
we note that the this solution
is always greater than the
.
So,
the extremum we solved for was a minimum. Thus, the maximum
occurs at the ends of the interval so
.
Therefore, we
have the relationship
.
In other words, let
us define
.
The expression for wim then
simplifies into a function of 1 free variable (c) as in
Equation 7.25.
For each value of
we find the value of c which maximizes the
value of
.
This is done offline (taking a few seconds)
using Brent's method and the result is stored in a 1D lookup table
(LUT). The values are shown in Figure 7.3(a) where we
plot the max of the g function with respect to
.
Using this
table, we can immediately compute the minimum wminim by looking
up the corresponding value of
for the current data point and
subsequently compute the rest of the parabolic bound's parameters.
An example of a parabolic bound is shown in Figure 7.3(b) under a sub-component of the Qfunction. This sub-component corresponds to a single data point as we vary one gate's uni-variate mean. Many such parabolic bounds are computed (one for each data point and gate combination) and are summed to bound the Q function in its entirety.
For greater computational efficiency we compute the gradient of the
Q on the mean to get an idea of the direction of steepest
ascent. We can then know which general direction
will move
towards for the next iteration. Due to whitening,
starts off
at the origin and moves from there. Thus, we know if
is
moving close towards
or farther away from
it. Equivalently, we know if c will be positive or negative. The
above LUT (Figure 7.3(a)) returns the wminim which
is valid for all possible choices of c. However, if we know the
direction c is going to move along, we only need to consider half of
the possibilities in using Brent's method to find the max of
.
Thus, we obtain two LUTs for c positive and for cnegative as shown in Figure 7.4(a). This
typically gives us an even tighter bound since we only need to bound
the Q function in regions of the landscape where the locus of the
maximum could move to. Thus, we are effectively approximating the
landscape with two parabolas, which are cut at c=0. One is a bound
for the values at c<0 and one for
c>0. Figure 7.4(b) depicts this piece wise
parabolic bound on a sub-component of Q. The function being bounded
is represented with a continuous line. One parabola is composed of 'x'
symbols and is a valid bound to the right of zero. The other parabola
is composed of 'o' symbols and is valid to the left of zero. The
piece-wise bound is composed of these two parabolas spliced at the
origin (their continuations are shown as dotted lines for
visualization). Thus, if the gradient tells us that we will be moving
in a direction of increasing
,
we should compute the parabola
made of 'x' symbols. However, if we are moving to the left (because of
the sum total influence of the other bounds and functions), we should
use the parabola made of 'o' symbols which is a wider and tighter
bound.
To obtain a final answer for the update of the gate means
we
simply maximize the sum of parabolas which bounds the Q in its
entirety. For M gates and N data points, we obtain
parabolas each with its own (
). The
update is in Equation 7.26.