Bounding Vectors: The Gate Means

Next: Bounding Matrices: The Gate Up: CEM for Gaussian Mixture Previous: Bounding Scalars: The Gate

Bounding Vectors: The Gate Means

Upon close observation, the above technique of taking derivatives of Q and setting them to zero seems to be intractable for the case of the means. The resulting derivative contains transcendental functions which can not be easily isolated to obtain an update equation. What we can do at this point is bound the Q function specifically for the means to get a nicer expression which can be easily maximized. We introduce the previously discussed parabolic bounds for the Qfunction as in Equation 7.21. Or, equivalently, each pair of gate and data point is individually bounded by a parabola as in Equation 7.22 (with height parameter k_im, width parameter w_im and location parameter $\gamma_{im}$ ).

$\displaystyle \begin{array}{lll} Q(\Theta^t,\Theta^{(t-1)}) & = & \sum_{i=1}^N ... ... \sum_{m=1}^M k_{im} - w_{im} \Vert \mu_x^m-\gamma_{im} \Vert^2 \\ \end{array}$

(7.21)

$\displaystyle \begin{array}{ccc} Q(\Theta^t,\Theta^{(t-1)})_{im} & \geq & k_{im... ...: \:\:\:\:\: \Longrightarrow {\rm equality \:\: at \:\:} \mu_x^{m*} \end{array}$

(7.22)

Once again, the contact point for this parabola (i.e. the $\mu_x^{m*}$ ) is the value for the previous model $\Theta^{(t-1)}$ or the mean of the gate before any iteration. Let us denote the previous parameters of the gate as $\mu_x^{m*}$ and $\Sigma_{xx}^{m*}$ . To facilitate the derivation, we can perform what is commonly called a whitening operation on the coordinate system for each gate. For each gate, we can consider that its previous parameters where zero-mean and identity covariance. This is done by applying an affine transformation to the ${\bf x}$ -space (i.e. shifting the data points ${\bf x}_i$ appropriately). The data is translated by the mean and scaled by the inverse of the square root of the covariance matrix. In theory, nothing changes except that each gate observes affinely displaced data instead and it then seems to have a zero mean and identity covariance as its previous operating point. This whitening operation does not need to be done explicitly (i.e. numerically) but helps simplify the following derivation for the bound.

Thus, without loss of generality, we consider that each data point has been appropriately whitened for each gate. The result is that $\mu_{x}^{m*} := 0$ and $\Sigma_{xx}^m = \Sigma_{xx}^{m*} = I$ . Using this assumption, we can compactly solve for the parameters of the bounding parabola ( $k_{im},w_{im},\gamma_{im}$ ). This is done by meeting the three requirements (contact, tangentiality and tightness). The results is shown in Equation 7.23.

$\displaystyle \begin{array}{lll} k_{im} & = & {\hat h}_{im} ( \log {\cal N} ({\... ...}_i}^T{\mu_x^m} } {{\mu_x^m}^T {\mu_x^m}} + \frac{{\hat h}_{im}}{2} \end{array}$

(7.23)

As shown earlier, if a globally valid bound is desired, the value of w_im must always be greater than the expression in Equation 7.23 no matter what the value of $\mu_x^m$ . Thus we can find the minimum w_im (i.e. wmin_im) for which this is true and form a parabolic bound for the Q function that is as 'tight' as possible. We shall now determine an efficient way to compute wmin_im without searching the high dimensional space of $\mu_x^m$ each iteration.

Note in the expression for w_im above that $\mu_x^m$ is used only in an inner product with itself ( ${\mu_x^m}^T {\mu_x^m}$ ) or in an inner product with the whitened data point ${{\bf x}_i}^T{\mu_x^m}$ . Thus, we can represent these two quantities regardless of the dimensionality of the vector space using at most two degrees of freedom: the magnitude of $\mu_x^m$ and the cosine of the angle between the vectors $\mu_x^m$ and ${\bf x}_i$ . Let us now denote the new variables as follows: $\lambda$ is the magnitude of $\mu_x^m$ , $\rho$ is the magnitude of ${\bf x}_i$ and $\zeta = \cos(\phi)$ is the cosine of the angle between the two vectors. We then compactly rewrite the bound as in Equation 7.24.

$\displaystyle \begin{array}{lll} w_{im} & \geq & r_i \alpha_m \frac{ e^{-\frac{... ...zeta\lambda\rho -1 \right ) } {\lambda^2} + \frac{{\hat h}_{im}}{2} \end{array}$

(7.24)

The w_im is now lower bounded by the right hand side which is a function of two free scalars $\lambda$ and $\zeta$ . We wish to find the maximum of the lower bound at some value of $\lambda$ and $\zeta$ . By taking derivatives of the right hand side with respect to $\zeta$ and setting to 0, we find an extremum of the function at $\zeta = \frac{\lambda}{2 \rho}$ . However, we also note that $\zeta = \cos(\phi)$ and so $\zeta \epsilon [-1,1]$ . If we set $\zeta = 1$ we note that the this solution is always greater than the $\zeta = \frac{\lambda}{2 \rho}$ . So, the extremum we solved for was a minimum. Thus, the maximum occurs at the ends of the interval so $\zeta = \pm 1$ . Therefore, we have the relationship $\mu_x^m \propto {\bf x}_i$ . In other words, let us define $\mu_x^m = c{\bf x}_i$ . The expression for w_im then simplifies into a function of 1 free variable (c) as in Equation 7.25.

$\displaystyle \begin{array}{lll} w_{im} & \geq & r_i \alpha_m e^{-\frac{1}{2} \... ...ac{1}{2} \rho^2} \max \{ g(c,\rho) \} + \frac{{\hat h}_{im}}{2} \\ \end{array}$

(7.25)

For each value of $\rho$ we find the value of c which maximizes the value of $g(c,\rho)$ . This is done offline (taking a few seconds) using Brent's method and the result is stored in a 1D lookup table (LUT). The values are shown in Figure 7.3(a) where we plot the max of the g function with respect to $\rho$ . Using this table, we can immediately compute the minimum wmin_im by looking up the corresponding value of $\rho$ for the current data point and subsequently compute the rest of the parabolic bound's parameters.

**Figure 7.3:** Look-Up Table and Example Mean Parabolic Bound
$\begin{figure}\center \begin{tabular}[b]{cc} \epsfxsize=2.4in \epsfbox{LUTbot... ... Parabolic Width LUT & (b) 1D Mean Bounding Example \end{tabular} \end{figure}$

An example of a parabolic bound is shown in Figure 7.3(b) under a sub-component of the Qfunction. This sub-component corresponds to a single data point as we vary one gate's uni-variate mean. Many such parabolic bounds are computed (one for each data point and gate combination) and are summed to bound the Q function in its entirety.

Parabolic Piece-Wise Bound
For greater computational efficiency we compute the gradient of the Q on the mean to get an idea of the direction of steepest ascent. We can then know which general direction $\mu_x^m$ will move towards for the next iteration. Due to whitening, $\mu_x^m$ starts off at the origin and moves from there. Thus, we know if $\mu_x^m$ is moving close towards ${\bf x}_i$ or farther away from it. Equivalently, we know if c will be positive or negative. The above LUT (Figure 7.3(a)) returns the wmin_im which is valid for all possible choices of c. However, if we know the direction c is going to move along, we only need to consider half of the possibilities in using Brent's method to find the max of $g(c,\rho)$ . Thus, we obtain two LUTs for c positive and for cnegative as shown in Figure 7.4(a). This typically gives us an even tighter bound since we only need to bound the Q function in regions of the landscape where the locus of the maximum could move to. Thus, we are effectively approximating the landscape with two parabolas, which are cut at c=0. One is a bound for the values at c<0 and one for c>0. Figure 7.4(b) depicts this piece wise parabolic bound on a sub-component of Q. The function being bounded is represented with a continuous line. One parabola is composed of 'x' symbols and is a valid bound to the right of zero. The other parabola is composed of 'o' symbols and is valid to the left of zero. The piece-wise bound is composed of these two parabolas spliced at the origin (their continuations are shown as dotted lines for visualization). Thus, if the gradient tells us that we will be moving in a direction of increasing $\mu$ , we should compute the parabola made of 'x' symbols. However, if we are moving to the left (because of the sum total influence of the other bounds and functions), we should use the parabola made of 'o' symbols which is a wider and tighter bound.

**Figure 7.4:** Look-Up Table and Example Mean Piece-Wise Parabolic Bound
$\begin{figure}\center \begin{tabular}[b]{cc} \epsfxsize=2.4in \epsfbox{LUTbot... ... Parabolic Width LUT & (b) 1D Mean Bounding Example \end{tabular} \end{figure}$

To obtain a final answer for the update of the gate means $\mu_x^m$ we simply maximize the sum of parabolas which bounds the Q in its entirety. For M gates and N data points, we obtain $M \times N$ parabolas each with its own ( $wmin_{im},k_{im},\gamma_{im}$ ). The update is in Equation 7.26.

$\displaystyle \begin{array}{lll} \mu_x^m & := & \frac{ \sum_{i=1}^N wmin_{im} \gamma_{im}}{\sum_{i=1}^N wmin_{im}} \end{array}$

(7.26)

Next: Bounding Matrices: The Gate Up: CEM for Gaussian Mixture Previous: Bounding Scalars: The Gate

Tony Jebara
1999-09-15