We will begin with a relatively compact perceptual system that will be used for gesture behaviour learning. A tracking system follows the head and hands as three objects (head, left hand and right hand), each represented as a 2D ellipsoidal blob with 5 parameters. With these features alone, it is possible to engage in simple gestural games and interactions.
The vision algorithm begins by forming a probabilistic model of skin-colored regions [1, 55, 52]. During an offline process, a variety of skin-colored pixels are selected manually, forming a distribution in RGB space. This distribution is described by a probability density function (pdf) which is used to estimate the likelihood that any subsequent pixel is skin colored. The pdf used is a 3D Gaussian mixture model, as shown in Equation 3.1 (typically with M = 3 individual Gaussians).
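For reference, a mixture-of-Gaussians pdf of this general form can be written as below; the notation is generic and chosen here for illustration, so the symbols may differ from those of Equation 3.1.

$$
p(x) \;=\; \sum_{m=1}^{M} p(m)\, \mathcal{N}(x;\,\mu_m,\Sigma_m)
\;=\; \sum_{m=1}^{M} \frac{p(m)}{(2\pi)^{3/2}\,|\Sigma_m|^{1/2}}
\exp\!\left(-\tfrac{1}{2}(x-\mu_m)^{T}\Sigma_m^{-1}(x-\mu_m)\right),
$$

where $x$ is a pixel's 3D RGB value, $p(m)$ are the mixing weights, and $\mu_m$, $\Sigma_m$ are the mean and covariance of the $m$-th Gaussian (with $M = 3$ here).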
The parameters of the pdf (the mixing weights, means and covariances of the individual Gaussians) are estimated using the Expectation Maximization (EM) algorithm [15] to maximize the likelihood of the training RGB skin samples. The resulting pdf forms a classifier through which every pixel in an image is filtered: if its probability is above a threshold, the pixel belongs to the skin class; otherwise, it is considered non-skin. Figures 3.1(a) and (d) depict the classification process.
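A minimal sketch of this offline training and per-pixel classification step is given below. It uses scikit-learn's EM-based GaussianMixture in place of the system's own implementation, and the function names and threshold value are illustrative assumptions rather than details from the original.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_skin_model(skin_samples):
    """Fit a 3-component Gaussian mixture (M = 3) to manually selected
    skin-colored RGB samples of shape (N, 3). Offline step."""
    gmm = GaussianMixture(n_components=3, covariance_type='full')
    gmm.fit(skin_samples)
    return gmm

def skin_mask(gmm, image, log_threshold=-12.0):
    """Classify every pixel of an H x W x 3 image as skin (True) or non-skin (False).
    The threshold is a placeholder; the original value is not given in the text."""
    pixels = image.reshape(-1, 3).astype(float)
    log_lik = gmm.score_samples(pixels)              # log p(rgb) under the mixture
    return (log_lik > log_threshold).reshape(image.shape[:2])
```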
To clean up some of the spurious pixels misclassified as skin, a connected components algorithm is run on the binary skin map to find the top 4 regions in the image; see Figure 3.1(b). This increases the robustness of the EM-based blob tracking. We choose to process the top 4 regions because the face is sometimes accidentally split into two regions by the connected components algorithm. Conversely, if the head and hands are touching, there may be only one non-spurious connected region, as in Figure 3.1(e).
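The cleanup step could be sketched as follows, using scipy.ndimage for illustration; the default connectivity and the helper name keep_top_regions are assumptions, not details of the original implementation.

```python
import numpy as np
from scipy import ndimage

def keep_top_regions(mask, k=4):
    """Keep only the k largest connected skin regions (k = 4 in the text)."""
    labels, n = ndimage.label(mask)                  # label connected components
    if n <= k:
        return mask
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.argsort(sizes)[::-1][:k] + 1           # label ids of the k largest regions
    return np.isin(labels, keep)
```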
Since we are always interested in tracking three objects (the head and two hands) even if they touch and form a single connected region, it is necessary to invoke a more sophisticated pixel grouping technique. Once again, we use the EM algorithm, this time to find 3 Gaussians that maximize the likelihood of the spatially distributed (in xy) skin pixels. Note that the implementation of the EM algorithm here has been heavily optimized to require less than 50 ms per iteration for an image of size 320 by 240 pixels. This Gaussian mixture model is shown in Equation 3.2.
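A sketch of this spatial grouping step follows, again using scikit-learn's EM implementation rather than the heavily optimized real-time version described above; warm-starting the means from the previous frame's blobs is a natural choice for tracking, but the exact initialization scheme is an assumption here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_blobs(mask, prev_means=None):
    """Fit 3 spatial Gaussians (head, left hand, right hand) to the xy
    coordinates of the skin pixels in a binary mask."""
    ys, xs = np.nonzero(mask)
    xy = np.column_stack([xs, ys]).astype(float)
    gmm = GaussianMixture(n_components=3, covariance_type='full',
                          means_init=prev_means)     # warm start from the previous frame
    gmm.fit(xy)                                      # EM over the spatial skin distribution
    return gmm.means_, gmm.covariances_              # (3, 2) means, (3, 2, 2) covariances
```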
The update or estimation of the parameters is done in real-time by iteratively maximizing the likelihood over each image. The resulting 3 Gaussians have 5 parameters each (from the 2D mean and the 2D symmetric covariance matrix) and are shown rendered on the image in Figures 3.1(c) and (f). The covariance $\Sigma$ is actually represented in terms of its square root matrix $\Gamma$, where $\Sigma = \Gamma \Gamma^{T}$. Like $\Sigma$, the matrix $\Gamma$ has 3 free parameters ($\Gamma_{11}$, $\Gamma_{12}$, $\Gamma_{22}$); however, these latter variables are closer to the dynamic range of the 2D blob means and are therefore preferred for representation. The 5 parameters describing the head and hands are based on first and second order statistics which can be reliably estimated from the data in real-time. In addition, they are well behaved and do not exhibit wild non-linearities, so they are adequate for temporal modeling. More complex measurements could be added in the future, but these would typically be less stable to estimate and might have non-linear phenomena associated with them. The 15 recovered parameters from a single person are shown as a well behaved, smooth time series in Figure 3.2. These define the 3 Gaussian blobs (head, left hand and right hand).
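A conversion from a fitted 2D Gaussian to this 5-parameter blob description might look like the sketch below; the use of scipy.linalg.sqrtm and the ordering of the output vector are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import sqrtm

def blob_parameters(mean, cov):
    """5-parameter blob description: the 2D mean plus the 3 free entries of the
    symmetric square root of the 2x2 covariance."""
    g = np.real(sqrtm(cov))          # symmetric matrix satisfying g @ g == cov
    return np.array([mean[0], mean[1], g[0, 0], g[0, 1], g[1, 1]])
```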
The parameters of the blobs are also processed in real-time via a Kalman Filter (KF) which smoothes and predicts their values for the next frame. The KF model assumes constant velocity to predict the next observation and maintain tracking.
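A minimal constant-velocity Kalman filter over a blob's 5 parameters could be sketched as follows; the state layout and the process and measurement noise magnitudes are placeholder assumptions, since they are not specified here.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for one blob's 5 parameters.
    State is [parameters, parameter velocities]; noise levels are placeholders."""

    def __init__(self, dim=5, dt=1.0, q=1e-2, r=1e-1):
        self.F = np.eye(2 * dim)
        self.F[:dim, dim:] = dt * np.eye(dim)          # x_{t+1} = x_t + dt * v_t
        self.H = np.hstack([np.eye(dim), np.zeros((dim, dim))])
        self.Q = q * np.eye(2 * dim)                   # process noise (assumed)
        self.R = r * np.eye(dim)                       # measurement noise (assumed)
        self.x = np.zeros(2 * dim)
        self.P = np.eye(2 * dim)

    def predict(self):
        """Propagate the state and return the predicted observation for the next frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x

    def update(self, z):
        """Correct the state with the 5 measured blob parameters z (smoothing step)."""
        y = z - self.H @ self.x                        # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
```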