One real-time application developed by Jebara and Pentland [32] is the automatic 3D face tracking system shown in Figure 11. An automatic initialization module finds the face, locating the eye, nose and mouth coordinates in under a second. These coordinates are then used to initialize 8 normalized correlation tracking squares (i.e. sum-squared distance minimization [22]) on the face.
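To illustrate the per-square tracking step, the following is a minimal sketch of template tracking by sum-squared-distance minimization over a small search window. The function and parameter names (e.g. `search_radius`) are illustrative and not taken from the original system; the returned residual doubles as the per-tracker error level referred to below.

```python
import numpy as np

def ssd_track(frame, template, center, search_radius=8):
    """Locate `template` in `frame` near `center` by minimizing the
    sum-squared distance (SSD) over a small search window.

    `frame` and `template` are 2D grayscale arrays; `center` is the
    (row, col) of the template's last known position.
    """
    th, tw = template.shape
    tmpl = template.astype(float)
    r0, c0 = int(center[0]) - th // 2, int(center[1]) - tw // 2
    best, best_pos = np.inf, center
    for dr in range(-search_radius, search_radius + 1):
        for dc in range(-search_radius, search_radius + 1):
            r, c = r0 + dr, c0 + dc
            if r < 0 or c < 0:
                continue  # window fell off the top/left of the image
            patch = frame[r:r + th, c:c + tw]
            if patch.shape != tmpl.shape:
                continue  # window fell off the bottom/right of the image
            d = np.sum((patch.astype(float) - tmpl) ** 2)
            if d < best:
                best, best_pos = d, (r + th // 2, c + tw // 2)
    return best_pos, best  # position and residual (tracker error level)
```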
Each square can translate, rotate and scale, and so is equivalent to two 2D point features (Figure 12(a)-(c)). The resulting 16 features are fed into the SfM algorithm, which recovers 16 rigid 3D points. This estimated rigid 3D model is then reprojected onto the image plane to generate a set of 16 rigidly constrained 2D points, which are used to relocate the individual trackers for the next frame. The trackers estimate an instantaneous trajectory but are not permitted to follow it directly (as they would in a nearest-neighbor tracking framework). Instead, this estimate is passed to the SfM algorithm, which computes the corresponding rigid trajectory and repositions the trackers along this rigid 'path' for the next frame in the sequence. Thus, instead of letting each square track independently, the SfM couples them all, forcing them to behave as if they were glued onto a rigid 3D body (i.e. a 3D face). Furthermore, each of the 8 trackers outputs an error level which can be used in the measurement noise covariance matrix R of the SfM Kalman filter to adaptively weight good features more heavily than bad ones in the 3D estimates. Feature errors are mapped into a Gaussian uncertainty in localization by an initial perturbation analysis which computes each tracker's error sensitivity under small displacements.
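The core of the feedback loop can be sketched as a confidence-weighted rigid fit followed by reprojection. The snippet below is a simplified stand-in for the paper's Kalman formulation: it fits a weak-perspective pose of the rigid 3D model to the measured tracker positions, weighting each feature by its confidence (the role the R matrix plays in the actual filter), and returns the rigidly constrained 2D points used to reseed the trackers. All names here are illustrative, not from [32].

```python
import numpy as np

def rigid_reproject(X, p_meas, w):
    """One cycle of the rigidity feedback, as a sketch.

    X      : (N, 3) rigid 3D model points
    p_meas : (N, 2) measured 2D tracker outputs
    w      : (N,)  per-tracker confidences (inverse of error level)
    Returns the (N, 2) rigidly constrained 2D points.
    """
    w = w / w.sum()
    p_mean = (w[:, None] * p_meas).sum(0)
    Xc = X - (w[:, None] * X).sum(0)          # remove weighted centroids
    pc = p_meas - p_mean
    # Weighted least squares for the 2x3 projection M with pc ~ Xc @ M.T
    sw = np.sqrt(w)[:, None]
    M = np.linalg.lstsq(sw * Xc, sw * pc, rcond=None)[0].T
    # Enforce rigidity: snap M to equal-scale orthonormal rows via SVD
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_rigid = s.mean() * (U @ Vt)
    return Xc @ M_rigid.T + p_mean            # reseed the trackers here

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 3))                      # rigid face model
    p = X[:, :2] + 0.01 * rng.normal(size=(16, 2))    # noisy frontal view
    w = np.ones(16); w[3] = 0.1                       # one bad tracker
    print(rigid_reproject(X, p, w))
```

The low-confidence tracker contributes little to the pose fit, yet still receives a rigidly consistent position, which is how occluded or distracted trackers get "pulled along" by the others.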
The end result is a much more stable tracking framework (operating at 30Hz). If some trackers are occluded or fail, the others pull them along via the imposed rigidity constraint. The feedback from the adaptive Kalman filter maintains a sense of 3D structure and enforces a global collaboration between the separate 2D trackers. Thus, tracking remains stable for minutes, rather than the seconds achieved when no SfM feedback is used. Figure 12(d) depicts the stability under occlusion, where a mouth and an eye tracker are distracted by the presence of the user's finger. Similarly, in Figure 12(e), the mouth tracker is distracted by deformation (smiling): the mouth no longer resembles the closed mouth on which the template was initialized. Tracking remains stable under these conditions due to the feedback loop.
The algorithm also re-initializes when it detects that it has lost the face, as in Figure 13. This detection is performed via the so-called ``Distance-from-Face-Space'' calculation, which essentially computes the probability of a face pixel image with respect to a constrained Gaussian distribution [40]. While multiple real and synthetic tests show very strong convergence, we have also used the system extensively in the above real-time application settings, where it behaved consistently and reliably.
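A common realization of this test measures the reconstruction residual of a candidate patch against a PCA (eigenface) subspace; a minimal sketch follows, assuming `eigenfaces` has orthonormal rows and with `THRESHOLD` as an illustrative placeholder rather than a value from [40].

```python
import numpy as np

def distance_from_face_space(patch, mean_face, eigenfaces):
    """``Distance-from-Face-Space'' (DFFS) as PCA reconstruction error.

    `patch` is a flattened candidate face image (d,), `mean_face` the
    mean of the training faces (d,), and `eigenfaces` a (k, d) matrix
    of principal components with orthonormal rows. A large residual
    means the patch lies far from the face subspace, i.e. it is
    unlikely under the constrained Gaussian face model.
    """
    x = patch.astype(float) - mean_face
    coeffs = eigenfaces @ x                 # project into face space
    residual = x - eigenfaces.T @ coeffs    # component outside the subspace
    return np.linalg.norm(residual)

# Illustrative usage: re-initialize when DFFS exceeds a tuned threshold
# if distance_from_face_space(patch, mean_face, eigenfaces) > THRESHOLD:
#     run_automatic_initialization()        # hypothetical re-init hook
```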