The set of input images in Figure illustrates some of the variations in the intensity image that detection must overcome to properly localize the face. These variations must be compensated for so that only the data relevant to recognition is retained. Furthermore, note that these variations are not mutually exclusive and can occur in any combination.
We propose a hierarchical detection method which can quickly and reliably converge to a localization of the face amidst a wide range of external visual stimuli and variation. To maximize efficiency, simple and inexpensive computations precede expensive ones in this hierarchy. The initial, diffuse computations over a large search space narrow the region in which the more localized, higher-precision operations that follow are applied. In other words, the results of preliminary detections guide subsequent operations in a feed-forward manner, restricting them to only the significant parts of the image. This reduces the probability of error since the later detection steps are not distracted by irrelevant image data. Furthermore, more robust operations precede more sensitive ones, since the sensitive operations need adequate initialization from the previous stages to avoid failure.
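The feed-forward principle can be summarized by the following minimal sketch. The detectors here are hypothetical placeholders (not the operators developed in Chapter 2); only the control flow of cheap-pass-then-expensive-pass is being illustrated.

```python
def hierarchical_search(image, coarse_detector, fine_detector):
    """Apply the cheap detector over the whole image, then restrict the
    expensive detector to the candidate windows it proposes."""
    results = []
    for window in coarse_detector(image):         # fast pass, large search space
        detection = fine_detector(image, window)  # slow pass, small search space
        if detection is not None:
            results.append(detection)
    return results
```

The expensive detector is never run outside the windows proposed by the cheap one, which is precisely how the preliminary stages keep the later, sensitive stages from being distracted by irrelevant image data.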
Figure displays the sequence of search steps for face detection. We begin by searching for possible face or head-like blobs in the image. The detected blob candidates are examined to obtain an approximation of their contours. If these exhibit a face-like contour, their interior is scanned for the presence of eyes. Each possible pair of eyes detected in the face is then examined, one at a time, to see if it lies in an appropriate position with respect to the facial contour. If it does, we search for a mouth in the region isolated by the facial contour and the positions of the detected eyes. Once a mouth has been detected, the region to be searched for a nose is better isolated and we determine the nose position. Lastly, these facial coordinates are used to locate the irises within the eye regions more accurately, if they are visible. The final result is a set of geometrical coordinates that specify the position, scale and pose of all possible faces in the image. The last few stages, which use the facial coordinates to normalize the image and perform recognition, will be discussed in Chapter 4. Note the many feedback loops which propagate data upwards in the hierarchy. These are used by the later stages to report failure to the preliminary stages so that appropriate action can be taken. For instance, if we fail to find a mouth, the pair of possible eyes that guided the search for the mouth was not a valid one, and another candidate pair should be considered.
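The control flow of this sequence, with one of the feedback loops made explicit, might be sketched as follows. The stage operators are passed in as a bundle of hypothetical callables (s.find_blobs, s.find_eye_pairs, and so on); only the ordering of the stages and the backtracking over eye pairs is taken from the description above.

```python
def detect_faces(image, s):
    """Walk the detection hierarchy; ``s`` bundles the stage operators
    (hypothetical callables), so only the control flow shown here is real."""
    faces = []
    for blob in s.find_blobs(image):                   # head-like blob candidates
        contour = s.approximate_contour(blob)
        if not s.is_face_like(contour):
            continue
        for eyes in s.find_eye_pairs(image, contour):  # candidate eye pairs
            if not s.eyes_fit_contour(eyes, contour):
                continue                               # reject implausible pair
            mouth = s.find_mouth(image, contour, eyes)
            if mouth is None:
                continue                               # feedback: try next eye pair
            nose = s.find_nose(image, eyes, mouth)
            irises = s.find_irises(image, eyes)
            faces.append((contour, eyes, mouth, nose, irises))
    return faces
```

A failed mouth search simply advances the loop to the next candidate eye pair, which is the feedback behaviour described above expressed as ordinary backtracking.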
Note the qualitative comparison of the different stages on the right of Figure . This is a figurative description of the coarse-to-fine approach of the algorithm. The initial stages of the search are very fast and coarse since they use low-resolution operators, and these operators are applied over relatively large regions of the image. The early stages are also robust to noise and do not require constrained data to function. Later stages yield more precise localization information and use slow, high-resolution operators. However, they are sensitive to distracting external data and noise, and therefore need to be applied within a small, constrained window for a local analysis. In other words, they need to be guided by the previous, robust stages of the search. This figurative description of the stages is merely intended to reflect the spirit with which detection is to be approached. In short, the algorithm begins with a 'quick and dirty' estimate of where the face is and then gradually refines its localization around that neighbourhood by searching for more precise albeit elusive targets (such as the iris). This coarse-to-fine concept will become clearer as the individual stages of the algorithm and their interdependencies are explained later.
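As a concrete illustration of this coarse-to-fine spirit, the sketch below uses a cheap score evaluated on a sparse grid to stand in for the coarse, low-resolution operators, and an expensive score evaluated only in a small window around the coarse estimate to stand in for the fine, local ones. Both score functions and the step and window sizes are hypothetical, chosen only to show the refinement pattern.

```python
def coarse_to_fine(image, coarse_score, fine_score, step=8, window=16):
    h, w = len(image), len(image[0])
    # Coarse pass: evaluate a cheap score on a sparse grid over the whole image.
    _, (cy, cx) = max((coarse_score(image, y, x), (y, x))
                      for y in range(0, h, step)
                      for x in range(0, w, step))
    # Fine pass: evaluate an expensive score only near the coarse estimate.
    _, best = max((fine_score(image, y, x), (y, x))
                  for y in range(max(0, cy - window), min(h, cy + window))
                  for x in range(max(0, cx - window), min(w, cx + window)))
    return best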
We implement this hierarchical search as a control structure which draws on a palette of tools, including the biologically and perceptually motivated computations developed in Chapter 2. These are used to extract low-level geometrical descriptions of image data, which are then processed to generate a robust and accurate localization of the face.
Note that the detection algorithm is based on a variety of heuristics that loosely describe a model of the human face. The many thresholds and geometric relationships introduced at each stage of the localization cumulatively define our model of the face. These thresholds and constraints have been kept relatively lax to allow for a wide range of face imaging situations. Consequently, the numerical parameters we use are not critical; they need not be optimal, and the system is not unnecessarily sensitive to them. Rather, they allow for large margins of safety, so that face detection can proceed despite noise, variations, and so on. Such a flexible, forgiving model gives the system greater robustness and fewer misdetections. In fact, a face is such a multi-dimensional, highly deformable object that an explicit, precisely parametrized model would be very difficult to derive and manipulate.