We begin with an arbitrary image of a natural, uncontrived scene containing people. We then generate an intensity image pyramid as in Figure and a corresponding edge map pyramid as shown in Figure . We use the edge map pyramid to apply the symmetry transform at various scales, which allows us to detect blobs of arbitrary size in the image. The blob detector uses only the 6 annular sampling regions described in Table . We can afford to limit the number of annular sampling regions to six at this stage because the subaveraging involved in building the pyramid obviates the need for more scale invariance in the operator itself. We apply the general symmetry transform to each of the edge maps and mark the centers of the detected blobs on the intensity pyramid. The general transform (rather than the dark or bright symmetry transform) is used because heads and faces do not consistently appear either brighter or darker than the background of a scene. This multi-scale interest detection operation provides us with the blob detection pyramid displayed in Figure .
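The subaveraging step that builds the intensity pyramid can be sketched as follows. This is a minimal illustration, not the implementation used here: it assumes that scale "s x" means each s-by-s block of pixels is averaged into one pixel, and the function names `subaverage` and `build_pyramid`, as well as the choice of integer factors, are hypothetical.

```python
def subaverage(image, factor):
    """Shrink a 2D grayscale image (a list of rows) by averaging
    non-overlapping factor x factor blocks. Assumes integer factors;
    leftover rows or columns at the border are dropped."""
    rows = len(image) // factor
    cols = len(image[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor)
                     for j in range(factor)]
            out_row.append(sum(block) / (factor * factor))
        out.append(out_row)
    return out


def build_pyramid(image, factors):
    """Return a dict mapping each reduction factor to its subaveraged image.
    The set of factors is an assumption made for illustration."""
    return {f: subaverage(image, f) for f in factors}
```

Because every level is a reduced copy of the same scene, a fixed-size operator applied at level s responds to structures roughly s times larger in the original image, which is why a small set of sampling regions suffices.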
We threshold the output of the interest map so that only attentional peaks exhibiting a certain minimal level of interest appear in the output. The threshold on the interest map is very tolerant and allows many extremely weak blobs to register. Thus, the precise selection of an interest map threshold is not critical. Furthermore, we only consider the five strongest peaks of interest (the five most significant blobs) at each scale in the multi-scalar analysis. This prevents the system from spending too much time investigating blobs at each scale. We expect the face to be somewhat dominant in the image, so that it will be one of the five strongest blobs at the scale it resides in. If we expect many faces or other blobs in the image at the same scale, this value can be increased beyond 5. This would be advantageous, for example, when analyzing group photos. Both the threshold on interest value and the limit on the number of peaks are required: for efficiency, we never wish to process more than 5 blobs per scale, and we require each blob to exhibit a minimal level of significance to warrant any investigation whatsoever. Finally, we stop applying the interest operator at scales smaller than 4x. The interest operator is limited in size to r=9 pixels and, consequently, the blobs detected at scales lower than 4x would be too small and would have insufficient resolution for subsequent facial feature localization and recognition. For example, the blobs detected at scale 3x would be objects smaller than 54 × 54 pixels, and the representation of a face at such a resolution would prevent accurate facial feature detection.
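The three limits described above (interest threshold, at most 5 peaks per scale, and no scales below 4x) can be sketched as a single filtering pass. This is an illustrative sketch under stated assumptions: peaks are represented as hypothetical `(x, y, interest)` tuples, the names `select_peaks` and `max_blob_extent` are invented here, and the blob extent is assumed to be the operator diameter 2r multiplied by the scale factor, which reproduces the 54-pixel bound quoted for scale 3x.

```python
OPERATOR_RADIUS = 9  # r = 9 pixels, as stated in the text


def max_blob_extent(scale, radius=OPERATOR_RADIUS):
    """Largest blob side length, in original-image pixels, detectable at a
    given scale. Assumes extent = operator diameter (2r) times the scale."""
    return 2 * radius * scale


def select_peaks(peaks_by_scale, min_interest, max_per_scale=5, min_scale=4):
    """Filter interest peaks per scale.

    peaks_by_scale: dict mapping scale factor -> list of (x, y, interest).
    Scales below min_scale are skipped, peaks weaker than min_interest are
    dropped, and only the max_per_scale strongest survivors are kept.
    """
    selected = {}
    for scale, peaks in peaks_by_scale.items():
        if scale < min_scale:
            continue  # blobs here would be too small for later stages
        strong = [p for p in peaks if p[2] >= min_interest]
        strong.sort(key=lambda p: p[2], reverse=True)
        selected[scale] = strong[:max_per_scale]
    return selected
```

For a group photo, `max_per_scale` would simply be raised above 5. The value of `min_interest` is left to the caller, since the text indicates only that the threshold is tolerant and not critical.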
The peaks in Figure are shown before we threshold the interest response, before we limit the number of blobs per scale (maximum of 5), and before we limit the scale of the search. Once these three limits are introduced, the number of peaks generated by the face blob detection stage drops, as shown on the right-hand side of Figure . Only 5 peaks in total remain valid after this filtering (as seen by the 5 square grids that remain for processing by the next stages).
There is some redundancy, as some blobs are detected more than once at adjacent scales. This is due to the overlap in scale-space of our symmetry transform operator. However, this redundancy, or multiple-hit phenomenon, is not problematic, since later stages will select only one 'hit' (one face) out of several redundant blob responses. Likewise, the detection of non-facial blobs is not problematic at this stage: since each blob undergoes further processing to determine whether it is truly a face, we can tolerate false alarms during blob detection. Finally, a lack of accuracy in our blob detector is also acceptable at this stage, since we will refine the localization of faces in subsequent stages. What is most dangerous at this early stage of the algorithm is a total miss of a face or head blob in the image. Fortunately, a clear miss of the face in our multi-scalar blob detection is extremely unlikely, since heads and faces have consistently strong responses in the perceptual interest map.