Early attempts at face recognition were mostly feature-based. These include Kanade's [19] work, where a series of fiducial points are detected using relatively simple image processing techniques (edge maps, signatures, etc.) and their Euclidean distances are then used as a feature vector to perform recognition. More sophisticated feature extraction algorithms were proposed by Yuille, Cohen and Hallinan [45]. These use deformable templates that translate, rotate and deform in search of a best fit in the image. Often, these search techniques use a knowledge-based system or heuristics to restrict the search space with geometrical constraints (e.g. the mouth must lie below the eyes) [10]. Unfortunately, such energy minimization methods are extremely computationally expensive and can get trapped in local minima. Furthermore, a certain tolerance must be given to the models since they can never perfectly fit the structures in the image. However, the use of a large tolerance value tends to destroy the precision required to recognize individuals on the basis of the model's final best-fit parameters. Nixon proposes the use of Hough transform techniques to detect structures more efficiently [31]. However, the problem remains that these detection-based algorithms need to be tolerant and robust, and this often makes them insensitive to the minute variations needed for recognition. Recent research in geometrical, feature-based recognition [9] reported a 95% recognition rate. However, the 30 feature points used for each face were manually extracted from each image. Had some form of automatic localization been used, it would have generated poorer results due to lower precision. In fact, even the most precise deformable template matching algorithms, such as Roeder's [40] and Colombo's [8] feature detectors, generally have significant errors in detection. This is also true for other feature detection schemes such as Reisfeld's symmetry operator [37] and Graf's filtering and morphological operations [15]. Essentially, current systems for automatic detection of fiducial points are not accurate enough to obtain high recognition rates exclusively on the basis of simple geometrical statistics of the localization.
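To make the geometric approach concrete, the following is a minimal sketch, not taken from any of the cited systems, of recognition from pairwise distances between fiducial points. It assumes the points have already been localized (which, as argued above, is precisely the hard part); the function names and the choice of the inter-ocular pair as the normalizing reference are illustrative assumptions.

```python
import numpy as np

def geometric_descriptor(points):
    """Build a scale-normalized feature vector from the pairwise
    Euclidean distances between fiducial points.

    points : (K, 2) array of (x, y) coordinates; by assumption the
    first two rows are the two eye centers, so the first pairwise
    distance is the inter-ocular distance used for normalization.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(points), k=1)   # each unordered pair once
    vec = dists[i, j]
    return vec / vec[0]                        # divide out inter-ocular scale

def recognize(probe_points, gallery):
    """Nearest-neighbor match of a probe against a gallery of
    (name, points) pairs -- the 'simple geometrical statistics'
    that the text argues are too sensitive to localization error."""
    probe = geometric_descriptor(probe_points)
    best = min(gallery,
               key=lambda entry: np.linalg.norm(
                   probe - geometric_descriptor(entry[1])))
    return best[0]
```

Note that every entry of the distance vector is perturbed by localization noise in even a single fiducial point, which is consistent with the observation above that the 95% figure required manually extracted points.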
Holistic techniques have recently been popularized and generally involve the use of transforms to make the recognition robust to slight variations in the image. Rao [36] develops an iconic representation of faces by transforming them into a linear combination of natural basis functions. Manjunath [28] uses a wavelet transform to simultaneously extract feature points and to perform recognition on the basis of their Gabor wavelet jets. Such techniques perform well since they do not exclusively compute geometric relationships between fiducial points. Rather, they compare the jets or some other transform-vector response around each fiducial point. Alternative transform techniques have been based on statistical training. For example, Pentland [44] uses the Karhunen-Loeve decomposition to generate the optimal basis for spanning mug-shot images of human faces and then uses the resulting transform to map the faces into a lower-dimensional representation for recognition. This technique has also been applied by Akamatsu [1] to Fourier-transformed images instead of the original intensity images. Recent work by Pentland [32] involves modular eigenspaces, where the optimal intensity decomposition is performed independently around feature points (eyes, nose and mouth). Pentland [29] has also investigated the application of the Karhunen-Loeve decomposition to statistically recognize individuals on the basis of the spectra of their dynamic Lagrangian warping onto a standard template. These transform techniques have yielded very high recognition rates and have quickly gained popularity. However, these non-feature-based techniques do not fare well under pose changes and have difficulty with natural, uncontrived face images.
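The Karhunen-Loeve pipeline referred to above can be summarized in a few lines. The sketch below is a generic reconstruction under the usual assumptions (pre-aligned, fixed-size, flattened mug-shot images), not the exact implementation of [44]:

```python
import numpy as np

def train_eigenspace(faces, n_components):
    """faces: (N, H*W) matrix of flattened, aligned training images.
    Returns the mean face and the top principal directions
    (the Karhunen-Loeve basis) spanning the training set."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data yields the eigenvectors of the sample
    # covariance without forming the (H*W x H*W) covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]            # (n_components, H*W)

def project(image, mean, basis):
    """Map a flattened face into the low-dimensional coefficient space."""
    return basis @ (image - mean)

def recognize(probe, gallery_coeffs, labels, mean, basis):
    """Nearest neighbor in coefficient space."""
    c = project(probe, mean, basis)
    d = np.linalg.norm(gallery_coeffs - c, axis=1)
    return labels[int(np.argmin(d))]
```

Recognition thus reduces to nearest-neighbor matching in a low-dimensional coefficient space, which is efficient but, being a purely linear projection of raw intensities, inherits the sensitivities discussed next.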
In most holistic face recognition algorithms, the face needs to be either segmented or surrounded by a simple background. Furthermore, the faces presented to the algorithms need to be roughly frontal and well-illuminated for recognition to remain accurate. This is due to the algorithms' dependence on fundamentally linear or quasi-linear analysis techniques (the Fourier, wavelet and Karhunen-Loeve decompositions are all linear transformations). Thus, performance degrades rapidly under 3D orientation changes, non-linear illumination variation and background clutter (i.e. large, non-linear effects).
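The linearity point can be stated precisely: each of these transforms T satisfies T(aI + bJ) = aT(I) + bT(J), so a global illumination scaling I' = aI merely scales the transform coefficients and is trivially normalized away. A 3D head rotation or a background change, by contrast, rearranges the pixels of I in a way that has no such closed-form expression in the coefficients, so the linear representation offers no built-in invariance to it.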