Multi-Attribute Robust Facial Feature Localization

Oya Çeliktutan, Hatice Çınar Akakın, Bülent Sankur

Boğaziçi University, Electrical & Electronic Engineering Department, 34342 Bebek, Istanbul

{oya.celiktutan, hatice.cinar, bulent.sankur}@boun.edu.tr

Abstract

In this paper, we focus on the reliable detection of facial fiducial points, such as eye, eyebrow and mouth corners. The proposed algorithm aims to improve automatic landmarking performance in challenging, realistic face scenarios subject to pose variations, high-valence facial expressions and occlusions. We explore the potential of several feature modalities, namely Gabor wavelets, Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF) and the Discrete Cosine Transform (DCT), both singly and jointly. We show that selecting the highest-scoring face patch as the corresponding landmark is not always best, and that there is considerable room for improvement through cooperation among several high-scoring candidates and through a graph-based post-processing method. We present our experimental results on the Bosphorus Face Database, a new and challenging database.

1. Introduction

The localization of facial features is a fundamental step in many applications, such as facial expression analysis, face animation and 3D face reconstruction, and it is instrumental in face recognition and face tracking. Correct localization of the facial features greatly affects the overall performance of the face processing system. Facial feature localization still remains a challenging computer vision task for the following reasons:

• Intrinsic variability: Facial expressions, pose, occlusions due to hair or hand movements, and self-occlusion due to rotations impede successful feature detection. A unique facial feature localizer that works well under all intrinsic variations of faces and delivers the target features in a time-efficient manner has not yet proved feasible.

• Acquisition conditions: Much as in the case of face recognition, acquisition conditions such as illumination, resolution and pose greatly affect localization performance. For example, a feature localizer trained on one database (say, ORL) can perform poorly on another database (say, FRGC) where the acquisition conditions differ substantially from those of the former [1].

In this paper, we address the problem of accurately localizing the principal or primary facial fiducial points: the nose tip, the chin tip, the two mouth corners, the four inner and outer eye corners and, similarly, the four eyebrow corners, 12 points in total. These are called primary in that they are characterized by corners and edges, and they are the most instrumental in determining facial identity and expression. There are also secondary landmark points, such as the nostrils, chin, nose bridge and cheek contours, as many as 62 points. These are called secondary in that they have scarcer low-level image evidence. The secondary landmarks are typically found with the aid of a graph structure trained on the primary landmarks.

The novelty of our facial landmarking algorithm is the use of a multi-feature framework. We model facial landmarks redundantly with four different feature categories, namely the Discrete Cosine Transform (DCT), Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA) and Gabor wavelets. We run them in parallel and subsequently fuse their estimates or decisions so as to combat the effects of illumination and expression variations.

The paper is organized as follows. Previous work is reviewed in Section 2. The automatic facial landmark extraction algorithm, from feature extraction through the decision fusion stage, is described in Sections 3 and 4. The database and experimental results are provided in Section 5. In Section 6, we discuss future work and draw our conclusions.

2. Previous Works on Facial Landmarking

We can classify the existing works in the literature on automatic facial landmarking into four main groups based on the type of features and anthropometric information they use:

Appearance-based approaches: These approaches generally use texture (intensity) information only and learn the characteristics of the landmark neighborhoods projected into a suitable subspace. Examples of these approaches are eigenfeatures [11], Haar features [15], Independent Component Analysis [2] and Gabor features [16], in conjunction with Multilayer Perceptrons (MLPs), Support Vector Machines (SVMs) or boosted classifiers. Template matching techniques can also be regarded as a subset of appearance-based approaches [4].

Geometry-based approaches: These methods are based on the geometric relationships (distances, angles, etc.) between facial features [14]. They have had some success in detecting facial features, but they cannot handle large variations in faces, such as rotations, facial expressions and inaccurate landmarks.

Graph-based approaches: These approaches generally process the intensity and geometric information together, and they seem to be very effective in many cases. The idea is to construct a template (graph) from anthropometric information and to deform this graph according to a defined criterion [17]. Another example of this approach is to first process the face data with appearance-based methods and then refine the results via a graphical model [1].

3D vision-based approaches: Recent advances in 3D acquisition devices provide the opportunity to use 3D face data. One advantage is that 3D face measurements are independent of lighting and pose variations. However, given the noisy nature of 3D data, they require a preprocessing stage to reduce spikes and discontinuities. Recent 3D landmarking studies include [6] and [8]. In addition, multi-modal approaches that combine 2D and 3D data have shown promising performance [12].

Our method is a combination of appearance-based and graph-based methods. We run four appearance-based channels in parallel, fuse the outcomes and then resort to a graphical model to correct gross localization errors or estimate the missing features. We test our method on a new and challenging database, the Bosphorus Face Database, which is replete with strong facial expressions, occlusions and rotations [13]. We also present experimental results on the BioID Face Database, which includes a larger variety of illumination conditions, backgrounds and face sizes, but is more subdued in poses and expressions [18].

3. Facial Landmarking Algorithm

The framework of the proposed method is given in Figure 1. First, the original image is downsampled by a factor of 8 (i.e., a 640 × 480 original image is downsampled to 80 × 60 resolution) for a two-tier search. In the first, low-resolution tier, each feature channel identifies candidate landmark points in the face image.

Figure 1. Block diagram of the proposed method.

For each landmark, we allow a number of candidates, as output by the SVM, that will take a role in a voting or fusion scheme. These candidates are qualified according to the goodness scores given by the corresponding SVM. For example, one approach is fusion by weighted median, where the normalized SVM scores are used as the weights of the candidate points. When the multi-feature approach is not able to locate a landmark point at the coarse level, we declare it a missing landmark and try to recover it via graph-based back-projection, that is, by simply estimating its position using the face graph fitted to the detected landmarks, as in [12]. Finally, these coarsely estimated points are refined by a local search over the full-resolution image.

3.1. Description of Features for Landmarks

To capture the landmarks, we model their typical characteristics with several different approaches. In this paper, we investigate four methods: DCT templates (Discrete Cosine Transform), Gabor wavelet coefficients (Gabor Wavelet Transform), encoding vectors produced by NMF (Non-negative Matrix Factorization) and mixing coefficients of ICA (Independent Component Analysis). To form the training data, several k × k patches (e.g., k = 8 for 80 × 60 resolution) centered at landmark points, such as eye corners, the nose tip and mouth corners, plus non-fiducial points (false positives), are cropped, vectorized and stored in the columns of a matrix. Then, we extract the relevant features from the data matrix separately and use them to train individual SVM classifiers.
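As a concrete illustration of this data layout, the sketch below crops and vectorizes patches into the columns of a data matrix. It is a minimal reading of the description above; the function name, border handling and dtype are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def build_patch_matrix(image, points, k=8):
    """Crop k x k patches centered at the given (x, y) points, vectorize
    each patch, and stack the vectors as columns of a data matrix.
    Points too close to the image border are skipped (an assumption;
    the paper does not specify border handling)."""
    h, w = image.shape
    half = k // 2
    columns = []
    for x, y in points:
        x, y = int(round(x)), int(round(y))
        if half <= x <= w - half and half <= y <= h - half:
            patch = image[y - half:y + half, x - half:x + half]
            columns.append(patch.reshape(-1).astype(np.float64))
    return np.column_stack(columns)   # shape: (k*k, number of patches)
```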

Discrete Cosine Transform: DCT templates represent the intensity changes and statistical shape variations in a given image block. They have proved quite discriminative in previous studies [1], [12]. The features are the low- to band-pass coefficients. In the coarse-level search, 8 × 8 DCT blocks (corresponding to 64 × 64 blocks at full resolution) are considered, while in the refinement stage, 16 × 16 DCT blocks are taken. In either case, two thirds of the DCT coefficients in the zigzag pattern are selected as the feature vector and fed into their respective SVMs.
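A sketch of this feature extractor, assuming SciPy's DCT-II with orthonormal scaling; the zigzag ordering and the two-thirds cutoff follow the description above.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n):
    """(row, col) pairs of an n x n block in JPEG-style zigzag order,
    i.e., by increasing anti-diagonal, alternating direction."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda ij: (ij[0] + ij[1],
                                  ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))

def dct_feature(patch, keep_ratio=2.0 / 3.0):
    """2D DCT of a square patch; keep the first two thirds of the
    coefficients in zigzag order (low frequencies first) as the feature."""
    n = patch.shape[0]
    coeffs = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
    ordered = [coeffs[i, j] for i, j in zigzag_indices(n)]
    return np.asarray(ordered[:int(len(ordered) * keep_ratio)])
```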

Gabor Wavelets: Gabor wavelets represent local characteristics and provide robustness against luminance variations in the image. By changing the frequency and orientation parameters, one can match, and hence locate, patterns having similar scales and orientations.

Figure 2. Feature extraction method via Architecture I. The feature components in X are considered to be a linear combination of statistically independent basis images, S, where A is an unknown mixing process.

In our experiments, we employed three scales, v = 0, 1, 2, and four orientations, µ = 0, 2, 4, 6, resulting in 12 Gabor filters [12]. The Gabor feature vector is obtained by convolving 8 × 8 patches with the proposed filters. Then, PCA is used to reduce this 8 × 8 × 12-dimensional feature vector to a 100-dimensional one.
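The following sketch builds such a filter bank with a Wiskott-style parameterization (k_v = k_max / f^v, φ_µ = µπ/8). The kernel size and the constants k_max, f and σ are common defaults, not values reported by the paper, and the PCA step is shown with scikit-learn.

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.decomposition import PCA

def gabor_kernel(v, mu, size=9, k_max=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """Complex Gabor kernel at scale v and orientation mu (assumed
    parameterization; constants are common defaults, not the paper's)."""
    k = k_max / f ** v
    phi = mu * np.pi / 8
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k**2 / sigma**2) * np.exp(-k**2 * (x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(1j * (k * np.cos(phi) * x + k * np.sin(phi) * y)) - np.exp(-sigma**2 / 2)
    return envelope * carrier

# 3 scales x 4 orientations = 12 filters, as in the text.
bank = [gabor_kernel(v, mu) for v in (0, 1, 2) for mu in (0, 2, 4, 6)]

def gabor_feature(patch, bank):
    """Convolve an 8 x 8 patch with each filter; the concatenated magnitude
    responses give an 8*8*12 = 768-dimensional vector."""
    return np.concatenate(
        [np.abs(convolve2d(patch, g, mode='same')).ravel() for g in bank])

# PCA, fitted on the training features, reduces 768 to 100 dimensions:
# pca = PCA(n_components=100).fit(train_features)
# reduced = pca.transform(test_features)
```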

Independent Component Analysis: ICA aims to express a set of random variables as linear combinations of statistically independent component variables [7]. There are two different approaches to the use of ICA [3]. In our experiments, we adopted Architecture I for the facial landmark detection problem, as illustrated in Fig. 2, where the appearance of a landmark is modeled as a linear combination of statistically independent source images. The data matrix X is first subjected to PCA, so that the data is projected onto the M-eigenvector matrix V, that is, R = XV. The ICA analysis is performed on V^T, whose rows are the eigenvectors. We then obtain the ICA basis images as S = WV^T. Using the PCA representation coefficients R and the separating matrix W, the mixing coefficient matrix is calculated as A = RW^(-1). In the testing stage, a given test block x_test is first projected onto the feature subspace, r_test = x_test V. Then, the ICA feature vector is obtained by multiplying by the inverse of the separating matrix, a_test = r_test W^(-1). The decision as to whether a given patch x_test corresponds to one of the fiducial landmarks is based on the comparison between the training feature vectors {a_1, a_2, ..., a_k} and the patch data appropriately projected onto the principal components and then de-mixed, that is, a_test. Here, we use 52 factors at the coarse level.
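A minimal sketch of Architecture I with scikit-learn, following the equations above: FastICA fitted on the eigenvector matrix V plays the role of the separating matrix W, and its mixing_ attribute is the pseudo-inverse of W used to de-mix. Centering and whitening details are library defaults and may differ from the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def fit_ica_architecture1(X, n_factors=52):
    """X holds one vectorized training patch per row. PCA gives the
    eigenvector matrix V (pixels x M); FastICA fitted on V yields
    sources S = W V^T, the independent basis images."""
    pca = PCA(n_components=n_factors).fit(X)
    V = pca.components_.T                  # pixels x M
    ica = FastICA(n_components=n_factors, max_iter=1000)
    ica.fit(V)                             # separating matrix W = ica.components_
    return V, ica

def ica_feature(x, V, ica):
    """r = x V (projection onto eigenvectors), then a = r W^(-1)."""
    r = x @ V
    return r @ ica.mixing_                 # mixing_ is the pseudo-inverse of W
```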

Non-negative Matrix Factorization: Non-negative Matrix Factorization (NMF) decomposes a data matrix into the product of two matrices that are constrained to have non-negative elements. The resulting factors can be interpreted as the subspace vectors and the weight coefficients; hence NMF can be used as a feature extraction technique. More explicitly, given an n × m data matrix V, NMF finds two non-negative matrices W ∈ R^(n×r) and H ∈ R^(r×m) such that V ≈ WH. Here, each column of V represents an object, i.e., a human face or a small patch around a facial fiducial point.

Figure 3. Results of the NMF algorithm applied to facial features: (a) non-negative matrix factorization; (b) encoding vectors corresponding to each feature type.

In this study, we used the projected gradient method in our experiments [10]. In Fig. 3(a), the idea behind NMF as a facial feature extraction method is illustrated. In the training stage, NMF approximately factorizes the data matrix V, which is composed of different feature types, into the matrices W and H. The rows of H, namely the encoding vectors, constitute the feature vectors of each fiducial point. Once we obtain W, the testing stage is straightforward: for any unknown image block, the feature vector is obtained as h_test = W† v_test, where W† denotes the pseudo-inverse. In Fig. 3(b), we give some examples of encoding vectors corresponding to different types of features. As shown, various facial features result in different encoding vectors. In our experiments, as in the ICA case, we use 52 factors at the coarse level.
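A sketch of the training and testing steps with scikit-learn. Note that sklearn's NMF uses coordinate-descent or multiplicative-update solvers rather than the projected-gradient method of [10], and it expects samples in rows, so we factorize V^T; the pseudo-inverse projection for test blocks follows the h_test = W† v_test rule above.

```python
import numpy as np
from sklearn.decomposition import NMF

def fit_nmf(V, n_factors=52):
    """Factorize the non-negative data matrix V (pixels x patches) as
    V ~= W H. sklearn factorizes V.T ~= (H.T)(W.T), so transpose back."""
    model = NMF(n_components=n_factors, init='nndsvda', max_iter=500)
    H = model.fit_transform(V.T).T      # r x patches: encoding vectors
    W = model.components_.T             # pixels x r: subspace vectors
    return W, H

def nmf_feature(v_test, W):
    """Encoding of an unknown image block: h = W† v."""
    return np.linalg.pinv(W) @ v_test
```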

4. Decision Fusion and Refinement

The features extracted from each patch are classified by a binary SVM classifier, trained for that specific landmark, as a possible landmark point or not. This is repeated for each patch: for example, in the coarse search for eye corners, we consider n possible patches in the upper-left quarter of the face. We rank the positive outcomes from the SVM and can then pick the highest-ranking patch. However, the highest-ranking patch may be misleading in some face scenarios, such as extreme facial expressions or different illumination conditions. We therefore propose the following algorithm to overcome these difficulties:
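The per-landmark classification and ranking step could look like the sketch below. The references include LIBSVM [5], and scikit-learn's SVC wraps that library; the helper name, the top-L cutoff and the use of the signed decision value as the goodness score are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC   # SVC wraps LIBSVM [5]

def rank_candidates(features, locations, svm, top_l=5):
    """Score each candidate patch with the landmark's binary SVM and
    return the top-L positively classified candidates, best first."""
    scores = svm.decision_function(features)    # signed margin per patch
    order = np.argsort(scores)[::-1]
    keep = [i for i in order if scores[i] > 0][:top_l]
    return [(locations[i], float(scores[i])) for i in keep]
```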

1) Score fusion: Instead of immediately deciding upon only one candidate, we select the top L highest-scoring patches. Then, we fuse these multiple candidates using a weighted median filter on their spatial locations, as illustrated in Fig. 5(a).

Figure 4. (a) Candidates populated from each feature channel; (b) reliable points selected by score fusion; (c) graph completion.

Before median filtering, we re-normalize the SVM scores of the candidates by the min-max method and replicate the most probable points according to their normalized scores. If our search over the entire region does not yield any candidate points for a specific landmark, we label this landmark as missing and leave its recovery to post-processing via graph-based structural completion. An example of score fusion is given in Fig. 4.
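A sketch of this fusion rule. Here the weighted median is computed directly from the cumulative weights rather than by physically replicating points as the text describes; the two are equivalent up to the quantization of the replication counts.

```python
import numpy as np

def min_max(scores):
    """Min-max normalization of the SVM scores to [0, 1]."""
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return np.ones_like(s) if rng == 0 else (s - s.min()) / rng

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w)
    return v[np.searchsorted(cdf, 0.5 * cdf[-1])]

def fuse_candidates(points, scores):
    """Fuse the top-L candidate locations coordinate-wise."""
    w = min_max(scores)
    xs, ys = zip(*points)
    return weighted_median(xs, w), weighted_median(ys, w)
```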

2) Structural completion: The structural completion is based on a graph whose nodes coincide with the 12 landmark points. The arcs between the nodes are modeled as Gaussian spring forces, with means and variances learned during the training phase. Thus, for example, the left outer eye corner is tied tightly to the left inner eye corner, but more loosely to the right eye corners, the mouth corners and the nose. In fact, the greater the anthropometric variability in the training database, the larger the corresponding variance and the looser the bond. Here, the structural completion method serves the purpose of recovering the missing landmarks. Accordingly, the reliably estimated landmarks are used as anchor points for the graph of the actual face, and the missing landmarks are simply read off from the adapted graph. In Fig. 5(b), we give an example scatter plot of the estimated locations of the missing landmark points based on seven reliable landmark points (eyebrow corners, outer eye corners and the left mouth corner). The details can be found in [12].
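The sketch below shows one plausible reading of this estimator: each arc from a reliable anchor predicts the missing point as the anchor position plus the offset mean learned in training, and the Gaussian-spring variances weight the predictions so that tight bonds count more. The precision-weighted average is our interpretation; the exact estimator of [12] may differ.

```python
import numpy as np

def complete_landmark(anchors, mean_offset, var_offset):
    """Estimate a missing landmark from reliably detected anchors.
    anchors: {name: (x, y)} of detected landmarks.
    mean_offset[name]: learned mean offset (2-vector) from that anchor
    to the missing point; var_offset[name]: scalar variance of the arc."""
    preds = np.array([np.asarray(p) + mean_offset[n]
                      for n, p in anchors.items()])
    precision = np.array([1.0 / var_offset[n] for n in anchors])
    # Inverse-variance weighting: loose (high-variance) bonds count less.
    return (precision[:, None] * preds).sum(axis=0) / precision.sum()
```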

3) Refinement: In the refinement stage, each coarse-level landmark is transferred to the full-resolution image, where it is refined by searching for a better fit around the coarsely estimated point. This step is realized using only DCT features.
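A sketch of this coarse-to-fine step: the coarse coordinates are scaled up by the downsampling factor of 8, and the best-scoring 16 × 16 DCT patch is sought in a small window around them. The ±8-pixel window size is our illustrative choice; the paper does not specify it.

```python
import numpy as np

def refine(coarse_xy, full_image, svm, dct_feature, k=16, search=8, scale=8):
    """Transfer a coarse estimate to full resolution and keep the point
    whose k x k DCT feature gets the highest SVM score nearby."""
    cx, cy = int(coarse_xy[0] * scale), int(coarse_xy[1] * scale)
    h, w = full_image.shape
    half = k // 2
    best, best_score = (cx, cy), -np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = cx + dx, cy + dy
            if half <= x <= w - half and half <= y <= h - half:
                patch = full_image[y - half:y + half, x - half:x + half]
                score = svm.decision_function(dct_feature(patch)[None, :])[0]
                if score > best_score:
                    best, best_score = (x, y), score
    return best
```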

5. Experimental Results

5.1. Database

In our experiments, we compared two facial databases: 1) the Bosphorus Face Database [13] and 2) the BioID Database [18]. The full Bosphorus database is intended to have a rich set of poses (13), expressions (34), some neutrals and four occlusions, but no illumination effects.

Figure 5. (a) Illustration of score fusion based on the weighted median filter; (b) the spatial distribution of missing landmarks estimated with respect to the reliable landmarks.

We consider a subset of 31 different facial expressions, slight head rotations (smaller than 10°), occlusions (eyeglasses, hand and hair) and neutral poses common to the 81 subjects. Some example images are shown in Fig. 6. The data are split into two disjoint parts with non-overlapping subjects: a training set (1048 samples) and a test set (1186 samples).

The better-known BioID database consists of 1521 gray-level images [18]. Each image shows the frontal view of the face of one of 23 different test persons. It differs from the Bosphorus database in that the images are acquired under less controlled conditions, at lower resolution and often with illumination effects. We manually picked 14 subjects (847 samples) as our training set and 9 subjects (674 samples) for testing.

5.2. Experimental Results

The performance of the feature localizers is evaluated in two ways: 1) by computing the mean of the Euclidean distances (in pixels) between the estimated point and its manually marked ground-truth point; and 2) by computing the percentage of faces in which a specific feature was successfully detected.

Figure 6. Poses and expressions used in the experiments. Facial expressions: (M1) mouth stretch, (M2) lip suck, (M3) nose wrinkler, (M4) cheek puff, (M5) outer brow raiser, (M6) eyes closed, (M7) happiness, (M8) fear, (M9) anger, (M10) disgust. Occlusions: (M11) hand, (M12) eyeglasses, (M13) hair. (M14) Neutral pose.

An estimated point is considered correctly detected if its distance from the reference point is less than a threshold. This acceptance threshold is defined in terms of the inter-ocular distance (IOD); there is general agreement that it should be taken as 10% of the IOD.
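The acceptance test can be written compactly. This sketch assumes ground-truth eye centers are available per face for computing the IOD, and that all point arrays have shape (n_faces, 2).

```python
import numpy as np

def detection_rate(estimated, ground_truth, left_eye, right_eye, ratio=0.1):
    """Percentage of faces on which the estimated landmark lies within
    ratio x IOD of its manually marked ground-truth position."""
    iod = np.linalg.norm(np.asarray(left_eye) - np.asarray(right_eye), axis=1)
    err = np.linalg.norm(np.asarray(estimated) - np.asarray(ground_truth), axis=1)
    return 100.0 * np.mean(err < ratio * iod)
```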

In Fig. 7, we compare the average performance of the individual feature channels and our fusion method on both databases. The Bosphorus database is composed of high-quality images under homogeneous illumination conditions. For this reason, the individual feature channels all exhibit similar performance, and practically nothing is gained by fusion. In the BioID database, however, the low-quality images and adverse illumination conditions result in significantly lower performance. In this case, it does pay to fuse the scores of the individual feature channels, followed by graph completion, resulting in a net improvement of 7.1%. The missing landmarks are estimated via the graph-completion algorithm, which is itself based on the more accurately detected landmarks, hence yielding overall more reliable results.

In a second experiment, we analyzed cross-database effects, that is, training the landmarker on one database and testing on a different one. As expected, the highest performance is obtained when we perform both training and testing on the same database, as is evident in Table 1. However, if we train on Bosphorus (BioID) and test on the other, that is, BioID (Bosphorus), the landmarking performance deteriorates rapidly. For example, the average performance of DCT features is 30.3% (training set: BioID, test set: Bosphorus) and 33.4% (training set: Bosphorus, test set: BioID). However, some of the lost performance can be recuperated via the fusion scheme.

In Fig. 8, we investigate the localization performance separately for each type of landmark. One can observe that the most successfully detected landmarks are the inner eye corners, with an accuracy of 94.8%. The outer eye corners (93.8%) and inner eyebrow corners (89.7%) can also be localized with high accuracy.

Figure 7. Performance comparison of the individual methods and the fusion scheme.

Table 1. Average performances (%) over varying training datasets (threshold = 0.1 × IOD).

Train \ Test    Bosphorus   BioID
Bosphorus       80.2        38.7
BioID           40.1        63.5

The mouth corners and outer eyebrow corners are the landmark points most affected by facial expressions. For example, in images containing extreme facial action units, such as mouth stretch, cheek puff or lip corner puller, if the fusion scheme cannot detect any reliable points on the mouth corners, the structural completion method becomes insufficient to handle these variations. All but one of the landmark points can be detected fairly successfully (over 75% accuracy), while the chin tip fails most of the time (27.4%).

Finally, we analyze the localization performance with respect to facial expressions and occlusions. In neutral poses, the performance of the fusion scheme is 98.5% and 92.6% for the inner eye corners and mouth corners, respectively. By comparison, DCT features achieve 95% accuracy for the inner eye corners and 91.1% for the mouth corners. The performance difference between DCT features and the fusion scheme increases with facial expressions and occlusions.

Figure 8. Comparison of localization performance for the different landmark types. The circles over the landmarks have a radius equivalent to 0.1 × IOD.

For example, the localization performance for the inner eye corners decreases to 90% in images with eyeglasses, while with DCT features it drops to 82.5%. Occlusion by hand heavily affects both methods: under eye occlusion, the inner eye rate is 69.5% with the fusion scheme and 63.4% with DCT features, while under mouth occlusion, the mouth rate decreases to 20.7% and 15.8% with the fusion scheme and DCT features, respectively. For extreme facial action units, the fusion scheme generally surpasses the DCT features; for example, under cheek puff, the mouth localization accuracy is 78.7% with the fusion scheme and 68.2% with DCT features. For facial expressions such as happiness, the fusion scheme (inner eye rate = 95.1%, mouth rate = 93%) is more robust than DCT features (inner eye rate = 85.3%, mouth rate = 89%). However, in some poses, the two methods exhibit equivalent performance: in poses where the eyes are closed, the localization accuracy for the inner eye corners is approximately 85% for both methods.

6. Conclusions & Future Directions

We have presented a multi-attribute face landmarking method with score fusion based on a weighted median filter. Fusion contributes most when faces are captured under uncontrolled conditions; it also helps combat the effects of facial expressions, occlusions and poses. In our preliminary experiments, we obtained promising results for different facial expressions. We achieved 89% overall localization performance for all landmarks except the chin tip, which decreased the overall performance by approximately 9%; the contour of the chin does not provide characteristics as discriminative as those of the other facial landmarks.

Whether one uses hand-picked features, such as DCT or NMF coefficients, or Adaboost features, as in the Viola-Jones algorithm, we have come to the conclusion that the face landmarking problem must be attacked by first estimating the face pose, followed by a pose-specialized landmarker.

References

[1] Akakın, H. Ç. and B. Sankur, "Robust 2D/3D Face Landmarking", 3DTV Conference, 2007.

[2] Antonini, G., V. Popovici and J. P. Thiran, "Independent Component Analysis and Support Vector Machine for Face Feature Extraction", Int. Conf. on Audio- and Video-based Biometric Person Authentication, pp. 111-118, Guildford, UK, 2003.

[3] Bartlett, M. S., J. R. Movellan and T. J. Sejnowski, "Face Recognition by Independent Component Analysis", IEEE Trans. on Neural Networks, vol. 13, no. 6, 2002.

[4] Brunelli, R. and T. Poggio, "Template Matching: Matched Spatial Filters and Beyond", MIT Technical Report, AIM-1549, 1995.

[5] Chang, C. C. and C. J. Lin, "LIBSVM: A Library for Support Vector Machines", 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

[6] Colbry, D., G. Stockman and A. Jain, "Detection of Anchor Points for 3D Face Verification", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.

[7] Hyvärinen, A. and E. Oja, "A Fast Fixed-Point Algorithm for Independent Component Analysis", Neural Computation, 9(7):1483-1492, 1997.

[8] Gokberk, B., M. O. Irfanoglu and L. Akarun, "3D Shape-based Face Representation and Feature Extraction for Face Recognition", Image and Vision Computing, vol. 24, no. 8, pp. 857-869, August 2006.

[9] Lee, D. D. and H. S. Seung, "Algorithms for Non-negative Matrix Factorization", Advances in Neural Information Processing Systems 13, pp. 556-562, 2001.

[10] Lin, C. J., "Projected Gradient Methods for Non-negative Matrix Factorization", Neural Computation, 2007.

[11] Ryu, Y. S. and S. Y. Oh, "Automatic Extraction of Eye and Mouth Fields from a Face Image Using Eigenfeatures and Ensemble Networks", Applied Intelligence, 17, pp. 171-185, 2002.

[12] Salah, A. A., H. Ç. Akakın, L. Akarun and B. Sankur, "2D/3D Facial Landmarking for Registration and Recognition", Annals of Telecommunications, vol. 62, no. 1-2, pp. 1608-1633, 2007.

[13] Savran, A., N. Alyuz, H. Dibeklioglu, O. Celiktutan, B. Gokberk, B. Sankur and L. Akarun, "Bosphorus Database for 3D Face Analysis", The First COST 2101 Workshop on Biometrics and Identity Management (BIOID), Roskilde University, Denmark, 2008.

[14] Shih, F. Y. and C. F. Chuang, "Automatic Extraction of Head and Face Boundaries and Facial Features", special issue on informatics and computer science intelligent systems applications, pp. 117-130, 2004.

[15] Viola, P. and M. Jones, "Robust Real-Time Object Detection", IJCV, vol. 57, no. 2, pp. 137-154, 2004.

[16] Vukadinovic, D. and M. Pantic, "Fully Automatic Facial Feature Point Detection Using Gabor Feature Based Boosted Classifiers", IEEE Int. Conf. on Systems, Man and Cybernetics, Hawaii, October 2005.

[17] Wiskott, L., J. M. Fellous, N. Kruger and C. von der Malsburg, "Face Recognition by Elastic Bunch Graph Matching", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 775-779, 1997.

[18] The BioID Face Database, http://www.bioid.com/downloads/facedb/index.php.

