Facial Emotion Recognition: A Technical Introduction
Artificial intelligence and machine learning have roots reaching back to the 1940s and 1950s, when Alan Turing proposed the Turing Test. From there, the field's journey has been something of a rollercoaster ride. Alternately labeled the key to solving all our miseries and the supposed cause of our future destruction, AI featured more often in sci-fi movies and dramas than in published research. Hoping for human-like AI, big firms poured millions into it with little to show for it, and by 2005 the funding had almost ceased. But pioneers like Geoffrey Hinton and his students kept researching and published work that again showed strong results; still, the earlier failures had made funders wary, so they could not secure much funding. Then, in 2012, Hinton and his students entered and won the ImageNet competition by applying deep learning, finishing with a remarkable lead over the rest of the field.
The challenge for the teams was to build a system that could classify images into a thousand categories, from animals and people to landscapes.
In the first two years of the competition, the best teams did not exceed roughly 75% accuracy. In the third year, Geoffrey Hinton and his students suddenly broke through that ceiling: they won the competition by a staggering margin of 10.8%, and their method was deep learning.
This marked the beginning of the modern artificial intelligence revolution, which has since produced great results in areas ranging from AI-assisted medicine to the facial emotion recognition systems discussed here.
FER: Introduction
In recent years, research on facial emotion recognition has become extensive. Facial emotion recognition (FER) aims to identify the state of human emotion (neutral, happy, sad, surprise, fear, anger, disgust, contempt) from facial images.
FER systems are mainly 6-class or 7-class. The 6-class setting includes Anger, Disgust, Fear, Joy, Sadness, and Surprise; the 7-class setting adds Neutral.
Challenges of FER so far:
The challenge lies in recognizing, with high accuracy, the different curves and expressions through which different people express the same emotion.
People across the globe, of different races, sexes, cultures, body architectures, and environments, express the same emotion through a range of different muscle movements.
One challenge for facial expression recognition is recognizing facial expressions at low resolutions, since only compressed, low-resolution video input is available in many real-world applications.
Conventional process
According to a study from 2012, the FER system pipeline can be summarized as follows:
Also known as expression analysis, an emotion recognition system follows a basic three-stage approach common to most machine learning/deep learning processes: face detection, followed by facial feature extraction (pre-processing), and finally a trained emotion classifier that generates labels for the data.
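As a rough sketch of this three-stage flow, the Python skeleton below shows how the stages typically chain together. The function names are hypothetical placeholders for illustration, not from any cited study; each later section fleshes out one of the stages.

```python
import numpy as np

def detect_faces(image: np.ndarray) -> list:
    """Stage 1: return cropped face regions found in the image."""
    ...  # e.g. a Haar cascade or another face detector

def extract_features(face: np.ndarray) -> np.ndarray:
    """Stage 2: map a face crop to a fixed-length feature vector."""
    ...  # e.g. LBP histograms, Gabor responses, or learned CNN features

def classify_expression(features: np.ndarray) -> str:
    """Stage 3: label the feature vector with one of the emotion classes."""
    ...  # e.g. an SVM or neural network trained on labeled faces

def recognize_emotions(image: np.ndarray) -> list:
    """Run the conventional FER pipeline on a single image."""
    return [classify_expression(extract_features(face))
            for face in detect_faces(image)]
```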
The Facial Action Coding System, proposed by Ekman, is another popular approach that classifies emotions based on the movements of particular facial muscles. Facial actions are grouped into Action Units (AUs), and emotions are categorized using combinations of AUs.
Face Detection:
This pre-processing stage identifies human faces in the given dataset (images/videos), segmenting each input into face and non-face regions. Popular algorithms with the highest accuracy for face detection include the following:
Haar classifier: It uses a set of Haar-like features that can be measured by expanding or reducing the size of the pixel group. With low computational complexity, it identifies the features that contribute most to face detection without examining every pixel in detail. (Often used in the training phase)
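As an illustration, OpenCV ships pretrained Haar cascades. The sketch below uses one to crop detected faces from a grayscale image; the file name and parameter values are illustrative choices, not taken from the cited study.

```python
import cv2

# Load one of OpenCV's bundled, pretrained frontal-face Haar cascades.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("person.jpg")                 # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # Haar features work on intensity

# scaleFactor/minNeighbors trade off recall against false positives.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                      minSize=(48, 48))

face_crops = [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]
print(f"Detected {len(face_crops)} face(s)")
```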
Adaptive skin color: It works on adaptive gamma-corrected images (to compensate for illumination) and segments face and non-face regions using a skin-color model. (Not suitable in a real-time environment due to high computational complexity)
AdaBoost contour points: It cascades several strong classifiers and compares the resulting model with new faces. It also uses contour points (a contour plot of elevation on faces), giving good accuracy and performance because fewer features are extracted, which keeps computational complexity low.
Feature Extraction
For FER, the data is unstructured, so features are not immediately apparent. Instead of considering every part of the face closely, feature extraction reduces the dimension of the input space while keeping the important information about the shape, color, and spatial configuration of emotion-indicating regions. Deriving an effective facial representation from the original face images is a vital step for successful facial expression recognition.
It is usually a process of identifying the eye, nose, mouth, wrinkles, furrows, etc.
Some previously used feature extraction techniques are histograms of oriented gradients (HOG), geometric features, Local Binary Patterns (LBP), Gabor filter textures, learned features, and hybrid features.
Let’s discuss a few popular techniques:
Local Binary Pattern (LBP) is a visual descriptor used for classification in computer vision, originally proposed for texture analysis. The paper by Caifeng Shan, Shaogang Gong, and Peter W. McOwan, "Facial expression recognition based on Local Binary Patterns: A comprehensive study", compared several techniques to evaluate FER performance using LBP features. It found that LBP features retain discriminative facial information in a compact representation, can be derived quickly in a single scan of the raw images, and live in a low-dimensional feature space. However, the data must be collected in a highly controlled environment with high-resolution frontal faces.
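A minimal sketch of LBP-based features, using scikit-image's local_binary_pattern; the parameter values and grid size are illustrative defaults, not the exact configuration of Shan et al.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(face: np.ndarray, points: int = 8, radius: int = 1) -> np.ndarray:
    """Compute a uniform-LBP histogram for one grayscale face region."""
    lbp = local_binary_pattern(face, points, radius, method="uniform")
    n_bins = points + 2                      # "uniform" LBP yields P + 2 codes
    hist, _ = np.histogram(lbp.ravel(), bins=n_bins, range=(0, n_bins))
    return hist.astype(float) / hist.sum()   # normalize to a probability vector

def lbp_grid_features(face: np.ndarray, grid: int = 7) -> np.ndarray:
    """Divide the face into a grid and concatenate per-region LBP histograms,
    which preserves a coarse spatial layout of the texture."""
    h, w = face.shape
    cells = [face[i * h // grid:(i + 1) * h // grid,
                  j * w // grid:(j + 1) * w // grid]
             for i in range(grid) for j in range(grid)]
    return np.concatenate([lbp_histogram(c) for c in cells])
```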
Linear Discriminant Analysis (LDA) is a supervised subspace learning technique. It can be trained quickly, and retraining is straightforward. The paper by M. J. Lyons, J. Budynek, and S. Akamatsu, "Automatic Classification of Single Facial Images" (1999), tested the algorithm on a finer discrimination task: recognition of the 6 basic facial expressions. The generalization rate of the expression recognizer on the 10 Japanese female image sets was 92 percent, essentially the same as the 90 percent obtained with a multilayer perceptron in the study by Zhang et al., implying that the linear LDA algorithm is sufficiently robust for this classification task. The paper by C. Shan, S. Gong, and P. W. McOwan, "Facial Expression Recognition Based on Local Binary Patterns: A comprehensive study" (2009), adopted LDA on LBP features to find projection axes along which data points of different emotion classes lie far apart while data points of the same class stay close together.
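A minimal sketch of the LDA step with scikit-learn. The random arrays are placeholders standing in for real extracted features (e.g. the LBP histograms above) and emotion labels; sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: 300 samples of 59-dimensional feature vectors, 6 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 59))
y = rng.integers(0, 6, size=300)

# LDA projects onto at most (n_classes - 1) = 5 axes that maximize
# between-class scatter relative to within-class scatter.
lda = LinearDiscriminantAnalysis(n_components=5)
X_proj = lda.fit_transform(X, y)

# LDA can also act directly as the classifier on the projected data.
print("training accuracy:", lda.score(X, y))
```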
Principal Component Analysis (PCA) is a linear, appearance-based face recognition method rooted in information theory: it decomposes input face images into a small set of characteristic feature images called "eigenfaces", the principal components of the initial training set of face images.
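A minimal eigenface sketch with scikit-learn; the image size and component count are illustrative assumptions, and random data stands in for real flattened face crops.

```python
import numpy as np
from sklearn.decomposition import PCA

# Flatten each 48x48 grayscale face into a 2304-dimensional vector.
rng = np.random.default_rng(0)
faces = rng.random((200, 48 * 48))        # placeholder for real face crops

pca = PCA(n_components=50, whiten=True)   # keep the 50 strongest components
codes = pca.fit_transform(faces)          # low-dimensional face representation

# Each principal component can be reshaped back into a 48x48 "eigenface".
eigenfaces = pca.components_.reshape((50, 48, 48))
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```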
Convolutional Neural Network (CNN) is currently one of the best-known and most widely employed deep learning approaches; it applies a variation of the multilayer perceptron to a minimally pre-processed dataset and gives accurate results.
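A minimal CNN sketch for 48x48 grayscale face crops, written with tf.keras. The architecture and layer sizes are generic illustrative choices, not a published FER model.

```python
from tensorflow.keras import layers, models

# Small CNN mapping 48x48 grayscale faces to 7 emotion classes.
model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),   # one probability per emotion
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=20, validation_split=0.1)  # with real data
```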
Some other deep learning algorithms include the Long Short-Term Memory (LSTM) network and the Generative Adversarial Network (GAN).
Line Edge Map (LEM), used by Yu Dong Zhang, Zhang Jing Yang, Hui Min Lu, Xing Xing Zhou, Preetha Phillips, Qing Ming Liu, and Shu Hua Wang in their paper "Facial Emotion Recognition Based on Biorthogonal Wavelet Entropy, Fuzzy Support Vector Machine, and Stratified Cross Validation", extracts lines from a face edge map as features. It fuses template matching's high performance with geometric features' invariance to illumination and low memory requirements. Using biorthogonal wavelet entropy and a fuzzy multiclass support vector machine, they achieved an overall accuracy of 96.77% with an error of 0.10.
Gabor wavelet: this filtering technique characterizes the image in terms of spatially localized orientation and frequency. Its simple version uses both the real and the imaginary coefficients, which increases computational complexity and memory requirements. Face regions (such as the eyes, nose, mouth, wrinkles, dimples, and scars) are enhanced as key features to represent the face in a high-dimensional space. The method is fairly robust to misalignment because it captures local texture characterized by spatial frequency, spatial position, and orientation.
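A minimal Gabor filter-bank sketch using OpenCV; the kernel parameters and the simple mean/std summary are illustrative assumptions rather than a specific published configuration.

```python
import cv2
import numpy as np

def gabor_features(face: np.ndarray, n_orientations: int = 8) -> np.ndarray:
    """Filter a grayscale face with Gabor kernels at several orientations
    and keep simple per-response statistics as a texture feature vector."""
    feats = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        response = cv2.filter2D(face.astype(np.float32), cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])  # crude texture summary
    return np.array(feats)
```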
Expression Classification
After understanding the inputs and outputs, the emotion detection software undergoes training to ensure that its outputs are correct and appropriate. Usually, there are two approaches:
- Categorical: Categorising emotions into classes. (Images as labeled output)
- Dimensional: No discrete classification, but emotions exist on a spectrum, so outputs need to lie on sliding scales. The PAD emotional state model (a psychological model developed by Albert Mehrabian and James A. Russell from 1974 onward to describe and measure emotional states) uses three numerical dimensions, Pleasure, Arousal, and Dominance, to represent all emotions, while the Circumplex model of affect (in which all affective states arise from cognitive interpretations of core neural sensations produced by two independent neurophysiological systems) uses two.
Ekman defined two main types of classes used in facial expression recognition: action units and prototypic facial expressions.
Hidden Markov Model (HMM) uses a set of statistical models to describe the statistical behavior of a signal. In the paper by M. H. Siddiqi et al., "A Brief Review of Facial Emotion Recognition Based on Visual Information" (2018), the authors investigate five choices of HMM model: (i) left-right HMM, (ii) ergodic HMM, (iii) emotion-specific HMMs, (iv) multilevel HMM, and (v) a mixture of HMM and neural network. They conclude that the HMM achieves better classification with either a 3- or 5-state model on independent expressions, or with multilevel state models on combinations of expressions.
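A minimal sketch of the emotion-specific HMM variant using the hmmlearn package, assuming per-frame feature sequences have already been extracted from short expression clips; the data layout, state count, and emotion list are illustrative assumptions.

```python
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def train_emotion_hmms(sequences_by_emotion: dict, n_states: int = 3) -> dict:
    """Fit one Gaussian HMM per emotion on that emotion's frame-feature sequences."""
    models = {}
    for emotion, sequences in sequences_by_emotion.items():
        X = np.vstack(sequences)               # concatenate all frames
        lengths = [len(s) for s in sequences]  # sequence boundaries for hmmlearn
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=100)
        model.fit(X, lengths)
        models[emotion] = model
    return models

def classify_sequence(models: dict, sequence: np.ndarray) -> str:
    """Pick the emotion whose HMM assigns the new sequence the highest likelihood."""
    return max(models, key=lambda e: models[e].score(sequence))
```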
Neural Network (NN) performs a nonlinear reduction of the input dimensionality and formulates a statistical decision based on the probability that the input belongs to each associated category.
Support Vector Machine (SVM), used for classification and regression analysis, uses kernel functions to map data from the input space into high-dimensional feature spaces.
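A minimal sketch of the SVM stage with scikit-learn; random arrays stand in for real extracted features (e.g. LBP or Gabor features) and labels, and the kernel and regularization settings are illustrative defaults.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))            # placeholder extracted features
y = rng.integers(0, 7, size=300)           # placeholder 7-class emotion labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# The RBF kernel maps features into a high-dimensional space implicitly.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```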
Bayesian Network (BN) is a graphical model able to show clear and intuitive relationships among a set of attributes using a probabilistic approach.
Sparse Representation Method (SRM) sets the coefficients of non-relevant features to zero (regularization) and classifies the face image from the remaining sparse representation.
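A minimal sketch of one common sparse-representation classification scheme: a test face is expressed as a sparse (Lasso-regularized) combination of training faces, and the class with the smallest reconstruction residual wins. The Lasso solver and parameter values are illustrative choices, not necessarily the exact formulation referenced above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(train_X: np.ndarray, train_y: np.ndarray,
                 test_x: np.ndarray, alpha: float = 0.01):
    """Sparse-representation classification of one test feature vector."""
    # Express the test sample as a sparse combination of training samples;
    # the dictionary columns are the training feature vectors.
    lasso = Lasso(alpha=alpha, max_iter=5000)
    lasso.fit(train_X.T, test_x)
    coef = lasso.coef_

    # Keep only each class's coefficients and compare reconstruction residuals.
    residuals = {}
    for label in np.unique(train_y):
        mask = (train_y == label)
        recon = train_X[mask].T @ coef[mask]
        residuals[label] = np.linalg.norm(test_x - recon)
    return min(residuals, key=residuals.get)
```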
Dataset
To experiment on FER, a standard database is required, which may be either primary or secondary; collecting a primary dataset takes a long time to complete. Some of the most popular datasets currently available are AffectNet (the largest, with diverse and robust still images), Extended Cohn-Kanade (CK+) (a public benchmark dataset for action units and emotion recognition, with similar backgrounds, mostly grayscale, 640×490 pixels), JAFFE, MMI, EMOTIC, FER-2013, etc. Different inputs can be given to the system to analyze, such as images or videos.
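For instance, the widely distributed FER-2013 CSV stores each 48x48 grayscale face as a space-separated pixel string. A loading sketch follows; the column names match the common Kaggle release and may differ between copies of the dataset.

```python
import numpy as np
import pandas as pd

# The common FER-2013 release has columns: emotion (0-6), pixels, Usage.
df = pd.read_csv("fer2013.csv")

def to_image(pixel_string: str) -> np.ndarray:
    """Convert a space-separated pixel string into a 48x48 grayscale array."""
    return np.array(pixel_string.split(), dtype=np.uint8).reshape(48, 48)

X = np.stack(df["pixels"].map(to_image).to_list())   # (n_samples, 48, 48)
y = df["emotion"].to_numpy()                          # integer emotion labels
train_mask = df["Usage"] == "Training"                # vs. PublicTest / PrivateTest
print(X.shape, y.shape, int(train_mask.sum()), "training samples")
```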
Post-processing aims to enhance recognition precision by applying domain knowledge to correct classification errors.
Comparison of previous papers
FER: The Debate
Facial emotion recognition has been a hot topic of debate in recent years regarding its usability and relevance.
Experts argue that even if we restrict our database to average adults from urban societies, ignoring the anomalies that would increase if we considered children, "It is not possible to confidently infer happiness from a smile, anger from a scowl, or sadness from a frown, as much of current technology tries to do when applying what are mistakenly believed to be the scientific facts", as the psychologist and neuroscientist Lisa Feldman Barrett famously concluded.
They found limited reliability (emotions don’t always generate the same facial movements), lack of specificity (emotion-facial movement pairs don’t present unique mappings), and limited generalizability (cultural and contextual factors haven’t been sufficiently documented). The exposure of biases in face and emotion recognition technologies gave way to a more crucial debate.
However, beyond whether it is an accurate predictor, there are questions as to whether it is a safe practice and whether we are ready to have our emotions continuously analyzed and read by governments or other parties.
Setting aside the drawbacks and dilemmas of FER, let us focus on how much it has already contributed and how it could revolutionize the world in the coming years.
Emotion recognition is already used by schools and other institutions, since it can help prevent violence and improve the overall security of a place. Some companies use AI with emotion recognition API capabilities as HR assistants: the system helps determine whether a candidate is honest and truly interested in the position by evaluating intonation, facial expressions, and keywords, and it creates a report for the human recruiters' final assessment. It is also a powerful tool in business and marketing, where customer emotion is the essence of a business. The healthcare industry uses it to help physicians decide which patient to prioritize seeing first.
FER can be useful for autistic patients and mentally disabled people who have difficulty analyzing, understanding, and reading facial expressions.
FER may contribute a lot to Human-Robot Interaction (HRI), enabling robots to take on jobs where a surface-level understanding of user emotion is required, and it can be useful in making AI pets.
References
- Introduction to Emotion Recognition 2021 | RecFaces: Emotion Recognition: Introduction to Emotion Reading Technology
- Illiana Azizan and Fatimah Khalid, "Facial Emotion Recognition: A Brief Review", International Conference on Sustainable Engineering, Technology and Management (ICSETM-2018), Dec. 20, 2018, Negeri Sembilan, Malaysia