Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. [1] While some core ideas in the field may be traced as far back as early philosophical inquiries into emotion, [2] the modern idea originated with Rosalind Picard's 1995 paper entitled "Affective Computing" [3] and her 1997 book of the same name [4] published by MIT Press. [5] [6] One motivation for researching affective computing is to give machines emotional intelligence, including the ability to simulate empathy. The goal is that a machine should interpret the emotional state of humans and adapt its behavior to those emotions, responding appropriately. Recent experimental research has shown that subtle affective haptic feedback can shape human reward learning and mobile interaction behavior, [7] suggesting that affective computing systems may not only interpret emotional states but also actively modulate user actions through emotion-laden outputs.
Detecting emotional information usually begins with passive sensors that capture data about the user's physical state or behavior without interpreting the input. The data gathered is analogous to the cues humans use to perceive emotions in others. For example, a video camera might capture facial expressions, body posture, and gestures, while a microphone might capture speech. Other sensors detect emotional cues by directly measuring physiological data, such as skin temperature and galvanic skin resistance. [8]
Recognizing emotional information requires the extraction of meaningful patterns from the gathered data. This is done using machine learning techniques that process different modalities, such as speech recognition, natural language processing, or facial expression detection. The goal of most of these techniques is to produce labels that would match the labels a human perceiver would give in the same situation. For example, if a person makes a facial expression furrowing their brow, then the computer vision system might be trained to label their face as appearing "confused" or as "concentrating" or "slightly negative" (as opposed to positive, which it might say if they were smiling in a happy-appearing way). This response is based on the data used to train the system. These labels may or may not correspond to what the person is actually feeling.
Another area within affective computing is the design of computational devices intended either to exhibit innate emotional capabilities or to convincingly simulate emotions. A more practical approach, given current technological capabilities, is to simulate emotions in conversational agents in order to enrich and facilitate interaction between humans and machines. [9]
Marvin Minsky, one of the pioneering computer scientists in artificial intelligence, relates emotions to the broader issues of machine intelligence, stating in The Emotion Machine that emotion is "not especially different from the processes that we call 'thinking.'" [10] The innovative "digital humans" or virtual humans approach also attempts to give these human-simulating programs an emotional dimension, including reactions, facial expressions, and gestures consistent with how a real person would react in certain emotionally stimulating situations. [11]
Emotion in machines often refers to emotion in computational, often AI-based, systems. As a result, the terms 'emotional AI' and 'emotion AI' are being used. [12] Some modern large language models simulate emotions in their chats with humans, although this is not a perfected science. ChatGPT's simulated emotion leans more positive than that of most human responses. [13]
In psychology, cognitive science, and in neuroscience, there have been two main approaches for describing how humans perceive and classify emotion: continuous or categorical. The continuous approach tends to use dimensions such as negative vs. positive, calm vs. aroused.
The categorical approach tends to use discrete classes such as happy, sad, angry, fearful, surprise, and disgust. Different kinds of machine learning regression and classification models are used for machines to produce continuous or discrete labels. Sometimes, models are also built that allow combinations across the categories (e.g. a happy-surprised face or a fearful-surprised face). [14]
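To make the relation between the two approaches concrete, the following minimal Python sketch takes a continuous (valence, arousal) estimate, such as a regression model might output, and maps it to the nearest of a few categorical labels. The prototype coordinates are hypothetical placements on a [-1, 1] scale chosen only for illustration; they are not values from the literature.

```python
import math

# Hypothetical prototype positions on a valence-arousal plane, both axes in [-1, 1].
# Valence: negative (-1) to positive (+1); arousal: calm (-1) to aroused (+1).
PROTOTYPES = {
    "happy": (0.8, 0.5),
    "angry": (-0.7, 0.7),
    "sad": (-0.7, -0.5),
    "calm": (0.6, -0.6),
}

def to_category(valence: float, arousal: float) -> str:
    """Return the categorical label whose prototype is closest (Euclidean distance)."""
    return min(
        PROTOTYPES,
        key=lambda label: math.dist((valence, arousal), PROTOTYPES[label]),
    )

print(to_category(0.7, 0.4))    # → happy
print(to_category(-0.5, -0.6))  # → sad
```

Nearest-prototype mapping is only one simple way to connect the two representations; in practice a classifier is often trained directly on the categorical labels.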
The following sections consider many of the kinds of input data used for the task of emotion recognition.
Various changes in the autonomic nervous system can indirectly alter a person's speech, and affective technologies can leverage this information to recognize emotion. For example, speech produced in a state of fear, anger, or joy becomes fast, loud, and precisely enunciated, with a higher and wider range in pitch, whereas emotions such as tiredness, boredom, or sadness tend to generate slow, low-pitched, and slurred speech. [15] Some emotions have been found to be more easily computationally identified, such as anger [16] or approval. [17]
Emotional speech processing technologies recognize the user's emotional state using computational analysis of speech features. Vocal parameters and prosodic features such as pitch variables and speech rate can be analyzed through pattern recognition techniques. [16] [18]
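As an illustration of the kind of features involved, the sketch below computes two simple prosodic descriptors from one frame of audio: root-mean-square energy as a loudness proxy, and a fundamental-frequency (pitch) estimate taken from the autocorrelation peak. It assumes a mono frame stored as a NumPy array and a pitch search range of 75–400 Hz; real systems typically use more robust pitch trackers and many more features.

```python
import numpy as np

def rms_energy(frame: np.ndarray) -> float:
    """Root-mean-square amplitude of the frame, a rough loudness proxy."""
    return float(np.sqrt(np.mean(frame ** 2)))

def autocorr_pitch(frame: np.ndarray, sr: int, fmin: float = 75.0, fmax: float = 400.0) -> float:
    """Estimate F0 in Hz as the lag with the highest autocorrelation within [fmin, fmax]."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep only the non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    min_lag = int(sr / fmax)
    max_lag = int(sr / fmin)
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    return sr / lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr          # one 40 ms frame
frame = 0.5 * np.sin(2 * np.pi * 200 * t)   # synthetic 200 Hz "voiced" tone
print(round(autocorr_pitch(frame, sr)))     # → 200
print(round(rms_energy(frame), 3))          # → 0.354
```

Computed over successive frames, features like these yield pitch and loudness contours from which statistics such as mean pitch, pitch range, or speaking rate can be derived.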
Speech analysis is an effective method of identifying affective state, having an average reported accuracy of 70 to 80% in research from 2003 and 2006. [19] [20] These systems tend to outperform average human accuracy (approximately 60% [16] ) but are less accurate than systems which employ other modalities for emotion detection, such as physiological states or facial expressions. [21] However, since many speech characteristics are independent of semantics or culture, this technique is considered to be a promising route for further research. [22]
The process of speech/text affect detection requires the creation of a reliable database, knowledge base, or vector space model, [23] broad enough to fit every need for its application, as well as the selection of a successful classifier which will allow for quick and accurate emotion identification.
As of 2010, the most frequently used classifiers were linear discriminant classifiers (LDC), k-nearest neighbor (k-NN), Gaussian mixture model (GMM), support vector machines (SVM), artificial neural networks (ANN), decision tree algorithms, and hidden Markov models (HMMs). [24] Various studies showed that choosing the appropriate classifier can significantly enhance the overall performance of the system. [21] The list below gives a brief description of each algorithm:
It has been shown that, given enough acoustic evidence, the emotional state of a person can be classified by a majority-voting ensemble of classifiers. The proposed ensemble is based on three main classifiers: kNN, C4.5, and SVM with an RBF kernel. It achieves better performance than each base classifier taken separately. It was compared with two other ensembles: a one-against-all (OAA) multiclass SVM with hybrid kernels, and an ensemble of two base classifiers, C5.0 and a neural network. The proposed variant outperforms both. [26]
The vast majority of present systems are data-dependent. This creates one of the biggest challenges in detecting emotions from speech, because it requires choosing an appropriate database to train the classifier. Most of the currently available data was obtained from actors and thus represents archetypal emotions. These so-called "acted databases" are usually based on Paul Ekman's Basic Emotions theory, which assumes the existence of six basic emotions (anger, fear, disgust, surprise, joy, sadness), with other emotions being mixtures of these six. [27] Nevertheless, acted databases offer high audio quality and balanced classes (although often too few), which contribute to high success rates in recognizing emotions.
However, for real-life applications, naturalistic data is preferred. A naturalistic database can be produced by observing and analyzing subjects in their natural context. Ultimately, such a database should allow the system to recognize emotions based on their context as well as work out the goals and outcomes of the interaction. Because this type of data describes states that occur naturally during human–computer interaction (HCI), it allows for authentic real-life implementation.
Despite the numerous advantages naturalistic data has over acted data, it is difficult to obtain and usually has low emotional intensity. Moreover, data obtained in a natural context has lower signal quality because of the surroundings, noise, and the subjects' distance from the microphone. The first attempt to produce such a database was the FAU Aibo Emotion Corpus for CEICES (Combining Efforts for Improving Automatic Classification of Emotional User States), which was developed in a realistic context of children (aged 10–13) playing with Sony's Aibo robot pet. [28] [29] Producing a single standard database for all emotion research would also provide a way to evaluate and compare different affect recognition systems.
The complexity of the affect recognition process increases with the number of classes (affects) and speech descriptors used within the classifier. It is therefore crucial to select only the most relevant features, both to ensure that the model can successfully identify emotions and to increase performance, which is particularly important for real-time detection. The range of possible choices is vast, with some studies mentioning the use of over 200 distinct features. [24] Identifying redundant and undesirable features helps optimize the system and increase the rate of correct emotion detection. The most common speech characteristics are categorized into the following groups. [28] [29]
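The voting step itself is simple. The following sketch assumes each base classifier (for example the kNN, C4.5, and SVM models mentioned above) has already produced one label per utterance, and combines those labels by majority vote. The tie-breaking rule (defer to the earliest classifier among the tied labels) is an assumption made here for illustration, not necessarily the rule used in the cited work.

```python
from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine per-classifier label lists into one label per sample.

    predictions[i][j] is classifier i's label for sample j. Ties are broken in
    favor of the earliest classifier among the tied labels (an assumption).
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # First vote, in classifier order, whose label reached the top count.
        combined.append(next(label for label in votes if counts[label] == top))
    return combined

knn = ["anger", "joy", "sadness", "fear"]
c45 = ["anger", "sadness", "sadness", "joy"]
svm = ["joy", "joy", "sadness", "anger"]
print(majority_vote([knn, c45, svm]))
# → ['anger', 'joy', 'sadness', 'fear']
```

With three voters, a tie only occurs when all three disagree, as in the last sample above.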
The detection and processing of facial expression is achieved through various methods such as optical flow, hidden Markov models, neural network processing, or active appearance models. More than one modality can be combined or fused (multimodal recognition, e.g. facial expressions and speech prosody, [31] facial expressions and hand gestures, [32] or facial expressions with speech and text for multimodal data and metadata analysis) to provide a more robust estimation of the subject's emotional state.
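One common way to fuse modalities is late (decision-level) fusion: each modality produces its own class probabilities, and these are combined afterwards. The sketch below uses a weighted average; the class list, the per-modality probabilities, and the weights are all hypothetical, and other strategies (feature-level fusion, learned fusion models) are equally common.

```python
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def late_fusion(modality_probs: dict[str, list[float]], weights: dict[str, float]) -> list[float]:
    """Weighted average of per-modality probability vectors, renormalized to sum to 1."""
    fused = [0.0] * len(EMOTIONS)
    for modality, probs in modality_probs.items():
        w = weights.get(modality, 1.0)  # unlisted modalities get weight 1
        for i, p in enumerate(probs):
            fused[i] += w * p
    total = sum(fused)
    return [p / total for p in fused]

probs = {
    "face":   [0.10, 0.60, 0.10, 0.20],   # hypothetical facial-expression model output
    "speech": [0.50, 0.20, 0.10, 0.20],   # hypothetical prosody model output
}
fused = late_fusion(probs, weights={"face": 0.6, "speech": 0.4})
print(EMOTIONS[max(range(len(fused)), key=fused.__getitem__)])  # → happiness
```

Here the speech model alone would have chosen "anger", but the more heavily weighted facial evidence dominates the fused estimate.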
Creating an emotion database is a difficult and time-consuming task, but it is an essential step in building a system that recognizes human emotions. Most publicly available emotion databases include posed facial expressions only. In posed-expression databases, participants are asked to display different basic emotional expressions, whereas in spontaneous-expression databases the expressions are natural. Spontaneous emotion elicitation requires significant effort in selecting stimuli that can lead to a rich display of the intended emotions. In addition, the emotions must be tagged manually by trained individuals, which makes the databases highly reliable. Since the perception of expressions and their intensity is subjective, annotation by experts is essential for validation.
Researchers work with three types of databases: a database of peak expression images only, a database of image sequences portraying an emotion from neutral to its peak, and video clips with emotional annotations. Many facial expression databases have been created and made public for expression recognition purposes. Two widely used databases are CK+ and JAFFE.
During the late 1960s, Paul Ekman proposed that facial expressions of emotion are not culturally determined but universal. He based this on cross-cultural research among the Fore people of Papua New Guinea. He therefore suggested that these expressions are biological in origin and can be safely and correctly categorized. [27] In 1972 he formally put forth six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. [33]
However, in the 1990s, Ekman expanded his list of basic emotions, including a range of positive and negative emotions, not all of which are encoded in facial muscles. [34] The new emotions are:
Psychologists have conceived a system to formally categorize the physical expression of emotions on faces. The central concept of the Facial Action Coding System (FACS), created by Paul Ekman and Wallace V. Friesen in 1978 based on earlier work by Carl-Herman Hjortsjö, [35] is the action unit (AU): essentially, a contraction or relaxation of one or more muscles. Psychologists have proposed the following classification of the six basic emotions, plus contempt, according to their action units ("+" here means "and"; a trailing letter such as "B" marks intensity, and the prefix "R" marks an action on the right side of the face only):
| Emotion | Action units |
|---|---|
| Happiness | 6+12 |
| Sadness | 1+4+15 |
| Surprise | 1+2+5B+26 |
| Fear | 1+2+4+5+20+26 |
| Anger | 4+5+7+23 |
| Disgust | 9+15+16 |
| Contempt | R12A+R14A |
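One direct way to use this table is as a rule-based lookup over detected action units. The Python sketch below strips the intensity letters and side prefixes, then scores each emotion by the fraction of its listed action units that were detected. This is an illustrative assumption about how the table might be applied, not a validated classifier; practical systems usually learn such mappings from annotated data.

```python
import re

# The table above as strings; letters mark intensity and "R" a right-side action.
EMOTION_AUS = {
    "Happiness": "6+12",
    "Sadness": "1+4+15",
    "Surprise": "1+2+5B+26",
    "Fear": "1+2+4+5+20+26",
    "Anger": "4+5+7+23",
    "Disgust": "9+15+16",
    "Contempt": "R12A+R14A",
}

def parse_aus(code: str) -> set[int]:
    """Turn e.g. 'R12A+R14A' into {12, 14}, dropping side and intensity markers."""
    return {int(re.sub(r"\D", "", part)) for part in code.split("+")}

def score_emotions(detected: set[int]) -> dict[str, float]:
    """Fraction of each emotion's action units that appear in the detected set."""
    scores = {}
    for emotion, code in EMOTION_AUS.items():
        required = parse_aus(code)
        scores[emotion] = len(required & detected) / len(required)
    return scores

scores = score_emotions({1, 2, 5, 26})
print(max(scores, key=scores.get))  # → Surprise
```

Note that fear shares four of its six action units with surprise, so a partial match scores fear at about 0.67 here; overlaps like this are one reason simple lookups are unreliable.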
As with every computational practice, some obstacles must be overcome to fully unlock the potential of affect detection by facial processing. In the early days of AI-based detection (speech recognition, face recognition, and affect recognition), the accuracy of modeling and tracking was an issue. As hardware has improved, this lack of accuracy has faded, but noise remains a problem. Methods for noise removal exist, including neighborhood averaging, linear Gaussian smoothing, median filtering, [36] or newer methods such as the Bacterial Foraging Optimization Algorithm. [37] [38]
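Of the noise-removal methods listed, median filtering is easy to sketch. The NumPy function below applies a 3×3 median filter to a grayscale image, replacing each pixel with the median of its neighborhood; this removes isolated "salt-and-pepper" outliers while preserving edges better than plain averaging. Handling the image border by replicating edge pixels is an assumption of this sketch.

```python
import numpy as np

def median_filter3(img: np.ndarray) -> np.ndarray:
    """3x3 median filter for a 2-D grayscale image, with edge-replicated borders."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Stack the nine shifted views of the image, then take the per-pixel median.
    windows = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(3) for dx in range(3)
    ])
    return np.median(windows, axis=0)

img = np.full((5, 5), 100.0)
img[2, 2] = 255.0                 # a single "salt" noise pixel
clean = median_filter3(img)
print(clean[2, 2])                # → 100.0
```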
Other challenges include:
Gestures can be used as a means of detecting the particular emotional state of a user, especially when used in conjunction with speech and face recognition. Gestures could be simple reflexive responses, like lifting one's shoulders when one does not know the answer to a question, or they could be complex and meaningful, as when communicating with sign language. Without using any object or the surrounding environment, we can wave our hands, clap, or beckon. When using objects, on the other hand, we can point at them, move, touch, or handle them. To be useful for human–computer interaction, a computer should be able to recognize these gestures, analyze the context, and respond in a meaningful way.
Gestures can be detected using many methods. [40] There are two main approaches to gesture recognition: 3D model-based and appearance-based. [41] The former uses 3D information about key elements of the body parts to obtain several important parameters, such as palm position or joint angles. Appearance-based systems, on the other hand, use images or videos for direct interpretation. Hand gestures have been a common focus of body gesture detection methods. [41]
A user's affective state can also be detected by monitoring and analyzing their physiological signs. These signs range from changes in heart rate and skin conductance to minute contractions of the facial muscles and changes in facial blood flow. This area is gaining momentum, and real products that implement these techniques are now appearing. The four physiological signs most commonly analyzed are blood volume pulse, galvanic skin response, facial electromyography, and facial color patterns.
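As a small example of the parameters a 3D model-based approach derives, the sketch below computes a joint angle (for instance at the elbow) from three 3D keypoints. It assumes the keypoint coordinates have already been estimated by some tracking system; the coordinates used here are made up.

```python
import math

Vec3 = tuple[float, float, float]

def joint_angle(a: Vec3, joint: Vec3, c: Vec3) -> float:
    """Angle in degrees at `joint` between the segments joint->a and joint->c."""
    u = [ai - ji for ai, ji in zip(a, joint)]
    v = [ci - ji for ci, ji in zip(c, joint)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

shoulder, elbow, wrist = (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.25, 0.0)
print(round(joint_angle(shoulder, elbow, wrist), 1))  # → 90.0
```

Tracked over time, angles like this one form the parameter trajectories that a gesture classifier can then interpret.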
A subject's blood volume pulse (BVP) can be measured by a process called photoplethysmography, which produces a graph indicating blood flow through the extremities. [42] Each peak of the wave marks a cardiac cycle in which the heart has pumped blood to the extremities. If the subject experiences fear or is startled, their heart usually 'jumps' and beats quickly for some time, while less blood reaches the extremities. This can clearly be seen on a photoplethysmograph as a decrease in the distance between the trough and the peak of the wave. As the subject calms down and the body's inner core expands, allowing more blood to flow back to the extremities, the cycle returns to normal.
Infrared light is shone on the skin by special sensor hardware, and the amount of reflected light is measured. The amount of reflected and transmitted light correlates with the BVP, because light is absorbed by hemoglobin, which is abundant in the blood.
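Because each peak marks one cardiac cycle, heart rate can be estimated from the spacing between peaks. The pure-Python sketch below uses a naive local-maximum peak detector on an already clean signal. The synthetic sine waveform, the 50 Hz sampling rate, and the minimum-height rule are assumptions for illustration; real photoplethysmography pipelines add band-pass filtering and more robust peak detection.

```python
import math

def find_peaks(signal: list[float], min_height: float) -> list[int]:
    """Indices of samples higher than both neighbors and above min_height."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1] and signal[i] > min_height
    ]

def heart_rate_bpm(peaks: list[int], fs: float) -> float:
    """Mean heart rate from the average interval between successive peaks."""
    intervals = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    return 60.0 / (sum(intervals) / len(intervals))

fs = 50.0  # assumed sampling rate in Hz
# Synthetic 10 s pulse wave at 1.2 Hz, i.e. 72 beats per minute.
ppg = [math.sin(2 * math.pi * 1.2 * n / fs) for n in range(int(10 * fs))]
peaks = find_peaks(ppg, min_height=0.5)
print(round(heart_rate_bpm(peaks, fs)))     # → 72
```

A startle response would show up here as shrinking inter-peak intervals, that is, a rising heart rate.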
It can be cumbersome to ensure that the sensor shining an infrared light and monitoring the reflected light is always pointing at the same extremity, especially because subjects often stretch and readjust their position while using a computer.
There are other factors that can affect one's blood volume pulse. Since it is a measure of blood flow through the extremities, if the subject feels particularly hot or cold, then their body may allow more or less blood to flow to the extremities regardless of the subject's emotional state.
Facial electromyography is a technique used to measure the electrical activity of the facial muscles by amplifying the tiny electrical impulses that muscle fibers generate when they contract. [43] The face expresses a great deal of emotion, but two main facial muscle groups are usually studied to detect it. The corrugator supercilii muscle, also known as the 'frowning' muscle, draws the brow down into a frown and is therefore the best test for negative, unpleasant emotional responses. The zygomaticus major muscle pulls the corners of the mouth back when one smiles and is therefore the muscle used to test for a positive emotional response.
Galvanic skin response (GSR) is an outdated term for a more general phenomenon known as electrodermal activity (EDA), in which the skin's electrical properties change. The skin is innervated by the sympathetic nervous system, so measuring its resistance or conductance provides a way to quantify small changes in the sympathetic branch of the autonomic nervous system. As sweat glands are activated, even before the skin feels sweaty, the level of EDA can be captured (usually as conductance) and used to discern small changes in autonomic arousal. The more aroused a subject is, the greater the skin conductance tends to be. [42]
Skin conductance is often measured by placing two small silver–silver chloride electrodes on the skin and applying a small voltage between them. To maximize comfort and reduce irritation, the electrodes can be placed on the wrist, legs, or feet, which leaves the hands fully free for daily activity.
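A simple way to summarize these two channels is to compare their activity: a relative increase in zygomaticus activity suggests positive affect, and a relative increase in corrugator activity suggests negative affect. The sketch below computes each channel's root-mean-square amplitude relative to a resting baseline and takes the difference as a rough valence index. The baseline-ratio normalization and the plain difference are assumptions chosen for illustration, and the signals are assumed to be already band-pass filtered.

```python
import math

def rms(samples: list[float]) -> float:
    """Root-mean-square amplitude of an EMG segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def valence_index(zyg: list[float], corr: list[float],
                  zyg_rest: list[float], corr_rest: list[float]) -> float:
    """Positive when zygomaticus activity rises more than corrugator, relative to rest."""
    zyg_change = rms(zyg) / rms(zyg_rest)
    corr_change = rms(corr) / rms(corr_rest)
    return zyg_change - corr_change

rest = [1.0, -1.0] * 50          # hypothetical resting EMG, RMS = 1
smile_zyg = [3.0, -3.0] * 50     # zygomaticus activity tripled during a smile
smile_corr = [1.0, -1.0] * 50    # corrugator unchanged
print(valence_index(smile_zyg, smile_corr, rest, rest))  # → 2.0
```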
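With a constant voltage applied across the electrodes, skin conductance follows from Ohm's law as G = I / V; with current in microamperes and voltage in volts, G comes out in microsiemens. The sketch below converts measured currents into conductance and counts skin conductance responses as trough-to-peak rises of at least a threshold amplitude. The 0.5 V supply, the 0.05 microsiemens threshold, and the sample values are assumptions for illustration; published pipelines differ in thresholds and filtering.

```python
def conductance_us(current_ua: list[float], voltage_v: float) -> list[float]:
    """Skin conductance in microsiemens: microamperes divided by volts (Ohm's law)."""
    return [i / voltage_v for i in current_ua]

def count_scrs(g: list[float], threshold: float = 0.05) -> int:
    """Count trough-to-peak rises of at least `threshold` microsiemens."""
    count = 0
    trough = peak = g[0]
    for value in g[1:]:
        if value >= peak:
            peak = value                      # still rising (or flat)
        else:
            if peak - trough >= threshold:    # a rise just ended: was it big enough?
                count += 1
            trough = peak = value             # restart tracking from the new low point
    if peak - trough >= threshold:            # a rise still in progress at the end
        count += 1
    return count

current = [2.50, 2.50, 2.55, 2.60, 2.58, 2.56, 2.56, 2.57, 2.70, 2.65]  # microamperes
g = conductance_us(current, voltage_v=0.5)
print(count_scrs(g))  # → 2
```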
The human face is supplied by a large network of blood vessels, and variations in the blood flow through these vessels yield visible color changes on the face. Regardless of whether emotions activate the facial muscles, variations in blood flow, blood pressure, glucose levels, and other changes occur. Also, the facial color signal is independent of the signal provided by facial muscle movements. [44]
In approaches based on facial color changes, Delaunay triangulation is used to create triangular local areas. Triangles covering the interior of the mouth and the eyes (sclera and iris) are removed, and the pixels of the remaining triangular areas are used to create feature vectors. [44] Evidence shows that converting pixel colors from the standard RGB color space to another color space, such as oRGB [45] or LMS channels, gives better performance when dealing with faces. [46] The vectors are therefore mapped onto the better color space and decomposed into red-green and yellow-blue channels, after which deep learning methods are used to find the corresponding emotions.
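The decomposition into red-green and yellow-blue channels can be illustrated with the linear opponent transform that, as we understand the oRGB design, forms its first stage: it maps gamma-corrected R'G'B' values to one luma channel and two chroma channels. The sketch below applies that matrix and averages the chroma values over the pixels of one region to form a small feature vector. It assumes the region's pixels have already been selected (for example by the Delaunay step described above) and omits oRGB's subsequent non-linear rotation of the chroma plane, so it is a simplification rather than the method used in the cited work; the pixel values are hypothetical.

```python
# Linear R'G'B' -> (luma, yellow-blue, red-green) opponent transform, assumed here
# to be the first stage of oRGB; the later non-linear chroma rotation is omitted.
M = [
    (0.2990, 0.5870, 0.1140),   # L': luma
    (0.5000, 0.5000, -1.0000),  # C1: yellow (+) vs blue (-)
    (0.8660, -0.8660, 0.0000),  # C2: red (+) vs green (-)
]

def to_opponent(rgb: tuple[float, float, float]) -> tuple[float, ...]:
    """Apply the 3x3 opponent matrix to one RGB pixel with channels in [0, 1]."""
    return tuple(sum(m * c for m, c in zip(row, rgb)) for row in M)

def region_features(pixels: list[tuple[float, float, float]]) -> tuple[float, float]:
    """Mean yellow-blue and red-green values over one region's pixels."""
    opp = [to_opponent(p) for p in pixels]
    n = len(opp)
    yellow_blue = sum(o[1] for o in opp) / n
    red_green = sum(o[2] for o in opp) / n
    return yellow_blue, red_green

cheek = [(0.80, 0.55, 0.50), (0.82, 0.56, 0.52)]  # hypothetical skin pixels
print(region_features(cheek))
```

Concatenating such per-region values across all retained triangles would give one feature vector per face image.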
Aesthetics, in the world of art and photography, refer to the principles of nature and appreciation of beauty. Judging beauty and other aesthetic qualities is a highly subjective task. Computer scientists at Penn State treat the challenge of automatically inferring the aesthetic quality of pictures using their visual content as a machine learning problem, with a peer-rated on-line photo sharing website as a data source. [47] They extract certain visual features based on the intuition that they can discriminate between aesthetically pleasing and displeasing images.
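To show what such visual features can look like, the sketch below computes two simple global descriptors, mean brightness and mean saturation, from an image given as a list of RGB pixels, using Python's standard `colorsys` module. These particular features are illustrative assumptions and are not claimed to be the feature set used in the Penn State work; a learned model would then map feature vectors like these to the peer ratings.

```python
import colorsys

def global_features(pixels: list[tuple[float, float, float]]) -> tuple[float, float]:
    """Mean HSV value (brightness) and mean HSV saturation; RGB channels in [0, 1]."""
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]
    n = len(hsv)
    brightness = sum(v for _, _, v in hsv) / n
    saturation = sum(s for _, s, _ in hsv) / n
    return brightness, saturation

grey = [(0.5, 0.5, 0.5)] * 4
vivid = [(1.0, 0.0, 0.0), (0.0, 0.8, 0.0)]
print(global_features(grey))    # → (0.5, 0.0)
print(global_features(vivid))   # → (0.9, 1.0)
```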
Affect influences learning. Using affective computing technology, computers can judge learners' affect and learning state by recognizing their facial expressions. In education, a teacher can use the results of this analysis to understand a student's learning and receptive abilities and then formulate reasonable teaching plans. At the same time, teachers can pay attention to students' inner feelings, which benefits students' psychological health. In asynchronous education especially, the separation in time and space removes the emotional incentive for two-way communication between teachers and students. Without the atmosphere of a traditional classroom, students easily become bored, which can affect learning outcomes. Applying affective computing in distance education systems can improve this situation. [48] Emotional AI can provide students with AI-based learning support, which benefits students' cognitive and emotional outcomes. [49]
Applications of affective computing may contribute to improving road safety. For example, a car can monitor the emotions of all occupants and engage additional safety measures, such as alerting other vehicles if it detects that the driver is angry. [50] In addition, affective computing systems may allow various interventions, such as driver assistance systems adjusted to the driver's stress level [51] as well as minimal and direct interventions to change the emotional state of the driver. [52]
Social robots, as well as a growing number of robots used in healthcare, benefit from emotional awareness because they can better judge users' and patients' emotional states, altering their actions/programming appropriately. This is especially important in countries with growing aging populations and a lack of younger workers to address their needs. [53]
Affective computing is also being applied to the development of communicative technologies for use by people with autism. [54] The affective component of a text is also increasingly gaining attention, particularly its role in the so-called emotional or emotive Internet. [55]
Affective video games can access their players' emotional states through biofeedback devices. [56] A particularly simple form of biofeedback is a gamepad that measures the pressure with which a button is pressed: this has been shown to correlate strongly with the players' level of arousal. [57] At the other end of the scale are brain–computer interfaces. [58] [59] Affective games have been used in medical research to support the emotional development of autistic children. [60]
Training methods of psychomotor operations, such as steering and maneuvering, are used in various fields such as aviation, transportation and medicine. Integrating affective computing capabilities in these types of training systems, in accordance with the adaptive automation approach, has been found to be effective in improving the quality of training and shortening the required training duration. [61]
Affective computing has potential applications in human–computer interaction, such as affective mirrors that allow the user to see how they perform, emotion monitoring agents that send a warning before one sends an angry email, or even music players that can select tracks based on mood. [62]
One idea put forth by the Romanian researcher Dr. Nicu Sebe in an interview is the analysis of a person's face while they are using a certain product (he mentioned ice cream as an example). [63] Companies would then be able to use such analysis to infer whether their product will or will not be well received by the respective market.
Affective state recognition could also be used to judge the impact of a TV advertisement, by recording viewers on video in real time and then studying their facial expressions. By averaging the results obtained from a large group of subjects, one can tell whether the commercial (or movie) has the desired effect and which elements interest viewers most.
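The averaging step can be sketched as follows: given per-second valence scores for each viewer (assumed to come from some facial-expression model), average across viewers at each time step and report the moment with the strongest average response. The scores below are invented for illustration.

```python
def average_response(scores_by_viewer: list[list[float]]) -> list[float]:
    """Mean score at each time step across viewers (all series assumed equal length)."""
    n = len(scores_by_viewer)
    return [sum(step) / n for step in zip(*scores_by_viewer)]

viewers = [
    [0.0, 0.2, 0.7, 0.1],   # hypothetical valence per second, viewer 1
    [0.1, 0.1, 0.5, 0.0],   # viewer 2
    [0.0, 0.3, 0.6, 0.2],   # viewer 3
]
mean = average_response(viewers)
peak_second = max(range(len(mean)), key=mean.__getitem__)
print(peak_second)   # → 2
```

Averaging over many viewers smooths out individual differences, so the peak reflects what the group responded to rather than any single person.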
Within the field of human–computer interaction, Rosalind Picard's cognitivist or "information model" concept of emotion has been criticized by and contrasted with the "post-cognitivist" or "interactional" pragmatist approach taken by Kirsten Boehner and others which views emotion as inherently social. [64]
Picard's focus is human–computer interaction, and her goal for affective computing is to "give computers the ability to recognize, express, and in some cases, 'have' emotions". [4] In contrast, the interactional approach seeks to help "people to understand and experience their own emotions" [65] and to improve computer-mediated interpersonal communication. It does not necessarily seek to map emotion into an objective mathematical model for machine interpretation, but rather to let humans make sense of each other's emotional expressions in open-ended ways that might be ambiguous, subjective, and sensitive to context. [65] : 284
Picard's critics describe her concept of emotion as "objective, internal, private, and mechanistic". They say it reduces emotion to a discrete psychological signal occurring inside the body that can be measured and which is an input to cognition, undercutting the complexity of emotional experience. [65] : 280 [65] : 278
The interactional approach asserts that though emotion has biophysical aspects, it is "culturally grounded, dynamically experienced, and to some degree constructed in action and interaction". [65] : 276 Put another way, it considers "emotion as a social and cultural product experienced through our interactions". [66] [65] [67]
The use of affective computing tools such as ChatGPT has led some individuals to develop parasocial relationships with large language models (LLMs). One woman interviewed by The New York Times chatted with her "A.I. boyfriend", ChatGPT, for up to fifty-six hours every week. [68] [69] Such interactions with affective computing tools may increase feelings of depression or loneliness in people with underlying mental health conditions, resulting in a phenomenon often called "chatbot psychosis". [70] Matthew Raine and Megan Garcia, who lost their sons in A.I.-related incidents, have sued OpenAI for encouraging these kinds of relationships with ChatGPT. [71]
Although affective computing systems have many potential uses, some of those uses raise ethical concerns. Affective computing systems may be used to invade privacy by analyzing people's facial expressions in public. There is also a risk that such systems could be used to manipulate an audience's emotions for propaganda purposes. Additionally, systems designed to help people may pose small trade-off risks, such as arousing feelings that are not connected to the real world or unintentionally making the rich richer through effective advertising. [72]
The introduction of emotion to computer science was done by Pickard (sic) who created the field of affective computing.
Rosalind Picard, a genial MIT professor, is the field's godmother; her 1997 book, Affective Computing, triggered an explosion of interest in the emotional side of computers and their users.