Emotion recognition in conversation

Emotion recognition in conversation (ERC) is a sub-field of emotion recognition that focuses on mining human emotions from conversations or dialogues between two or more interlocutors. [1] Datasets in this field are usually derived from social platforms, which offer a free and abundant supply of samples, often containing multimodal data (i.e., some combination of textual, visual, and acoustic data). [2] Self- and inter-personal influences play a critical role [3] in identifying basic emotions such as fear, anger, joy, and surprise. The more fine-grained the emotion labels are, the harder it is to detect the correct emotion. ERC poses a number of challenges, [1] including conversational-context modeling, speaker-state modeling, the presence of sarcasm in conversation, and emotion shift across consecutive utterances of the same interlocutor.

The task

The task of ERC is to detect the emotion expressed by the speaker in each utterance of a conversation. ERC depends on three primary factors: the conversational context, the interlocutors' mental states, and their intents. [1]
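
To illustrate the input/output structure of the task, here is a minimal Python sketch. The Utterance class, the toy conversation, and the keyword-based classify stub are all hypothetical; they stand in for a real context-aware model and only show the interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    speaker: str                    # which interlocutor spoke
    text: str                       # what was said
    emotion: Optional[str] = None   # label an ERC model must predict

# A toy two-party conversation; real datasets contain thousands of these.
conversation = [
    Utterance("A", "I finally got the job!"),
    Utterance("B", "That's wonderful, congratulations!"),
    Utterance("A", "Thanks... though it means moving away from everyone."),
]

def classify(utterance: Utterance, context: list) -> str:
    """Toy stand-in for an ERC model. A real model would condition on the
    context and the speakers' latent states, not on surface keywords."""
    text = utterance.text.lower()
    if "congratulations" in text or "wonderful" in text:
        return "joy"
    if "moving away" in text:
        return "sadness"
    return "neutral"

for i, utt in enumerate(conversation):
    utt.emotion = classify(utt, conversation[:i])
    print(f"{utt.speaker}: {utt.text} -> {utt.emotion}")
```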

Datasets

IEMOCAP, [4] SEMAINE, [5] DailyDialog, [6] and MELD [7] are the four most widely used datasets in ERC. Among these four, MELD is the only one that contains multiparty dialogues.
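
For concreteness, a hedged sketch of reading MELD's text annotations with pandas follows. The file name train_sent_emo.csv and the column names are taken from the public MELD release and may differ in other distributions.

```python
import pandas as pd

# Columns assumed from the public MELD CSV release:
# Utterance, Speaker, Emotion, Dialogue_ID, Utterance_ID, ...
df = pd.read_csv("train_sent_emo.csv")

# Group utterances back into conversations for context modeling.
for dialogue_id, turns in df.groupby("Dialogue_ID"):
    for _, row in turns.sort_values("Utterance_ID").iterrows():
        print(dialogue_id, row["Speaker"], row["Emotion"], row["Utterance"])
    break  # show only the first dialogue
```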

Methods

Approaches to ERC include unsupervised, semi-supervised, and supervised [8] methods. Popular supervised methods include using or combining pre-defined features, recurrent neural networks [9] (DialogueRNN [10] ), graph convolutional networks [11] (DialogueGCN [12] ), and attention-gated hierarchical memory networks. [13] Most contemporary methods for ERC are deep-learning based and rely on the idea of latent speaker-state modeling, as in the sketch below.
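
To make the speaker-state idea concrete, here is a minimal PyTorch sketch that keeps one recurrent state per interlocutor. It echoes the spirit of DialogueRNN but is not the published architecture; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SpeakerStateERC(nn.Module):
    """Minimal latent speaker-state model: one GRU state per speaker."""

    def __init__(self, utt_dim: int, state_dim: int, n_emotions: int):
        super().__init__()
        self.cell = nn.GRUCell(utt_dim, state_dim)
        self.classifier = nn.Linear(state_dim, n_emotions)

    def forward(self, utt_feats: torch.Tensor, speakers: list) -> torch.Tensor:
        # utt_feats: (num_utterances, utt_dim), one feature vector per turn
        states = {}   # latent state per interlocutor
        logits = []
        for feats, speaker in zip(utt_feats, speakers):
            h = states.get(speaker, torch.zeros(self.cell.hidden_size))
            h = self.cell(feats.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            states[speaker] = h             # update only the current speaker
            logits.append(self.classifier(h))
        return torch.stack(logits)          # (num_utterances, n_emotions)

model = SpeakerStateERC(utt_dim=100, state_dim=64, n_emotions=7)
feats = torch.randn(3, 100)                # pretend utterance embeddings
print(model(feats, ["A", "B", "A"]).shape) # torch.Size([3, 7])
```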

Emotion Cause Recognition in Conversation

A new subtask of ERC has recently emerged that focuses on recognizing the cause of an emotion in a conversation. [14] Methods for this task typically rely on question-answering mechanisms built on language models. RECCON [14] is one of the key datasets for this task.
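
As a hedged illustration of the question-answering framing (not the exact RECCON protocol), one can pose a cause query against the conversation history using an off-the-shelf extractive QA model from Hugging Face transformers; the prompt wording here is an assumption.

```python
from transformers import pipeline

# Off-the-shelf extractive QA model; RECCON-style systems fine-tune
# similar language models specifically for emotion cause extraction.
qa = pipeline("question-answering")

history = (
    "A: I finally got the job! "
    "B: That's wonderful, congratulations! "
    "A: Thanks... though it means moving away from everyone."
)
question = ("What caused the sadness in the utterance "
            "'Thanks... though it means moving away from everyone.'?")

answer = qa(question=question, context=history)
print(answer["answer"], answer["score"])
```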

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers, with the main benefit of searchability. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT). It incorporates knowledge and research from the computer science, linguistics, and computer engineering fields. The reverse process is speech synthesis.

Recurrent neural network

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. Recurrent neural networks are theoretically Turing complete and can run arbitrary programs to process arbitrary sequences of inputs.
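
The cycle described above amounts to reusing the same weights at every time step while carrying a hidden state forward. A minimal NumPy sketch of that recurrence, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # input-to-hidden weights
U = rng.normal(size=(4, 4))   # hidden-to-hidden weights (the recurrent cycle)
b = np.zeros(4)

h = np.zeros(4)               # internal state: the network's "memory"
for x in rng.normal(size=(5, 3)):    # a sequence of five input vectors
    h = np.tanh(W @ x + U @ h + b)   # same weights reused at every step
print(h)
```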

Long short-term memory

Long short-term memory (LSTM) is an artificial recurrent neural network architecture used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) can process not only single data points but also entire sequences of data, which makes LSTM networks well suited to processing and making predictions from sequential data. For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare.
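
A minimal sketch of processing a whole sequence with a standard LSTM layer in PyTorch (sizes are arbitrary); the cell state c is what lets the network retain information over long spans.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 10, 8)    # batch of 2 sequences, 10 steps each
out, (h, c) = lstm(x)        # c is the long-term cell state
print(out.shape)             # torch.Size([2, 10, 16]): one output per step
```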

Object detection

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

There are many types of artificial neural networks (ANNs).

Deep learning

Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.

MNIST database

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
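
For reference, the dataset ships with common frameworks; a minimal sketch using the Keras loader:

```python
from tensorflow.keras.datasets import mnist

# 60,000 training and 10,000 test images, each a 28x28 grayscale array.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train[:5])   # (60000, 28, 28) and the first labels
```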

Convolutional neural network

In deep learning, a convolutional neural network is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.
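
A minimal PyTorch sketch of the shared-weight convolution-plus-downsampling pattern described above (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # shared-weight sliding filters
    nn.ReLU(),
    nn.MaxPool2d(2),     # the downsampling that breaks strict translation invariance
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # 10-way classification head
)
print(net(torch.randn(1, 1, 28, 28)).shape)     # torch.Size([1, 10])
```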

Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. The use of technology to help people with emotion recognition is a relatively nascent research area. Generally, the technology works best if it uses multiple modalities in context. To date, most work has been conducted on automating the recognition of facial expressions from video, spoken expressions from audio, written expressions from text, and physiology as measured by wearables.

Data augmentation

Data augmentation in data analysis comprises techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data derived from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model. It is closely related to oversampling in data analysis.
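
As one common concrete instance, image augmentation with torchvision's transform pipeline; the specific transforms chosen here are illustrative.

```python
from torchvision import transforms

# Each pass through this pipeline yields a slightly different copy of an
# input image, effectively enlarging the training set and acting as a
# regularizer. Usage: augmented = augment(pil_image)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])
```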

AlexNet

AlexNet is a convolutional neural network (CNN) architecture designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor.

Multimodal sentiment analysis is a new dimension of the traditional text-based sentiment analysis, which goes beyond the analysis of texts, and includes other modalities such as audio and visual data. It can be bimodal, which includes different combinations of two modalities, or trimodal, which incorporates three modalities. With the extensive amount of social media data available online in different forms such as videos and images, the conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis, which can be applied in the development of virtual assistants, analysis of YouTube movie reviews, analysis of news videos, and emotion recognition such as depression monitoring, among others.

An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.

Energy-based generative neural networks are a class of generative models that aim to learn explicit probability distributions of data in the form of energy-based models, whose energy functions are parameterized by modern deep neural networks. The name reflects the fact that these models can be derived from discriminative neural networks. The parameters of the neural network are trained in a generative manner by Markov chain Monte Carlo (MCMC)-based maximum likelihood estimation. The learning process follows an "analysis by synthesis" scheme: within each learning iteration, the algorithm samples synthesized examples from the current model by a gradient-based MCMC method, e.g., Langevin dynamics, and then updates the model parameters based on the difference between the training examples and the synthesized ones. This process can be interpreted as alternating mode seeking and mode shifting, and it also has an adversarial interpretation. The first energy-based generative neural network was the generative ConvNet, proposed in 2016 for image patterns, in which the neural network is a convolutional neural network. The model has been generalized to various domains to learn distributions of videos and 3D voxels, and variants have made it more effective. These models have proven useful for data generation, data recovery, and data reconstruction.

Video super-resolution

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore fine details while preserving coarse ones, but also to preserve motion consistency.

Layer (deep learning)

A layer in a deep learning model is a structure or network topology in the model's architecture that takes information from the previous layers and passes it to the next layer. Well-known examples include the convolutional and max-pooling layers of convolutional neural networks, the fully connected and ReLU layers of vanilla neural networks, the recurrent layers of RNN models, and the deconvolutional layers of autoencoders.

Self-supervised learning

Self-supervised learning (SSL) refers to a machine learning paradigm, and corresponding methods, for processing unlabelled data to obtain useful representations that can help with downstream learning tasks. The most salient property of SSL methods is that they do not need human-annotated labels: they are designed to operate on datasets consisting entirely of unlabelled data samples. A typical SSL pipeline learns supervisory signals from the data itself in a first stage, then uses them for a supervised learning task in the second and later stages. For this reason, SSL can be described as an intermediate form of unsupervised and supervised learning.
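
A minimal NumPy sketch of the core idea that the supervisory signal comes from the data itself; the rotation-prediction pretext task shown here is just one illustrative choice, not a specific published method.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pretext_batch(images):
    """Derive labels from the data itself: rotate each image by a random
    multiple of 90 degrees; the rotation index is the pseudo-label."""
    xs, ys = [], []
    for img in images:
        k = int(rng.integers(0, 4))   # no human annotation involved
        xs.append(np.rot90(img, k))
        ys.append(k)
    return np.stack(xs), np.array(ys)

images = rng.normal(size=(8, 28, 28))   # stand-in for unlabelled images
x, y = make_pretext_batch(images)
print(x.shape, y)   # a model trained on (x, y) learns reusable features
```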

Small object detection is a particular case of object detection in which various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects having a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques have underperformed because of small objects.

References

  1. Poria, Soujanya; Majumder, Navonil; Mihalcea, Rada; Hovy, Eduard (2019). "Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances". IEEE Access. 7: 100943–100953. arXiv:1905.02947. Bibcode:2019arXiv190502947P. doi:10.1109/ACCESS.2019.2929050. S2CID 147703962.
  2. Lee, Chul Min; Narayanan, Shrikanth (March 2005). "Toward Detecting Emotions in Spoken Dialogs". IEEE Transactions on Speech and Audio Processing. 13 (2): 293–303. doi:10.1109/TSA.2004.838534. S2CID 12710581.
  3. Hazarika, Devamanyu; Poria, Soujanya; Zimmermann, Roger; Mihalcea, Rada (Oct 2019). "Emotion Recognition in Conversations with Transfer Learning from Generative Conversation Modeling". arXiv:1910.04980 [cs.CL].
  4. Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. (2008-11-05). "IEMOCAP: interactive emotional dyadic motion capture database". Language Resources and Evaluation. 42 (4): 335–359. doi:10.1007/s10579-008-9076-6. ISSN 1574-020X. S2CID 11820063.
  5. McKeown, G.; Valstar, M.; Cowie, R.; Pantic, M.; Schroder, M. (2012-01-02). "The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent". IEEE Transactions on Affective Computing. 3 (1): 5–17. doi:10.1109/t-affc.2011.20. ISSN 1949-3045. S2CID 2995377.
  6. Li, Yanran; Su, Hui; Shen, Xiaoyu; Li, Wenjie; Cao, Ziqiang; Niu, Shuzi (2017). "DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset". Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers): 986–995.
  7. Poria, Soujanya; Hazarika, Devamanyu; Majumder, Navonil; Naik, Gautam; Cambria, Erik; Mihalcea, Rada (2019). "MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations". Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics: 527–536. arXiv:1810.02508. doi:10.18653/v1/p19-1050. S2CID 52932143.
  8. Abdelwahab, Mohammed; Busso, Carlos (2015). "Supervised domain adaptation for emotion recognition from speech". 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 5058–5062. doi:10.1109/ICASSP.2015.7178934. ISBN 978-1-4673-6997-8. S2CID 8207841.
  9. Chernykh, Vladimir; Prikhodko, Pavel; King, Irwin (Jul 2019). "Emotion Recognition From Speech With Recurrent Neural Networks". arXiv:1701.08071 [cs.CL].
  10. Majumder, Navonil; Poria, Soujanya; Hazarika, Devamanyu; Mihalcea, Rada; Gelbukh, Alexander; Cambria, Erik (2019-07-17). "DialogueRNN: An Attentive RNN for Emotion Detection in Conversations". Proceedings of the AAAI Conference on Artificial Intelligence. 33: 6818–6825. doi:10.1609/aaai.v33i01.33016818. ISSN 2374-3468.
  11. "Graph Convolutional Networks are Bringing Emotion Recognition Closer to Machines. Here's how". Tech Times. 2019-11-26. Retrieved February 25, 2020.
  12. Ghosal, Deepanway; Majumder, Navonil; Poria, Soujanya (Aug 2019). "DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation". Conference on Empirical Methods in Natural Language Processing (EMNLP).
  13. Jiao, Wenxiang; Lyu, Michael R.; King, Irwin (November 2019). "Real-Time Emotion Recognition via Attention Gated Hierarchical Memory Network". arXiv:1911.09075 [cs.CL].
  14. Poria, Soujanya; Majumder, Navonil; Hazarika, Devamanyu; Ghosal, Deepanway; Bhardwaj, Rishabh; Jian, Samson Yu Bai; Hong, Pengfei; Ghosh, Romila; Roy, Abhinaba; Chhaya, Niyati; Gelbukh, Alexander (2021-09-13). "Recognizing Emotion Cause in Conversations". Cognitive Computation. 13 (5): 1317–1332. arXiv:2012.11820. doi:10.1007/s12559-021-09925-7. ISSN 1866-9964. S2CID 229349214.