Multimodal sentiment analysis

Multimodal sentiment analysis is a technology that extends traditional text-based sentiment analysis to other modalities, such as audio and visual data. [1] It can be bimodal, which includes different combinations of two modalities, or trimodal, which incorporates three modalities. [2] With the extensive amount of social media data available online in different forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis, [3] [4] which can be applied in the development of virtual assistants, [5] the analysis of YouTube movie reviews, [6] the analysis of news videos, [7] and emotion recognition (sometimes known as emotion detection) such as depression monitoring, [8] among others.

As in traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is sentiment classification, which classifies sentiments into categories such as positive, negative, or neutral. [9] The complexity of analyzing text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion. [3] The performance of these fusion techniques and of the classification algorithms applied is influenced by the type of textual, audio, and visual features employed in the analysis. [10]

Features

Feature engineering, which involves the selection of features that are fed into machine learning algorithms, plays a key role in sentiment classification performance. [10] In multimodal sentiment analysis, a combination of different textual, audio, and visual features is employed. [3]

Textual features

As in conventional text-based sentiment analysis, some of the most commonly used textual features in multimodal sentiment analysis are unigrams and n-grams, which are sequences of one or more consecutive words in a given textual document. [11] These features are applied using bag-of-words or bag-of-concepts feature representations, in which words or concepts are represented as vectors in a suitable space. [12] [13]
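
As a concrete illustration, the following minimal sketch builds unigram and bigram bag-of-words vectors with scikit-learn's CountVectorizer and trains a simple classifier on them; the toy utterances, labels, and choice of library are illustrative assumptions rather than part of any cited system.

```python
# Minimal sketch: unigram/bigram bag-of-words features for the text modality,
# using scikit-learn. The toy utterances and labels are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

utterances = [
    "the movie was wonderful",
    "what a boring, predictable plot",
    "the acting felt flat",
    "a truly moving performance",
]
labels = ["positive", "negative", "negative", "positive"]

# ngram_range=(1, 2) produces both unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_text = vectorizer.fit_transform(utterances)

clf = LogisticRegression().fit(X_text, labels)
print(clf.predict(vectorizer.transform(["a wonderful performance"])))
```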

Audio features

Sentiment and emotion characteristics are prominent in the phonetic and prosodic properties contained in audio features. [14] Some of the most important audio features employed in multimodal sentiment analysis are mel-frequency cepstral coefficients (MFCCs), spectral centroid, spectral flux, beat histogram, beat sum, strongest beat, pause duration, and pitch. [3] openSMILE [15] and Praat are popular open-source toolkits for extracting such audio features. [16]
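
For illustration, the sketch below computes a few of the listed descriptors (MFCCs, spectral centroid, and pitch) with the librosa library and pools them into an utterance-level feature vector; librosa, the placeholder file name, and the pooling scheme are assumptions, since the toolkits cited above are openSMILE and Praat.

```python
# Sketch of low-level audio descriptors of the kind listed above (MFCC,
# spectral centroid, pitch), computed with librosa rather than openSMILE/Praat.
# "speech.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape (1, frames)
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)             # per-frame pitch in Hz

# A simple utterance-level audio feature vector: per-feature means and stds.
audio_features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    centroid.mean(axis=1), centroid.std(axis=1),
    [np.nanmean(f0)], [np.nanstd(f0)],
])
print(audio_features.shape)  # (30,)
```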

Visual features

One of the main advantages of analyzing videos rather than text alone is the presence of rich sentiment cues in the visual data. [17] Visual features include facial expressions, which are of paramount importance in capturing sentiments and emotions, as they are a main channel for conveying a person's present state of mind. [3] The smile, in particular, is considered one of the most predictive visual cues in multimodal sentiment analysis. [12] OpenFace is an open-source facial analysis toolkit for extracting and interpreting such visual features. [18]
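
As an illustration, the sketch below summarizes per-frame facial action-unit intensities, of the kind produced by OpenFace-style toolkits, into a clip-level visual feature vector; the CSV file name and the exact column names (such as AU12_r, the lip-corner puller often treated as a smile cue) are assumptions about the output layout rather than guarantees of the toolkit's format.

```python
# Sketch: summarising OpenFace-style per-frame action-unit output into a
# clip-level visual feature vector. "openface_output.csv" and the exact
# column names are assumptions about the CSV layout.
import pandas as pd

frames = pd.read_csv("openface_output.csv")
frames.columns = frames.columns.str.strip()   # headers may carry padding

# Collect action-unit intensity columns (AU..._r) and pool over frames.
au_cols = [c for c in frames.columns if c.startswith("AU") and c.endswith("_r")]
visual_features = frames[au_cols].agg(["mean", "std"]).to_numpy().ravel()

# AU12 (lip-corner puller) intensity is commonly treated as a smile cue.
smile_score = frames["AU12_r"].mean() if "AU12_r" in frames else None
print(len(visual_features), smile_score)
```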

Fusion techniques

Unlike traditional text-based sentiment analysis, multimodal sentiment analysis involves a fusion process in which data from the different modalities (text, audio, and visual) are combined and analyzed together. [3] Existing approaches to data fusion in multimodal sentiment analysis can be grouped into three main categories: feature-level, decision-level, and hybrid fusion; the performance of the sentiment classification depends on which fusion technique is employed. [3]

Feature-level fusion

Feature-level fusion (sometimes known as early fusion) gathers the features from each modality (text, audio, and visual) and concatenates them into a single feature vector, which is then fed into a classification algorithm. [19] One of the difficulties in implementing this technique is the integration of the heterogeneous features. [3]
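
A minimal sketch of feature-level fusion, with random arrays standing in for the real per-modality features, might look as follows; the feature dimensions and the choice of classifier are illustrative assumptions.

```python
# Feature-level (early) fusion sketch: one concatenated vector per sample.
# The random arrays stand in for real text/audio/visual features.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
X_text   = rng.normal(size=(n, 300))   # e.g. averaged word vectors
X_audio  = rng.normal(size=(n, 30))    # e.g. MFCC statistics
X_visual = rng.normal(size=(n, 34))    # e.g. action-unit statistics
y = rng.integers(0, 3, size=n)         # 0=negative, 1=neutral, 2=positive

# Join the heterogeneous features into a single vector, then scale them:
# scaling matters because the modalities live on very different ranges.
X_fused = np.hstack([X_text, X_audio, X_visual])
clf = SVC(kernel="rbf").fit(StandardScaler().fit_transform(X_fused), y)
```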

Decision-level fusion

Decision-level fusion (sometimes known as late fusion) feeds the data from each modality (text, audio, or visual) independently into its own classification algorithm and obtains the final sentiment classification by fusing the individual results into a single decision vector. [19] One advantage of this technique is that it eliminates the need to fuse heterogeneous data, and each modality can use its most suitable classification algorithm. [3]
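
The following sketch illustrates decision-level fusion by averaging per-class probabilities from three independently trained classifiers; the placeholder data, the particular classifiers, and the averaging rule are illustrative assumptions.

```python
# Decision-level (late) fusion sketch: one classifier per modality, then the
# per-class probabilities are averaged into a single decision vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 200
X_text, X_audio, X_visual = (rng.normal(size=(n, d)) for d in (300, 30, 34))
y = rng.integers(0, 3, size=n)

# Each modality gets its own classifier; the raw features are never mixed.
models = {
    "text":   LogisticRegression(max_iter=1000).fit(X_text, y),
    "audio":  SVC(probability=True).fit(X_audio, y),
    "visual": GaussianNB().fit(X_visual, y),
}

proba = np.mean(
    [models["text"].predict_proba(X_text),
     models["audio"].predict_proba(X_audio),
     models["visual"].predict_proba(X_visual)],
    axis=0,
)
fused_prediction = proba.argmax(axis=1)   # final sentiment per sample
```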

Hybrid fusion

Hybrid fusion is a combination of feature-level and decision-level fusion techniques that exploits complementary information from both methods during the classification process. [6] It usually involves a two-step procedure: feature-level fusion is first performed between two modalities, and decision-level fusion is then applied to combine the result of that initial fusion with the remaining modality. [20] [21]
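
A sketch of this two-step procedure is shown below, fusing text and audio at the feature level and then combining that model's decision with a visual-only classifier; which two modalities are fused first, the placeholder data, and the equal weighting are illustrative assumptions.

```python
# Hybrid fusion sketch: feature-level fusion of text + audio first, then
# decision-level fusion of that model's output with a visual-only model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200
X_text, X_audio, X_visual = (rng.normal(size=(n, d)) for d in (300, 30, 34))
y = rng.integers(0, 3, size=n)

# Step 1: early fusion of two modalities into one classifier.
X_ta = np.hstack([X_text, X_audio])
clf_ta = LogisticRegression(max_iter=1000).fit(X_ta, y)

# Step 2: late fusion with the remaining modality's classifier.
clf_v = LogisticRegression(max_iter=1000).fit(X_visual, y)
proba = 0.5 * clf_ta.predict_proba(X_ta) + 0.5 * clf_v.predict_proba(X_visual)
hybrid_prediction = proba.argmax(axis=1)
```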

Applications

Similar to text-based sentiment analysis, multimodal sentiment analysis can be applied in the development of different forms of recommender systems, such as in the analysis of user-generated videos of movie reviews [6] and general product reviews, [22] to predict the sentiments of customers and subsequently create product or service recommendations. [23] Multimodal sentiment analysis also plays an important role in the advancement of virtual assistants through the application of natural language processing (NLP) and machine learning techniques. [5] In the healthcare domain, multimodal sentiment analysis can be utilized to detect certain medical conditions such as stress, anxiety, or depression. [8] It can also be applied to understand the sentiments contained in video news programs, which is considered a complicated and challenging domain because the sentiments expressed by reporters tend to be less obvious or neutral. [24]

References

  1. Soleymani, Mohammad; Garcia, David; Jou, Brendan; Schuller, Björn; Chang, Shih-Fu; Pantic, Maja (September 2017). "A survey of multimodal sentiment analysis". Image and Vision Computing. 65: 3–14. doi:10.1016/j.imavis.2017.08.003.
  2. Karray, Fakhreddine; Alemzadeh, Milad; Saleh, Jamil Abou; Arab, Mo Nours (2008). "Human-Computer Interaction: Overview on State of the Art". International Journal on Smart Sensing and Intelligent Systems. 1: 137–159. doi:10.21307/ijssis-2017-283.
  3. Poria, Soujanya; Cambria, Erik; Bajpai, Rajiv; Hussain, Amir (September 2017). "A review of affective computing: From unimodal analysis to multimodal fusion". Information Fusion. 37: 98–125. doi:10.1016/j.inffus.2017.02.003.
  4. Nguyen, Quy Hoang; Nguyen, Minh-Van Truong; Van Nguyen, Kiet (2024). "New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis". arXiv:2405.00543 [cs.CL].
  5. "Google AI to make phone calls for you". BBC News. 8 May 2018. Retrieved 12 June 2018.
  6. Wöllmer, Martin; Weninger, Felix; Knaup, Tobias; Schuller, Björn; Sun, Congkai; Sagae, Kenji; Morency, Louis-Philippe (May 2013). "YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context". IEEE Intelligent Systems. 28 (3): 46–53. doi:10.1109/MIS.2013.34.
  7. Pereira, Moisés H. R.; Pádua, Flávio L. C.; Pereira, Adriano C. M.; Benevenuto, Fabrício; Dalip, Daniel H. (2016). "Fusing Audio, Textual and Visual Features for Sentiment Analysis of News Videos". arXiv:1604.02612 [cs.CL].
  8. Zucco, Chiara; Calabrese, Barbara; Cannataro, Mario (November 2017). "Sentiment analysis and affective computing for depression monitoring". 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. pp. 1988–1995. doi:10.1109/bibm.2017.8217966. ISBN 978-1-5090-3050-7.
  9. Pang, Bo; Lee, Lillian (2008). Opinion Mining and Sentiment Analysis. Hanover, MA: Now Publishers. ISBN 978-1601981509.
  10. Sun, Shiliang; Luo, Chen; Chen, Junyu (July 2017). "A review of natural language processing techniques for opinion mining systems". Information Fusion. 36: 10–25. doi:10.1016/j.inffus.2016.10.004.
  11. Yadollahi, Ali; Shahraki, Ameneh Gholipour; Zaiane, Osmar R. (2017). "Current State of Text Sentiment Analysis from Opinion to Emotion Mining". ACM Computing Surveys. 50 (2): 1–33. doi:10.1145/3057270.
  12. Pérez-Rosas, Verónica; Mihalcea, Rada; Morency, Louis-Philippe (May 2013). "Multimodal Sentiment Analysis of Spanish Online Videos". IEEE Intelligent Systems. 28 (3): 38–45. doi:10.1109/MIS.2013.9.
  13. Poria, Soujanya; Cambria, Erik; Hussain, Amir; Huang, Guang-Bin (March 2015). "Towards an intelligent framework for multimodal affective data analysis". Neural Networks. 63: 104–116. doi:10.1016/j.neunet.2014.10.005. PMID 25523041.
  14. Wu, Chung-Hsien; Liang, Wei-Bin (January 2011). "Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels". IEEE Transactions on Affective Computing. 2 (1): 10–21. doi:10.1109/T-AFFC.2010.16.
  15. Eyben, Florian; Wöllmer, Martin; Schuller, Björn (2009). "OpenEAR — Introducing the Munich open-source emotion and affect recognition toolkit". 2009 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE. doi:10.1109/ACII.2009.5349350. ISBN 978-1-4244-4800-5.
  16. Morency, Louis-Philippe; Mihalcea, Rada; Doshi, Payal (November 2011). "Towards multimodal sentiment analysis: Harvesting opinions from the web". Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI). ACM. pp. 169–176. doi:10.1145/2070481.2070509. ISBN 9781450306416.
  17. Poria, Soujanya; Cambria, Erik; Hazarika, Devamanyu; Majumder, Navonil; Zadeh, Amir; Morency, Louis-Philippe (2017). "Context-Dependent Sentiment Analysis in User-Generated Videos". Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 873–883. doi:10.18653/v1/p17-1081.
  18. Baltrušaitis, Tadas; Robinson, Peter; Morency, Louis-Philippe (March 2016). "OpenFace: An open source facial behavior analysis toolkit". 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). doi:10.1109/WACV.2016.7477553. ISBN 978-1-5090-0641-0.
  19. Poria, Soujanya; Cambria, Erik; Howard, Newton; Huang, Guang-Bin; Hussain, Amir (January 2016). "Fusing audio, visual and textual clues for sentiment analysis from multimodal content". Neurocomputing. 174: 50–59. doi:10.1016/j.neucom.2015.01.095.
  20. Shahla, Shahla; Naghsh-Nilchi, Ahmad Reza (2017). "Exploiting evidential theory in the fusion of textual, audio, and visual modalities for affective music video retrieval". 2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA). IEEE. doi:10.1109/PRIA.2017.7983051.
  21. Poria, Soujanya; Peng, Haiyun; Hussain, Amir; Howard, Newton; Cambria, Erik (October 2017). "Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis". Neurocomputing. 261: 217–230. doi:10.1016/j.neucom.2016.09.117.
  22. Pérez-Rosas, Verónica; Mihalcea, Rada; Morency, Louis-Philippe (2013). "Utterance-level multimodal sentiment analysis". Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL.
  23. Chui, Michael; Manyika, James; Miremadi, Mehdi; Henke, Nicolaus; Chung, Rita; Nel, Pieter; Malhotra, Sankalp. "Notes from the AI frontier: Insights from hundreds of use cases". McKinsey & Company. Retrieved 13 June 2018.
  24. Ellis, Joseph G.; Jou, Brendan; Chang, Shih-Fu (November 2014). "Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News". Proceedings of the 16th International Conference on Multimodal Interaction (ICMI). ACM. pp. 104–111. doi:10.1145/2663204.2663237. ISBN 9781450328852.