OpenSMILE

openSMILE
Developer(s): audEERING GmbH
Initial release: September 2010
Stable release: 3.0.1 [1] / January 4, 2022
Written in: C++
Platform: Linux, macOS, Windows, Android, iOS
Type: Machine learning
License: Source-available, proprietary
Website: audeering.com

openSMILE [2] is source-available software for automatic extraction of features from audio signals and for classification of speech and music signals. "SMILE" stands for "Speech & Music Interpretation by Large-space Extraction". The software is mainly applied in the area of automatic emotion recognition and is widely used in the affective computing research community. The openSMILE project has existed since 2008 and has been maintained by the German company audEERING GmbH since 2013. openSMILE is provided free of charge for research purposes and personal use under a source-available license. For commercial use of the tool, the company audEERING offers custom license options.
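
As an illustration, the following minimal sketch uses the opensmile Python wrapper published by audEERING to extract a standard set of acoustic features from a single audio file. This is an assumption-laden example, not part of the toolkit's core distribution: the wrapper must be installed separately, and the file name "speech.wav" is a hypothetical placeholder.

    # Minimal sketch, assuming the third-party opensmile Python package is installed.
    import opensmile

    # Configure extraction of the eGeMAPSv02 functionals, one of the
    # standard acoustic feature sets shipped with openSMILE.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # "speech.wav" is a hypothetical input file; the result is a pandas
    # DataFrame with one row of summary features for the whole recording.
    features = smile.process_file("speech.wav")
    print(features.shape)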

Application Areas

openSMILE is used in academic research as well as in commercial applications to automatically analyze speech and music signals in real time. In contrast to automatic speech recognition, which extracts the spoken content from a speech signal, openSMILE recognizes the characteristics of a given speech or music segment. Examples of such characteristics encoded in human speech are a speaker's emotion, [3] age, gender, and personality, as well as speaker states such as depression, intoxication, or pathological voice disorders. The software further includes music classification technology for automatic music mood detection and recognition of chorus segments, key, chords, tempo, meter, dance-style, and genre.
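
For example, a typical research pipeline pairs openSMILE feature vectors with a conventional classifier. The sketch below is illustrative only and assumes the opensmile Python wrapper and scikit-learn are installed; the file paths and emotion labels are hypothetical placeholders, not part of openSMILE itself.

    # Illustrative sketch: openSMILE functionals feeding a simple emotion classifier.
    # Assumes the third-party opensmile and scikit-learn packages are installed.
    import opensmile
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # One feature vector (functionals) per training utterance.
    train_files = ["happy_01.wav", "sad_01.wav"]    # hypothetical paths
    train_labels = ["happy", "sad"]                 # hypothetical labels
    X_train = smile.process_files(train_files).values

    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X_train, train_labels)

    # Predict the emotion class of a new recording.
    X_test = smile.process_file("unknown.wav").values
    print(clf.predict(X_test))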

The openSMILE toolkit serves as a benchmark in numerous research competitions such as Interspeech ComParE, [4] AVEC, [5] MediaEval, [6] and EmotiW. [7]

History

The openSMILE project was started in 2008 by Florian Eyben, Martin Wöllmer, and Björn Schuller at the Technical University of Munich within the European Union research project SEMAINE. The goal of the SEMAINE project was to develop a virtual agent with emotional and social intelligence. In this system, openSMILE was applied for real-time analysis of speech and emotion. The final SEMAINE software release is based on openSMILE version 1.0.1.

In 2009, the emotion recognition toolkit openEAR, built on openSMILE, was published. "EAR" stands for "Emotion and Affect Recognition".

In 2010, openSMILE version 1.0.1 was published; it was presented at the ACM Multimedia Open-Source Software Challenge, where it received an award.

Between 2011 and 2013, the technology of openSMILE was extended and improved by Florian Eyben and Felix Weninger in the context of their doctoral theses at the Technical University of Munich. The software was also applied in the project ASC-Inclusion, which was funded by the European Union. For this project, the software was extended by Erik Marchi in order to teach emotional expression to autistic children, based on automatic emotion recognition and visualization.

In 2013, the company audEERING acquired the rights to the code-base from the Technical University of Munich and version 2.0 was published under a source-available research license.

By 2016, openSMILE had been downloaded more than 50,000 times worldwide and had established itself as a standard toolkit for emotion recognition.

Awards

openSMILE received an award in 2010 in the context of the ACM Multimedia Open Source Competition. The software tool is applied in numerous scientific publications on automatic emotion recognition. openSMILE [8] and its extension openEAR [9] have been cited in more than 1,000 scientific publications to date.

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Interactive voice response (IVR) is a technology that allows telephone users to interact with a computer-operated telephone system through the use of voice and DTMF tones input with a keypad. In telephony, IVR allows customers to interact with a company's host system via a telephone keypad or by speech recognition, after which services can be inquired about through the IVR dialogue. IVR systems can respond with pre-recorded or dynamically generated audio to further direct users on how to proceed. IVR systems deployed in the network are sized to handle large call volumes and also used for outbound calling as IVR systems are more intelligent than many predictive dialer systems.

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While some core ideas in the field may be traced as far back as to early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing and her book Affective Computing published by MIT Press. One of the motivations for the research is the ability to give machines emotional intelligence, including to simulate empathy. The machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions.


In computing, a visual programming language or block coding is a programming language that lets users create programs by manipulating program elements graphically rather than by specifying them textually. A VPL allows programming with visual expressions, spatial arrangements of text and graphic symbols, used either as elements of syntax or secondary notation. For example, many VPLs are based on the idea of "boxes and arrows", where boxes or other screen objects are treated as entities, connected by arrows, lines or arcs which represent relations.

Multimodal interaction provides the user with multiple modes of interacting with a system. A multimodal interface provides several distinct tools for input and output of data.


Thomas Shi-Tao Huang was a Chinese-born American computer scientist, electrical engineer, and writer. He was a researcher and professor emeritus at the University of Illinois at Urbana-Champaign (UIUC). Huang was one of the leading figures in computer vision, pattern recognition and human computer interaction.

Speech analytics is the process of analyzing recorded calls to gather customer information to improve communication and future interaction. The process is primarily used by customer contact centers to extract information buried in client interactions with an enterprise. Although speech analytics includes elements of automatic speech recognition, it is known for analyzing the topic being discussed, which is weighed against the emotional character of the speech and the amount and locations of speech versus non-speech during the interaction. Speech analytics in contact centers can be used to mine recorded customer interactions to surface the intelligence essential for building effective cost containment and customer service strategies. The technology can pinpoint cost drivers, trend analysis, identify strengths and weaknesses with processes and products, and help understand how the marketplace perceives offerings.

Automatic pronunciation assessment is the use of speech recognition to verify the correctness of pronounced speech, as distinguished from manual assessment by an instructor or proctor. Also called speech verification, pronunciation evaluation, and pronunciation scoring, the main application of this technology is computer-aided pronunciation teaching (CAPT) when combined with computer-aided instruction for computer-assisted language learning (CALL), speech remediation, or accent reduction. Pronunciation assessment does not determine unknown speech but instead, knowing the expected word(s) in advance, it attempts to verify the correctness of the learner's pronunciation and ideally their intelligibility to listeners, sometimes along with often inconsequential prosody such as intonation, pitch, tempo, rhythm, and stress. Pronunciation assessment is also used in reading tutoring, for example in products such as Microsoft Teams and from Amira Learning. Automatic pronunciation assessment can also be used to help diagnose and treat speech disorders such as apraxia.

As of the early 2000s, several speech recognition (SR) software packages exist for Linux. Some of them are free and open-source software and others are proprietary software. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.

ACM Multimedia (ACM-MM) is the Association for Computing Machinery (ACM)'s annual conference on multimedia, sponsored by the SIGMM special interest group on multimedia in the ACM. SIGMM specializes in the field of multimedia computing, from underlying technologies to applications, theory to practice, and servers to networks to devices.

RWTH ASR is a proprietary speech recognition toolkit.

Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.

Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. Use of technology to help people with emotion recognition is a relatively nascent research area. Generally, the technology works best if it uses multiple modalities in context. To date, the most work has been conducted on automating the recognition of facial expressions from video, spoken expressions from audio, written expressions from text, and physiology as measured by wearables.

Multimodal sentiment analysis is an extension of traditional text-based sentiment analysis that incorporates additional modalities such as audio and visual data. It can be bimodal, which includes different combinations of two modalities, or trimodal, which incorporates three modalities. With the extensive amount of social media data available online in different forms such as videos and images, the conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis, which can be applied in the development of virtual assistants, analysis of YouTube movie reviews, analysis of news videos, and emotion recognition such as depression monitoring, among others.

Personality computing is a research field related to artificial intelligence and personality psychology that studies personality by means of computational techniques from different sources, including text, multimedia and social networks.


Voice computing is the discipline that develops hardware or software to process voice inputs.


Björn Wolfgang Schuller is a scientist in electrical engineering, information technology, and computer science, as well as an entrepreneur. He is a professor of artificial intelligence at Imperial College London, UK, and holds the chair of embedded intelligence for healthcare and wellbeing at the University of Augsburg in Germany. He was previously a university professor and holder of the chair of complex and intelligent systems at the University of Passau in Germany. He is also co-founder, managing director, and current chief scientific officer (CSO) of audEERING GmbH, Germany, as well as a permanent visiting professor at the Harbin Institute of Technology in the People's Republic of China and an associate of CISA at the University of Geneva in French-speaking Switzerland.

Emily Mower Provost is a professor of computer science at the University of Michigan. She directs the Computational Human-Centered Artificial Intelligence (CHAI) Laboratory.

An audio deepfake is a type of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices to get them back. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.

References

  1. "Release openSMILE 3.0.1" . Retrieved 5 January 2022.
  2. F. Eyben, M. Wöllmer, B. Schuller: "openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor", in Proc. ACM Multimedia (MM), Florence, Italy, ACM, pp. 1459–1462, October 2010.
  3. B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, G. Rigoll, "Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies (Extended Abstract)," in Proc. of ACII 2015, Xi'an, China, invited for the Special Session on Most Influential Articles in IEEE Transactions on Affective Computing.
  4. B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon, A. Elkins, Y. Zhang, E. Coutinho: "The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception & Sincerity Archived 2017-06-09 at the Wayback Machine ", Proceedings INTERSPEECH 2016, ISCA, San Francisco, USA, 2016.
  5. F. Ringeval, B. Schuller, M. Valstar, R. Cowie, M. Pantic, “AVEC 2015 - The 5th International Audio/Visual Emotion Challenge and Workshop,” in Proceedings of the 23rd ACM International Conference on Multimedia, MM 2015, (Brisbane, Australia), ACM, October 2015.
  6. M. Eskevich, R. Aly, D. Racca, R. Ordelman, S. Chen, G. J. Jones, "The search and hyperlinking task at MediaEval 2014".
  7. F. Ringeval, S. Amiriparian, F. Eyben, K. Scherer, B. Schuller, "Emotion Recognition in the Wild: Incorporating Voice and Lip Activity in Multimodal Decision-Level Fusion," in Proceedings of the ICMI 2014 EmotiW – Emotion Recognition In The Wild Challenge and Workshop (EmotiW 2014), Satellite of the 16th ACM International Conference on Multimodal Interaction (ICMI 2014), (Istanbul, Turkey), pp. 473–480, ACM, November 2014.
  8. Eyben, Florian; Wöllmer, Martin; Schuller, Björn (2010). "openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor". ACM. pp. 1459–1462.
  9. Eyben, Florian; Wöllmer, Martin; Schuller, Björn (2009). "openEAR – Introducing the Munich Open-Source Emotion and Affect Recognition Toolkit". IEEE. pp. 1–6.