Computer audition

Computer audition (CA) or machine listening is the general field of study of algorithms and systems for audio interpretation by machines. [1] [2] Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in Technology Review, describes these systems as "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents." [3]

Inspired by models of human audition, CA deals with questions of representation, transduction, grouping, use of musical knowledge, and general sound semantics, with the aim of performing intelligent operations on audio and music signals by computer. Technically, this requires a combination of methods from the fields of signal processing, auditory modeling, music perception and cognition, pattern recognition, and machine learning, as well as more traditional methods of artificial intelligence for musical knowledge representation. [4] [5]

Applications

Just as computer vision is distinguished from image processing, computer audition is distinguished from audio engineering by its concern with understanding audio rather than processing it. It also differs from problems of speech understanding by machine, since it deals with general audio signals, such as natural sounds and musical recordings.

Applications of computer audition vary widely and include search for sounds, genre recognition, acoustic monitoring, music transcription, score following, audio texture, music improvisation, emotion in audio, and so on.

Areas of study

Since audio signals are interpreted by the human ear–brain system, that complex perceptual mechanism should be simulated somehow in software for "machine listening". In other words, to perform on par with humans, the computer should hear and understand audio content much as humans do. Analyzing audio accurately involves several fields: electrical engineering (spectrum analysis, filtering, and audio transforms); artificial intelligence (machine learning and sound classification); [6] psychoacoustics (sound perception); cognitive sciences (neuroscience and artificial intelligence); [7] acoustics (physics of sound production); and music (harmony, rhythm, and timbre). Furthermore, audio transformations such as pitch shifting, time stretching, and sound object filtering should be perceptually and musically meaningful. For best results, these transformations require perceptual understanding of spectral models, high-level feature extraction, and sound analysis/synthesis. Finally, structuring and coding the content of an audio file (sound and metadata) could benefit from efficient compression schemes, which discard inaudible information in the sound. [8] Computational models of music and sound perception and cognition can lead to more meaningful representations and to more intuitive digital manipulation and generation of sound and music in musical human-machine interfaces.
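
As a concrete illustration, the sketch below applies two such perceptually meaningful transformations, pitch shifting and time stretching, using the open-source librosa library; the file names are placeholders, and the transformation settings are arbitrary examples.

```python
import librosa
import soundfile as sf

# Load a mono recording at its native sampling rate.
# "input.wav" is a placeholder path.
y, sr = librosa.load("input.wav", sr=None, mono=True)

# Pitch shifting: raise the pitch by four semitones
# while keeping the duration unchanged.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)

# Time stretching: play back 1.5x faster
# while keeping the pitch unchanged.
y_stretched = librosa.effects.time_stretch(y, rate=1.5)

sf.write("shifted.wav", y_shifted, sr)
sf.write("stretched.wav", y_stretched, sr)
```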

The study of CA could be roughly divided into the following sub-problems:

  1. Representation: signal and symbolic. This aspect deals with time-frequency representations, both in terms of notes and spectral models, including pattern playback and audio texture.
  2. Feature extraction: sound descriptors, segmentation, onset, pitch and envelope detection, chroma, and auditory representations.
  3. Musical knowledge structures: analysis of tonality, rhythm, and harmonies.
  4. Sound similarity: methods for comparison between sounds, sound identification, novelty detection, segmentation, and clustering.
  5. Sequence modeling: matching and alignment between signals and note sequences.
  6. Source separation: methods of grouping of simultaneous sounds, such as multiple pitch detection and time-frequency clustering methods.
  7. Auditory cognition: modeling of emotions, anticipation and familiarity, auditory surprise, and analysis of musical structure.
  8. Multi-modal analysis: finding correspondences between textual, visual, and audio signals.

Representation issues

Computer audition deals with audio signals that can be represented in a variety of ways, from direct encodings of digital audio in two or more channels to symbolically represented synthesis instructions. Audio signals are usually represented in terms of analogue or digital recordings. Digital recordings are samples of an acoustic waveform or parameters of audio compression algorithms. One of the unique properties of musical signals is that they often combine different types of representations, such as graphical scores and sequences of performance actions encoded as MIDI files.

Since audio signals usually comprise multiple sound sources, it is hard to devise a parametric representation for general audio, unlike speech signals, which can be efficiently described in terms of specific models (such as the source-filter model). Parametric audio representations usually use filter banks or sinusoidal models to capture multiple sound parameters, sometimes increasing the representation size in order to capture the internal structure of the signal. Additional types of data that are relevant for computer audition are textual descriptions of audio contents, such as annotations and reviews, and visual information in the case of audio-visual recordings.
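
Time-frequency representations such as the short-time Fourier transform (STFT) underlie most of these parametric models. Below is a minimal NumPy sketch of an STFT magnitude spectrogram; the window length, hop size, and test tone are illustrative choices, not fixed conventions.

```python
import numpy as np

def stft_magnitude(x, frame_len=1024, hop=256):
    """Magnitude spectrogram of a mono signal x (1-D array)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([
        x[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real FFT of each windowed frame: rows = frames, cols = frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a 440 Hz tone sampled at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
S = stft_magnitude(x)
print(S.shape)  # (n_frames, frame_len // 2 + 1)
```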

Features

Describing the contents of general audio signals usually requires extracting features that capture specific aspects of the signal. Generally speaking, features can be divided into signal or mathematical descriptors, such as energy and descriptions of spectral shape; statistical characterizations, such as change or novelty detection; and special representations that are better adapted to the nature of musical signals or the auditory system, such as the logarithmic growth of sensitivity (bandwidth) in frequency or octave invariance (chroma).

Since parametric models of audio usually require a large number of parameters, features are used to summarize the properties of multiple parameters in a more compact and salient representation.
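
As a rough sketch of this summarization step, the code below uses the open-source librosa library to extract frame-wise energy, spectral centroid, and chroma features and averages them into a compact clip-level descriptor; the input path is a placeholder and the feature set is only one possible choice.

```python
import librosa
import numpy as np

# Placeholder input path.
y, sr = librosa.load("input.wav", sr=None, mono=True)

# Signal-level descriptors: frame-wise energy and spectral shape.
rms = librosa.feature.rms(y=y)                            # energy per frame
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # "brightness"

# A musically adapted representation: 12-dimensional chroma,
# which folds all octaves onto the pitch classes C, C#, ..., B.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Summarize the frame-wise features into one compact clip-level vector.
descriptor = np.concatenate([
    [rms.mean(), centroid.mean()],
    chroma.mean(axis=1),
])
print(descriptor.shape)  # (14,)
```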

Musical knowledge

Specific musical structures can be found by using musical knowledge as well as supervised and unsupervised machine learning methods. Examples include detecting tonality from the distribution of frequencies that correspond to patterns of note occurrence in musical scales, detecting beat structure from the distribution of note onset times, detecting musical chords from the distribution of energies at different frequencies, and so on.
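
A classical instance of such knowledge-driven analysis is key detection by correlating a pitch-class distribution with the Krumhansl-Kessler probe-tone profiles. The sketch below assumes the 12-bin distribution has already been extracted, for example from chroma features; the profile values are as commonly cited, and the toy input is illustrative.

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles (values as commonly cited).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(pitch_class_dist):
    """Return (correlation, tonic, mode) best matching a 12-bin distribution."""
    best = None
    for tonic in range(12):
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            # Rotate the profile so its tonic lands on pitch class `tonic`.
            r = np.corrcoef(pitch_class_dist, np.roll(profile, tonic))[0, 1]
            if best is None or r > best[0]:
                best = (r, NAMES[tonic], mode)
    return best

# Example: a distribution concentrated on the C major scale.
dist = np.array([5, 0, 2, 0, 3, 2, 0, 4, 0, 2, 0, 1], dtype=float)
print(estimate_key(dist))
```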

Sound similarity and sequence modeling

Sounds can be compared by comparing their features, with or without reference to time. In some cases an overall similarity can be assessed from close feature values between two sounds. In other cases, when temporal structure is important, methods of dynamic time warping need to be applied to "correct" for the different temporal scales of acoustic events. Finding repetitions and similar sub-sequences of sonic events is important for tasks such as texture synthesis and machine improvisation.
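
The sketch below is a textbook NumPy formulation of dynamic time warping between two sequences of feature vectors; practical systems usually add band constraints and step weights omitted here.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two feature sequences of shape (n, d) and (m, d)."""
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between all frames.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each step may advance in a, in b, or in both.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

# Example: the same ramp at two temporal scales.
a = np.linspace(0, 1, 50)[:, None]
b = np.linspace(0, 1, 80)[:, None]
print(dtw_distance(a, b))  # small despite the different lengths
```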

Source separation

Since one of the basic characteristics of general audio is that it comprises multiple simultaneously sounding sources, such as multiple musical instruments, people talking, machine noises, or animal vocalizations, the ability to identify and separate individual sources is very desirable. Unfortunately, no existing methods solve this problem robustly. Some methods of source separation rely on the correlation between different audio channels in multi-channel recordings. Separating sources from stereo signals requires different techniques from those usually applied in communications, where multiple sensors are available. Other source separation methods rely on training or clustering of features in mono recordings, such as tracking harmonically related partials for multiple pitch detection. Some methods, before explicit recognition, rely on revealing structures in the data without knowing them (like recognizing objects in abstract pictures without attributing meaningful labels to them) by finding the least complex representations of the data, for instance describing audio scenes as generated by a few tone patterns and their trajectories (polyphonic voices) and by acoustical contours drawn by a tone (chords). [9]
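
One widely used family of mono techniques, not specific to any single system, is non-negative matrix factorization (NMF), which models a magnitude spectrogram as a few spectral templates with time-varying activations. The sketch below implements the standard multiplicative updates for the Euclidean objective on a toy matrix.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    """Factor a non-negative matrix V (frequency x time) as W @ H,
    where W holds k spectral templates and H their activations."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean (Frobenius) objective;
        # they keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "mixture": two fixed spectra active in different time frames.
s1 = np.array([1.0, 0.0, 0.5])
s2 = np.array([0.0, 1.0, 0.5])
V = np.outer(s1, [1, 1, 0, 0]) + np.outer(s2, [0, 0, 1, 1])

W, H = nmf(V, k=2)
print(np.linalg.norm(V - W @ H))  # small reconstruction error
# A rough estimate of source i is then np.outer(W[:, i], H[i]).
```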

Auditory cognition

Listening to music and general audio is commonly not a task-directed activity. People enjoy music for various poorly understood reasons, commonly attributed to the emotional effect of music through the creation of expectations and their realization or violation. Animals attend to signs of danger in sounds, which could be either specific cues or general notions of surprising and unexpected change. Generally, this creates a situation where computer audition cannot rely solely on the detection of specific features or sound properties and has to come up with general methods of adapting to a changing auditory environment and monitoring its structure. This consists of analyzing larger repetition and self-similarity structures in audio to detect innovation, as well as the ability to predict local feature dynamics.
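
One standard way to operationalize such self-similarity analysis, due to Foote, slides a checkerboard kernel along the diagonal of a self-similarity matrix, yielding a novelty curve whose peaks mark structural change. A compact NumPy sketch over an arbitrary frame-wise feature matrix:

```python
import numpy as np

def novelty_curve(features, kernel_size=16):
    """Novelty via a checkerboard kernel slid along the diagonal of the
    self-similarity matrix of `features` (shape: n_frames x d)."""
    # Cosine self-similarity matrix.
    norms = np.linalg.norm(features, axis=1, keepdims=True) + 1e-9
    X = features / norms
    ssm = X @ X.T

    # Checkerboard kernel: +1 on the within-segment quadrants,
    # -1 on the cross-segment quadrants.
    half = kernel_size // 2
    sign = np.ones(kernel_size)
    sign[half:] = -1
    kernel = np.outer(sign, sign)

    n = ssm.shape[0]
    nov = np.zeros(n)
    for i in range(half, n - half):
        patch = ssm[i - half : i + half, i - half : i + half]
        nov[i] = np.sum(patch * kernel)
    return nov

# Example: two homogeneous "sections" give a novelty peak at the boundary.
feats = np.vstack([np.tile([1.0, 0.0], (100, 1)),
                   np.tile([0.0, 1.0], (100, 1))])
print(int(np.argmax(novelty_curve(feats))))  # ~100, the section boundary
```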

Multi-modal analysis

Among the data available for describing music are textual representations, such as liner notes, reviews, and criticism, that describe the audio contents in words. In other cases human reactions, such as emotional judgements or psycho-physiological measurements, may provide insight into the contents and structure of audio. Computer audition tries to find relations between these different representations in order to provide this additional understanding of the audio contents.

Related Research Articles

Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves—longitudinal waves which travel through air, consisting of compressions and rarefactions. The energy contained in audio signals, or sound power level, is typically measured in decibels. As audio signals may be represented in either digital or analog format, processing may occur in either domain. Analog processors operate directly on the electrical signal, while digital processors operate mathematically on its digital representation.

Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. Those involved in MIR may have a background in academic musicology, psychoacoustics, psychology, signal processing, informatics, machine learning, optical music recognition, computational intelligence or some combination of these.

Algorithmic composition is the technique of using algorithms to create music.

Audio analysis refers to the extraction of information and meaning from audio signals for analysis, classification, storage, retrieval, synthesis, etc. The observation media and interpretation methods vary, as audio analysis can refer to the human ear and how people interpret the audible sound source, or it can refer to using technology such as an audio analyzer to evaluate other qualities of a sound source, such as amplitude, distortion, and frequency response. Once an audio source's information has been observed, the information revealed can then be processed for the logical, emotional, descriptive, or otherwise relevant interpretation by the user.

Sensory substitution is a change of the characteristics of one sensory modality into stimuli of another sensory modality.

Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

Perceptual Evaluation of Audio Quality (PEAQ) is a standardized algorithm for objectively measuring perceived audio quality, developed in 1994–1998 by a joint venture of experts within Task Group 6Q of the International Telecommunication Union's Radiocommunication Sector (ITU-R). It was originally released as ITU-R Recommendation BS.1387 in 1998 and last updated in 2023. It utilizes software to simulate perceptual properties of the human ear and then integrates multiple model output variables into a single metric.

<span class="mw-page-title-main">Auditory scene analysis</span>

In perception and psychophysics, auditory scene analysis (ASA) is a proposed model for the basis of auditory perception. This is understood as the process by which the human auditory system organizes sound into perceptually meaningful elements. The term was coined by psychologist Albert Bregman. The related concept in machine perception is computational auditory scene analysis (CASA), which is closely related to source separation and blind signal separation.

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means. In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.

Machine perception is the capability of a computer system to interpret data in a manner that is similar to the way humans use their senses to relate to the world around them. Computers take in and respond to their environment through attached hardware. Until recently, input was limited to a keyboard or mouse, but advances in technology, both in hardware and software, have allowed computers to take in sensory input in a way similar to humans.

Music informatics is the study of music processing, in particular music representations, Fourier analysis of music, music synchronization, music structure analysis, and chord recognition. Other music informatics research topics include computational music modeling, computational music analysis, optical music recognition, digital audio editors, online music search engines, music information retrieval, and cognitive issues in music. Because music informatics is an emerging discipline, it is a very dynamic area of research with many diverse viewpoints, whose future is yet to be determined.

<span class="mw-page-title-main">Albert Bregman</span> Canadian psychologist and academic (1936–2023)

Albert Stanley Bregman was a Canadian academic and researcher in experimental psychology, cognitive science, and Gestalt psychology, primarily in the perceptual organization of sound.

<span class="mw-page-title-main">Music and artificial intelligence</span> Common subject in the International Computer Music Conference

Artificial intelligence and music (AIM) is a common subject in the International Computer Music Conference, the Computing Society Conference and the International Joint Conference on Artificial Intelligence. The first International Computer Music Conference (ICMC) was held in 1974 at Michigan State University. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

Computational musicology is an interdisciplinary research area between musicology and computer science. Computational musicology includes any disciplines that use computation in order to study music. It includes sub-disciplines such as mathematical music theory, computer music, systematic musicology, music information retrieval, digital musicology, sound and music computing, and music informatics. As this area of research is defined by the tools that it uses and its subject matter, research in computational musicology intersects with both the humanities and the sciences. The use of computers in order to study and analyze music generally began in the 1960s, although musicians have been using computers to assist them in the composition of music beginning in the 1950s. Today, computational musicology encompasses a wide range of research topics dealing with the multiple ways music can be represented.

Cognitive musicology is a branch of cognitive science concerned with computationally modeling musical knowledge with the goal of understanding both music and cognition.

Psychoacoustics is the branch of psychophysics involving the scientific study of sound perception and audiology—how the human auditory system perceives various sounds. More specifically, it is the branch of science studying the psychological responses associated with sound. Psychoacoustics is an interdisciplinary field drawing on many areas, including psychology, acoustics, electronic engineering, physics, biology, physiology, and computer science.

Sound and music computing (SMC) is a research field that studies the whole sound and music communication chain from a multidisciplinary point of view. By combining scientific, technological and artistic methodologies it aims at understanding, modeling and generating sound and music through computational approaches.

<span class="mw-page-title-main">International Society for Music Information Retrieval</span>

The International Society for Music Information Retrieval (ISMIR) is an international forum for research on the organization of music-related data. It started as an informal group steered by an ad hoc committee in 2000, which established a yearly symposium; hence "ISMIR", which originally stood for International Symposium on Music Information Retrieval. It was turned into a conference in 2002 while retaining the acronym. ISMIR was incorporated in Canada on July 4, 2008.

Perceptual-based 3D sound localization is the application of knowledge of the human auditory system to develop 3D sound localization technology.

Temporal envelope (ENV) and temporal fine structure (TFS) are changes in the amplitude and frequency of sound perceived by humans over time. These temporal changes are responsible for several aspects of auditory perception, including loudness, pitch and timbre perception and spatial hearing.

References

  1. Machine Audition: Principles, Algorithms and Systems. IGI Global. 2011. ISBN 9781615209194.
  2. "Machine Audition: Principles, Algorithms and Systems" (PDF).
  3. "Paris Smaragdis taught computers how to play more life-like music". Technology Review.
  4. Tanguiane (Tangian), Andranick (1993). Artificial Perception and Music Recognition. Lecture Notes in Artificial Intelligence. Vol. 746. Berlin-Heidelberg: Springer. ISBN 978-3-540-57394-4.
  5. Tanguiane (Tangian), Andranick (1994). "A principle of correlativity of perception and its application to music recognition". Music Perception. 11 (4): 465–502. doi:10.2307/40285634. JSTOR 40285634.
  6. Kelly, Daniel; Caulfield, Brian (Feb 2015). "Pervasive Sound Sensing: A Weakly Supervised Training Approach". IEEE Transactions on Cybernetics. 46 (1): 123–135. doi:10.1109/TCYB.2015.2396291. hdl:10197/6853. PMID 25675471. S2CID 16042016.
  7. Purwins, Hendrik; Herrera, Perfecto; Grachten, Maarten; Hazan, Amaury; Marxer, Ricard; Serra, Xavier (2008). "Computational models of music perception and cognition I: The perceptual and cognitive processing chain". Physics of Life Reviews. 5 (3): 151–168.
  8. Machine Listening Course Webpage at MIT.
  9. Tanguiane (Tangian), Andranick (1995). "Towards axiomatization of music perception". Journal of New Music Research. 24 (3): 247–281. doi:10.1080/09298219508570685.