In Western music, the term chroma feature or chromagram closely relates to twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into twelve categories) and whose tuning approximates to the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music, while being robust to changes in timbre and instrumentation.
The underlying observation is that humans perceive two musical pitches as similar in color if they differ by an octave. Based on this observation, a pitch can be separated into two components, which are referred to as tone height and chroma. [1] Assuming the equal-tempered scale, one considers twelve chroma values represented by the set
that consists of the twelve pitch spelling attributes as used in Western music notation. Note that in the equal-tempered scale different pitch spellings such C♯ and D♭ refer to the same chroma. Enumerating the chroma values, one can identify the set of chroma values with the set of integers {1,2,...,12}, where 1 refers to chroma C, 2 to C♯, and so on. A pitch class is defined as the set of all pitches that share the same chroma. For example, using the scientific pitch notation, the pitch class corresponding to the chroma C is the set
consisting of all pitches separated by an integer number of octaves. Given a music representation (e.g. a musical score or an audio recording), the main idea of chroma features is to aggregate for a given local time window (e.g. specified in beats or in seconds) all information that relates to a given chroma into a single coefficient. Shifting the time window across the music representation results in a sequence of chroma features each expressing how the representation's pitch content within the time window is spread over the twelve chroma bands. The resulting time-chroma representation is also referred to as chromagram. The figure above shows chromagrams for a C-major scale, once obtained from a musical score and once from an audio recording. Because of the close relation between the terms chroma and pitch class, chroma features are also referred to as pitch class profiles.
Identifying pitches that differ by an octave, chroma features show a high degree of robustness to variations in timbre and closely correlate to the musical aspect of harmony. This is the reason why chroma features are a well-established tool for processing and analyzing music data. [2] For example, basically every chord recognition procedure relies on some kind of chroma representation. [3] [4] [5] [6] Also, chroma features have become the de facto standard for tasks such as music alignment and synchronization [7] [8] as well as audio structure analysis. [9] Finally, chroma features have turned out to be a powerful mid-level feature representation in content-based audio retrieval such as cover song identification, [10] [11] audio matching [12] [13] [14] [15] or audio hashing. [16] [17]
There are many ways for converting an audio recording into a chromagram. For example, the conversion of an audio recording into a chroma representation (or chromagram) may be performed either by using short-time Fourier transforms in combination with binning strategies [18] [19] [20] or by employing suitable multirate filter banks. [12] Furthermore, the properties of chroma features can be significantly changed by introducing suitable pre- and post-processing steps modifying spectral, temporal, and dynamical aspects. This leads to a large number of chroma variants, which may show a quite different behavior in the context of a specific music analysis scenario. [21]
Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. Those involved in MIR may have a background in academic musicology, psychoacoustics, psychology, signal processing, informatics, machine learning, optical music recognition, computational intelligence, or some combination of these.
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data are represented in a 3D plot they may be called waterfall displays.
In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a one-dimensional sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.
Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.
A multipoint control unit (MCU) is a device commonly used to bridge videoconferencing connections.
Thomas Shi-Tao Huang was a Chinese-born Taiwanese-American computer scientist and electrical engineer. He was a researcher and professor emeritus at the University of Illinois at Urbana-Champaign (UIUC). Huang was one of the leading figures in computer vision, pattern recognition and human computer interaction.
Matching pursuit (MP) is a sparse approximation algorithm which finds the "best matching" projections of multidimensional data onto the span of an over-complete dictionary . The basic idea is to approximately represent a signal from Hilbert space as a weighted sum of finitely many functions taken from . An approximation with atoms has the form
SpeechBot was a web search engine for streaming media content developed at Compaq's research laboratories in Cambridge, MA and Australia. Compaq launched the website at Streaming Media West 1999 in San Jose, CA. The internet radio shows indexed by SpeechBot included The Motley Fool, Fresh Air, Talk of the Nation, The Dr. Laura Program, and Dreamland with Art Bell. By June 2003, the service had indexed over 17,000 hours of multimedia content. The website was taken offline in 2005, after HP closed their Cambridge research lab.
Warped linear predictive coding is a variant of linear predictive coding in which the spectral representation of the system is modified, for example by replacing the unit delays used in an LPC implementation with first-order all-pass filters. This can have advantages in reducing the bitrate required for a given level of perceived audio quality/intelligibility, especially in wideband audio coding.
Computer audition (CA) or machine listening is the general field of study of algorithms and systems for audio interpretation by machines. Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in Technology Review, talks about these systems — "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."
Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.
In data analysis, the self-similarity matrix is a graphical representation of similar sequences in a data series.
Computational musicology is an interdisciplinary research area between musicology and computer science. Computational musicology includes any disciplines that use computation in order to study music. It includes sub-disciplines such as mathematical music theory, computer music, systematic musicology, music information retrieval, digital musicology, sound and music computing, and music informatics. As this area of research is defined by the tools that it uses and its subject matter, research in computational musicology intersects with both the humanities and the sciences. The use of computers in order to study and analyze music generally began in the 1960s, although musicians have been using computers to assist them in the composition of music beginning in the 1950s. Today, computational musicology encompasses a wide range of research topics dealing with the multiple ways music can be represented.
Harmonic pitch class profiles (HPCP) is a group of features that a computer program extracts from an audio signal, based on a pitch class profile—a descriptor proposed in the context of a chord recognition system. HPCP are an enhanced pitch distribution feature that are sequences of feature vectors that, to a certain extent, describe tonality, measuring the relative intensity of each of the 12 pitch classes of the equal-tempered scale within an analysis frame. Often, the twelve pitch spelling attributes are also referred to as chroma and the HPCP features are closely related to what is called chroma features or chromagrams.
The International Society for Music Information Retrieval (ISMIR) is an international forum for research on the organization of music-related data. It started as an informal group steered by an ad hoc committee in 2000 which established a yearly symposium - whence "ISMIR", which meant International Symposium on Music Information Retrieval. It was turned into a conference in 2002 while retaining the acronym. ISMIR was incorporated in Canada on July 4, 2008.
Peter Balazs is an Austrian mathematician working at the Acoustics Research Institute Vienna of the Austrian Academy of Sciences.
Spectral shape analysis relies on the spectrum of the Laplace–Beltrami operator to compare and analyze geometric shapes. Since the spectrum of the Laplace–Beltrami operator is invariant under isometries, it is well suited for the analysis or retrieval of non-rigid shapes, i.e. bendable objects such as humans, animals, plants, etc.
An audio coding format is a content representation format for storage or transmission of digital audio. Examples of audio coding formats include MP3, AAC, Vorbis, FLAC, and Opus. A specific software or hardware implementation capable of audio compression and decompression to/from a specific audio coding format is called an audio codec; an example of an audio codec is LAME, which is one of several different codecs which implements encoding and decoding audio in the MP3 audio coding format in software.
Music can be described and represented in many different ways including sheet music, symbolic representations, and audio recordings. For each of these representations, there may exist different versions that correspond to the same musical work. The general goal of music alignment is to automatically link the various data streams, thus interrelating the multiple information sets related to a given musical work. More precisely, music alignment is taken to mean a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. In the figure on the right, such an alignment is visualized by the red bidirectional arrows. Such synchronization results form the basis for novel interfaces that allow users to access, search, and browse musical content in a convenient way.
Patricia Scanlon is an Irish technologist and businesswoman. She founded SoapBox Labs in 2013, a company that applies artificial intelligence to develop speech recognition applications that are specifically tuned to children's voices. Scanlon was CEO of SoapBox Labs from its founding until May 2021, when she became executive chair. In 2022, Scanlon was appointed by the Irish Government as Ireland’s first Artificial Intelligence Ambassador. In this role, she will "lead a national conversation" about the role of AI in people's lives, including its benefits and risks.