Chroma feature

(a) Musical score of a C-major scale. (b) Chromagram obtained from the score. (c) Audio recording of the C-major scale played on a piano. (d) Chromagram obtained from the audio recording.

In Western music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into twelve categories) and whose tuning approximates the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.


Definition

The underlying observation is that humans perceive two musical pitches as similar in color if they differ by an octave. Based on this observation, a pitch can be separated into two components, which are referred to as tone height and chroma. [1] Assuming the equal-tempered scale, one considers twelve chroma values represented by the set

{C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}

that consists of the twelve pitch spelling attributes as used in Western music notation. Note that in the equal-tempered scale different pitch spellings such as C♯ and D♭ refer to the same chroma. Enumerating the chroma values, one can identify the set of chroma values with the set of integers {1, 2, ..., 12}, where 1 refers to chroma C, 2 to C♯, and so on. A pitch class is defined as the set of all pitches that share the same chroma. For example, using the scientific pitch notation, the pitch class corresponding to the chroma C is the set

{..., C−2, C−1, C0, C1, C2, C3, ...}

consisting of all pitches separated by an integer number of octaves. Given a music representation (e.g., a musical score or an audio recording), the main idea of chroma features is to aggregate, for a given local time window (e.g., specified in beats or in seconds), all information that relates to a given chroma into a single coefficient. Shifting the time window across the music representation results in a sequence of chroma features, each expressing how the representation's pitch content within the time window is spread over the twelve chroma bands. The resulting time–chroma representation is also referred to as a chromagram. The figure above shows chromagrams for a C-major scale, once obtained from a musical score and once from an audio recording. Because of the close relation between the terms chroma and pitch class, chroma features are also referred to as pitch class profiles.
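The pitch-to-chroma mapping and the window-wise aggregation described above can be sketched for a symbolic (score-like) representation. The function names, the MIDI-based note encoding, and the half-second frame length below are illustrative choices, not part of any specific toolbox:

```python
import numpy as np

# Chroma labels in the order used above: index 0 = C, 1 = C#, ..., 11 = B.
CHROMA_LABELS = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def chroma_of(midi_pitch: int) -> int:
    """Map a MIDI pitch number to its chroma index (0-11).
    MIDI pitch 60 is C4; pitches differing by whole octaves
    (multiples of 12) share the same chroma."""
    return midi_pitch % 12

def score_chromagram(notes, n_frames, frame_dur=0.5):
    """Aggregate note events (onset_sec, duration_sec, midi_pitch)
    into a 12 x n_frames chromagram: each column counts how much
    note activity of each chroma falls into that time window."""
    C = np.zeros((12, n_frames))
    for onset, dur, pitch in notes:
        start = int(onset // frame_dur)
        end = int(np.ceil((onset + dur) / frame_dur))
        for f in range(start, min(end, n_frames)):
            C[chroma_of(pitch), f] += 1.0
    return C

# C-major scale, one note per half-second frame: C4 D4 E4 F4 G4 A4 B4 C5
scale = [(i * 0.5, 0.5, p)
         for i, p in enumerate([60, 62, 64, 65, 67, 69, 71, 72])]
C = score_chromagram(scale, n_frames=8)
print([CHROMA_LABELS[int(np.argmax(C[:, f]))] for f in range(8)])
# -> ['C', 'D', 'E', 'F', 'G', 'A', 'B', 'C']
```

Note that the first and last frames both activate chroma C, since C4 and C5 differ by exactly one octave and therefore fall into the same pitch class.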

Applications

By identifying pitches that differ by an octave, chroma features show a high degree of robustness to variations in timbre and correlate closely with the musical aspect of harmony. This is why chroma features are a well-established tool for processing and analyzing music data. [2] For example, essentially every chord recognition procedure relies on some kind of chroma representation. [3] [4] [5] [6] Chroma features have also become the de facto standard for tasks such as music alignment and synchronization [7] [8] as well as audio structure analysis. [9] Finally, chroma features have turned out to be a powerful mid-level feature representation in content-based audio retrieval, such as cover song identification, [10] [11] audio matching, [12] [13] [14] [15] and audio hashing. [16] [17]

Computation of audio chromagrams

There are many ways to convert an audio recording into a chromagram. For example, the conversion may be performed either by using short-time Fourier transforms in combination with binning strategies [18] [19] [20] or by employing suitable multirate filter banks. [12] Furthermore, the properties of chroma features can be changed significantly by introducing suitable pre- and post-processing steps that modify spectral, temporal, and dynamical aspects. This leads to a large number of chroma variants, which may show quite different behavior in the context of a specific music analysis scenario. [21]
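A minimal sketch of the STFT-plus-binning approach is given below: each spectral coefficient is assigned to the chroma of its nearest equal-tempered pitch, assuming an A4 = 440 Hz reference. The function name, window parameters, and frequency range are illustrative assumptions, not those of any particular reference implementation:

```python
import numpy as np

def stft_chromagram(x, sr, n_fft=4096, hop=1024, fmin=65.0, fmax=2000.0):
    """Chromagram via a short-time Fourier transform plus binning:
    each frequency bin's squared magnitude is pooled into the chroma
    band of its nearest equal-tempered pitch (A4 = 440 Hz)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # MIDI pitch of each frequency bin: p = 69 + 12 * log2(f / 440)
    valid = (freqs >= fmin) & (freqs <= fmax)
    pitches = 69 + 12 * np.log2(freqs[valid] / 440.0)
    chroma_idx = np.round(pitches).astype(int) % 12

    C = np.zeros((12, n_frames))
    for t in range(n_frames):
        frame = x[t * hop: t * hop + n_fft] * window
        mag = np.abs(np.fft.rfft(frame))[valid] ** 2
        for c in range(12):
            C[c, t] = mag[chroma_idx == c].sum()
    # Normalize each frame so columns are comparable across time.
    C /= np.maximum(C.sum(axis=0, keepdims=True), 1e-12)
    return C

# Sanity check on a pure A4 tone (440 Hz): energy lands in chroma A (index 9).
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
C = stft_chromagram(x, sr)
print(int(np.argmax(C.mean(axis=1))))  # -> 9
```

Real chroma extractors typically add the pre- and post-processing steps mentioned above (e.g., logarithmic compression, smoothing, or tuning estimation), which this sketch omits.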

See also

Audio signal processing
Dynamic time warping
Harmonic pitch class profiles
Mel-frequency cepstrum
Music alignment
Spectrogram

References

  1. Shepard, Roger N. (1964). "Circularity in judgments of relative pitch". Journal of the Acoustical Society of America. 36 (212): 2346–2353. Bibcode:1964ASAJ...36.2346S. doi:10.1121/1.1919362.
  2. Müller, Meinard (2015). Fundamentals of Music Processing. Springer. doi:10.1007/978-3-319-21945-5. ISBN   978-3-319-21944-8. S2CID   8691186.
  3. Cho, Taemin; Bello, Juan Pablo (2014). "On the Relative Importance of Individual Components of Chord Recognition Systems". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 22 (2): 477–492. doi:10.1109/TASLP.2013.2295926. S2CID 16434636.
  4. Mauch, Matthias; Dixon, Simon (2010). "Simultaneous estimation of chords and musical context from audio". IEEE Transactions on Audio, Speech, and Language Processing. 18 (6): 138–153. CiteSeerX   10.1.1.414.7800 . doi:10.1109/TASL.2009.2032947. S2CID   15866073.
  5. Fujishima, Takuya (1999). "Realtime Chord Recognition of Musical Sound: a System Using Common Lisp Music". Proceedings of the International Computer Music Conference: 464–467.
  6. Jiang, Nanzhu; Grosche, Peter; Konz, Verena; Müller, Meinard (2011). "Analyzing Chroma Feature Types for Automated Chord Recognition" (PDF). Proceedings of the AES Conference on Semantic Audio.
  7. Hu, Ning; Dannenberg, Roger B.; Tzanetakis, George (2003). "Polyphonic Audio Matching and Alignment for Music Retrieval". Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.
  8. Ewert, Sebastian; Müller, Meinard; Grosche, Peter (2009). "High resolution audio synchronization using chroma onset features" (PDF). 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. pp. 1869–1872. doi:10.1109/ICASSP.2009.4959972. ISBN   978-1-4244-2353-8. S2CID   16952895.
  9. Paulus, Jouni; Müller, Meinard; Klapuri, Anssi (2010). "Audio-based Music Structure Analysis" (PDF). Proceedings of the International Conference on Music Information Retrieval: 625–636.
  10. Ellis, Daniel P.W.; Poliner, Graham (2007). "Identifying 'Cover Songs' with Chroma Features and Dynamic Programming Beat Tracking". Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
  11. Serrà, Joan; Gómez, Emilia; Herrera, Perfecto; Serra, Xavier (2008). "Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification". IEEE Transactions on Audio, Speech, and Language Processing. 16 (6): 1138–1151. doi:10.1109/TASL.2008.924595. hdl: 10230/16277 . S2CID   10078274.
  12. Müller, Meinard; Kurth, Frank; Clausen, Michael (2005). "Audio Matching via Chroma-Based Statistical Features" (PDF). Proceedings of the International Conference on Music Information Retrieval: 288–295.
  13. Kurth, Frank; Müller, Meinard (2008). "Efficient Index-Based Audio Matching". IEEE Transactions on Audio, Speech, and Language Processing. 16 (2): 382–395. doi:10.1109/TASL.2007.911552. S2CID   206601781.
  14. Müller, Meinard (2015). Music Synchronization. In Fundamentals of Music Processing, chapter 3, pages 115-166. Springer. ISBN   978-3-319-21944-8.
  15. Kurth, Frank; Müller, Meinard (2008). "Efficient Index-Based Audio Matching". IEEE Transactions on Audio, Speech, and Language Processing. 16 (2): 382–395. doi:10.1109/TASL.2007.911552. S2CID   206601781.
  16. Yu, Yi; Crucianu, Michel; Oria, Vincent; Damiani, Ernesto (2010). "Combining multi-probe histogram and order-statistics based LSH for scalable audio content retrieval". Proceedings of the international conference on Multimedia - MM '10. Proceedings of the 18th International Conference on Multimedia 2010. pp. 381–390. doi:10.1145/1873951.1874004. ISBN   9781605589336. S2CID   9033525.
  17. Yu, Yi; Crucianu, Michel; Oria, Vincent; Chen, Lei (2009). "Local summarization and multi-level LSH for retrieving multi-variant audio tracks". Proceedings of the seventeen ACM international conference on Multimedia - MM '09. Proceedings of the 17th International Conference on Multimedia 2009. pp. 341–350. doi:10.1145/1631272.1631320. ISBN   9781605586083. S2CID   816862.
  18. Bartsch, Mark A.; Wakefield, Gregory H. (2005). "Audio thumbnailing of popular music using chroma-based representations". IEEE Transactions on Multimedia. 7 (1): 96–104. CiteSeerX   10.1.1.379.3293 . doi:10.1109/TMM.2004.840597. S2CID   12559221.
  19. Gómez, Emilia (2006). "Tonal Description of Music Audio Signals". PhD Thesis, UPF Barcelona, Spain.
  20. Müller, Meinard (2015). Music Synchronization. In Fundamentals of Music Processing, chapter 3, pages 115-166. Springer. ISBN   978-3-319-21944-8.
  21. Müller, Meinard; Ewert, Sebastian (2011). "Chroma Toolbox: MATLAB Implementations For Extracting Variants of Chroma-Based Audio Features" (PDF). Proceedings of the International Society for Music Information Retrieval Conference: 215–220.