Music alignment

Figure: First theme of Symphony No. 5 by Ludwig van Beethoven in a sheet music, audio, and piano-roll representation. The red bidirectional arrows indicate the aligned time positions of corresponding note events in the different representations.

Music can be described and represented in many different ways including sheet music, symbolic representations, and audio recordings. For each of these representations, there may exist different versions that correspond to the same musical work. The general goal of music alignment (sometimes also referred to as music synchronization) is to automatically link the various data streams, thus interrelating the multiple information sets related to a given musical work. More precisely, music alignment is taken to mean a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. [1] In the figure on the right, such an alignment is visualized by the red bidirectional arrows. Such synchronization results form the basis for novel interfaces that allow users to access, search, and browse musical content in a convenient way. [2] [3]

Basic procedure

Figure: Overview of the processing pipeline of a typical music alignment procedure.

Given two different music representations, typical music alignment approaches proceed in two steps. [1] In the first step, the two representations are transformed into sequences of suitable features. In general, such feature representations need to strike a compromise between two conflicting goals. On the one hand, the features should be robust to variations that are to be left unconsidered for the task at hand. On the other hand, they should capture enough characteristic information to accomplish the given task. For music alignment, one often uses chroma-based features (also called chromagrams or pitch class profiles), which capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.
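As a concrete illustration of this first step, the sketch below derives a chromagram from an audio recording in Python. It assumes the open-source librosa library and uses a placeholder file name; real alignment systems often employ refined chroma variants with additional smoothing and normalization.

```python
import librosa
import numpy as np

# Load an audio recording (the file name is a placeholder for illustration).
y, sr = librosa.load("performance.wav", sr=22050)

# Compute a chromagram: one 12-dimensional pitch-class vector per analysis frame.
# hop_length controls the feature rate (here roughly 43 frames per second).
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

# Normalize each frame so that the features are robust to changes in dynamics.
chroma = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-9)

print(chroma.shape)  # (12, number_of_frames)
```

Each column of the resulting matrix is a 12-dimensional vector indicating how the signal's energy is distributed over the twelve pitch classes in one analysis frame.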

In the second step, the derived feature sequences have to be brought into (temporal) correspondence. To this end, techniques related to dynamic time warping (DTW) or hidden Markov models (HMMs) are used to compute an optimal alignment between two given feature sequences.
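The following minimal sketch illustrates the classical DTW recursion on two such feature sequences, using only NumPy. It is a bare-bones illustration with the standard step sizes; practical systems add step weights, global path constraints, or multiscale processing to keep the quadratic cost manageable.

```python
import numpy as np

def dtw_align(X, Y):
    """Align two feature sequences X (d x N) and Y (d x M) with classical DTW.

    Returns the optimal warping path as a list of (n, m) index pairs.
    """
    N, M = X.shape[1], Y.shape[1]
    # Cost matrix: cosine distance between all pairs of feature vectors.
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-9)
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-9)
    C = 1.0 - Xn.T @ Yn

    # Accumulated cost matrix D with the usual step sizes (1,1), (1,0), (0,1).
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = C[n - 1, m - 1] + min(D[n - 1, m - 1], D[n - 1, m], D[n, m - 1])

    # Backtracking from (N, M) to (1, 1) yields the optimal warping path.
    path = [(N - 1, M - 1)]
    n, m = N, M
    while (n, m) != (1, 1):
        steps = [(n - 1, m - 1), (n - 1, m), (n, m - 1)]
        n, m = min(steps, key=lambda s: D[s])
        path.append((n - 1, m - 1))
    return path[::-1]
```

Given the chroma matrices of two versions (for instance, computed as in the previous sketch), each pair (n, m) on the returned path links frame n of the first version to frame m of the second; multiplying the frame indices by the feature rate converts them back to time positions.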

Music alignment and related synchronization tasks have been studied extensively within the field of music information retrieval. In the following, we give some pointers to related tasks. Depending on the types of music representations involved, one can distinguish between various synchronization scenarios. For example, audio alignment refers to the task of temporally aligning two different audio recordings of a piece of music. Similarly, the goal of score–audio alignment is to coordinate note events given in a score representation with audio data. In the offline scenario, the two data streams to be aligned are known prior to the actual alignment. In this case, one can use global optimization procedures such as dynamic time warping (DTW) to find an optimal alignment. Scenarios where the data streams have to be processed online are generally harder.

One prominent online scenario is known as score following, where a musician performs a piece according to a given musical score. The goal is to identify the currently played musical events depicted in the score with high accuracy and low latency. [4] [5] In this scenario, the score is known as a whole in advance, but the performance is known only up to the current point in time. In this context, alignment techniques such as hidden Markov models or particle filters have been employed, where the current score position and tempo are modeled in a statistical sense. [6] [7] As opposed to classical DTW, such an online synchronization procedure inherently has a running time that is linear in the duration of the performed version. Its main disadvantage is that an online strategy is very sensitive to local tempo variations and deviations from the score: once the procedure is out of sync, it is very hard to recover and return to the right track.

A further online synchronization problem is known as automatic accompaniment. Given a solo part played by a musician, the task of the computer is to accompany the musician according to a given score by adjusting the tempo and other parameters in real time. Such systems were already proposed several decades ago. [8] [9] [10]
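To make the online setting more concrete, the following toy sketch tracks the score position with a simple left-to-right hidden Markov model: score positions act as hidden states, the transition model encodes the expectation that the performer advances by a small number of positions per frame, and the forward probabilities are updated as each new audio frame arrives. This is only an illustrative sketch, not one of the cited systems; the observation likelihoods are random stand-ins and would in practice be obtained by comparing the incoming chroma frame with chroma templates derived from the score.

```python
import numpy as np

def forward_update(alpha, obs_lik, max_jump=3, self_prob=0.5):
    """One online forward step of a left-to-right HMM score follower.

    alpha    : current belief over score positions (1-D array summing to 1)
    obs_lik  : likelihood of the newest audio frame under each score position
    max_jump : largest number of score positions that can be advanced per frame
    """
    S = len(alpha)
    predicted = np.zeros(S)
    # Transition model: stay at the same position with probability self_prob,
    # otherwise advance by 1..max_jump positions with equal probability.
    jump_prob = (1.0 - self_prob) / max_jump
    for s in range(S):
        predicted[s] += alpha[s] * self_prob
        for j in range(1, max_jump + 1):
            if s + j < S:
                predicted[s + j] += alpha[s] * jump_prob
    # Combine the prediction with the evidence from the current audio frame.
    alpha_new = predicted * obs_lik
    return alpha_new / (alpha_new.sum() + 1e-12)

# Toy usage: 100 score positions, performance known to start at position 0.
alpha = np.zeros(100)
alpha[0] = 1.0
for obs_lik in np.random.rand(50, 100):  # stand-in for real observation likelihoods
    alpha = forward_update(alpha, obs_lik)
print("estimated score position:", int(np.argmax(alpha)))
```

Because only the forward probabilities of the current frame are kept, each update costs time proportional to the number of score positions times the allowed jump width, which is what makes this kind of procedure suitable for real-time use.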

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.

Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. It was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms.

In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a one-dimensional sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.

Score following is the process of automatically listening to a live music performance and tracking the position in the score. It is an active area of research and stands at the intersection of artificial intelligence, pattern recognition, signal processing, and musicology. Score following was first introduced in 1984 independently by Barry Vercoe and Roger Dannenberg.

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. Recurrent neural networks are theoretically Turing complete and can run arbitrary programs to process arbitrary sequences of inputs.

Warped linear predictive coding is a variant of linear predictive coding in which the spectral representation of the system is modified, for example by replacing the unit delays used in an LPC implementation with first-order all-pass filters. This can have advantages in reducing the bitrate required for a given level of perceived audio quality/intelligibility, especially in wideband audio coding.

Computer audition (CA) or machine listening is the general field of study of algorithms and systems for audio interpretation by machines. Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in Technology Review, talks about these systems — "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."

Sparse approximation theory deals with sparse solutions for systems of linear equations. Techniques for finding these solutions and exploiting them in applications have found wide use in image processing, signal processing, machine learning, medical imaging, and more.

Music informatics is the study of music processing, in particular music representations, Fourier analysis of music, music synchronization, music structure analysis and chord recognition. Other music informatics research topics include computational music modeling, computational music analysis, optical music recognition, digital audio editors, online music search engines, music information retrieval and cognitive issues in music. Because music informatics is an emerging discipline, it is a very dynamic area of research with many diverse viewpoints, whose future is yet to be determined.

Artificial intelligence and music (AIM) is a common subject in the International Computer Music Conference, the Computing Society Conference and the International Joint Conference on Artificial Intelligence. The first International Computer Music Conference (ICMC) was held in 1974 at Michigan State University. Current research includes the application of AI in music composition, performance, theory and digital sound processing.

Speaker adaptation is an important technology for fine-tuning either features or speech models to compensate for mismatch due to inter-speaker variation. In the last decade, eigenvoice (EV) speaker adaptation has been developed. It makes use of the prior knowledge of training speakers to provide a fast adaptation algorithm. Inspired by the kernel eigenface idea in face recognition, kernel eigenvoice (KEV) has been proposed. KEV is a non-linear generalization of EV. It incorporates kernel principal component analysis, a non-linear version of principal component analysis, to capture higher-order correlations in order to further explore the speaker space and enhance recognition performance.

In applied mathematics, a bit-reversal permutation is a permutation of a sequence of n items, where n is a power of two. It is defined by indexing the elements of the sequence by the numbers from 0 to n − 1, representing each of these numbers by its binary representation, and mapping each item to the item whose representation has the same bits in the reversed order.

Antescofo is a program developed by Arshia Cont in 2007 at IRCAM in collaboration with composer Marco Stroppa to aid with the synchronization of electronics in live performances. It is a modular polyphonic score following system as well as a synchronous programming language for musical composition. Since 2012, Antescofo has been developed by a joint team from IRCAM and INRIA.

Tachyon is parallel/multiprocessor ray tracing software. It is a parallel ray tracing library for use on distributed memory parallel computers, shared memory computers, and clusters of workstations. Tachyon implements rendering features such as ambient occlusion lighting, depth-of-field focal blur, shadows, reflections, and others. It was originally developed for the Intel iPSC/860 by John Stone for his M.S. thesis at the University of Missouri-Rolla. Tachyon subsequently became a more functional and complete ray tracing engine, and it is now incorporated into a number of other open source software packages such as VMD and SageMath. Tachyon is released under a permissive license.

In communications technology, the technique of compressed sensing (CS) may be applied to the processing of speech signals under certain conditions. In particular, CS can be used to reconstruct a sparse vector from a smaller number of measurements, provided the signal can be represented in sparse domain. "Sparse domain" refers to a domain in which only a few measurements have non-zero values.

A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input, to produce a structured prediction over variable-size input structures, or a scalar prediction on it, by traversing a given structure in topological order. Recursive neural networks, sometimes abbreviated as RvNNs, have been successful, for instance, in learning sequence and tree structures in natural language processing, mainly continuous phrase and sentence representations based on word embeddings. RvNNs were first introduced to learn distributed representations of structure, such as logical terms. Models and general frameworks have been developed in further works since the 1990s.

Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may form an over-complete spanning set. This problem setup also allows the dimensionality of the signals being represented to be higher than that of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.

In Western music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized and whose tuning approximates to the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music, while being robust to changes in timbre and instrumentation.

Steven Glenn Johnson is an American mathematician known for being a co-creator of the FFTW library for software-based fast Fourier transforms and for his work on photonic crystals. He is professor of Applied Mathematics and Physics at MIT where he leads a group on Nanostructures and Computation.

References

  1. Müller, Meinard (2015). Music Synchronization. In Fundamentals of Music Processing, chapter 3, pages 115-166. Springer. doi:10.1007/978-3-319-21945-5. ISBN   978-3-319-21944-8. S2CID   8691186.
  2. Damm, David; Fremerey, Christian; Thomas, Verena; Clausen, Michael; Kurth, Frank; Müller, Meinard (2012). "A digital library framework for heterogeneous music collections: from document acquisition to cross-modal interaction". International Journal on Digital Libraries. 12 (2–3): 53–71. doi:10.1007/s00799-012-0087-y. S2CID   254076612.
  3. Müller, Meinard; Clausen, Michael; Konz, Verena; Ewert, Sebastian; Fremerey, Christian (2010). "A Multimodal Way of Experiencing and Exploring Music" (PDF). Interdisciplinary Science Reviews. 35 (2): 138–153. Bibcode:2010ISRv...35..138M. CiteSeerX   10.1.1.400.245 . doi:10.1179/030801810X12723585301110. S2CID   1739507.
  4. Cont, Arshia (2010). "A Coupled Duration-Focused Architecture for Real-Time Music-to-Score Alignment". IEEE Transactions on Pattern Analysis and Machine Intelligence. 32 (6): 974–987. CiteSeerX   10.1.1.192.2305 . doi:10.1109/TPAMI.2009.106. ISSN   0162-8828. PMID   20431125. S2CID   3522344.
  5. Orio, Nicola; Lemouton, Serge; Schwarz, Diemo (2003). "Score following: State of the art and new developments" (PDF). Proceedings of the International Conference on New Interfaces for Musical Expression (NIME): 36–41.
  6. Duan, Zhiyao; Pardo, Bryan (2011). "A state space model for online polyphonic audio-score alignment". 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (PDF). pp. 197–200. doi:10.1109/ICASSP.2011.5946374. ISBN   978-1-4577-0538-0. S2CID   2296185.
  7. Montecchio, Nicola; Cont, Arshia (2011). 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (PDF). pp. 193–196. doi:10.1109/ICASSP.2011.5946373. ISBN   978-1-4577-0538-0. S2CID   6581358.
  8. Dannenberg, Roger B. (1984). "An on-line algorithm for real-time accompaniment" (PDF). Proceedings of the International Computer Music Conference (ICMC): 193–198.
  9. Raphael, Christopher (2001). "A probabilistic expert system for automatic musical accompaniment". Journal of Computational and Graphical Statistics. 10 (3): 487–512. CiteSeerX   10.1.1.20.6559 . doi:10.1198/106186001317115081. S2CID   2505863.
  10. Dannenberg, Roger B.; Raphael, Christopher (2006). "Music score alignment and computer accompaniment" (PDF). Communications of the ACM. 49 (8): 38–43. CiteSeerX   10.1.1.468.2658 . doi:10.1145/1145287.1145311. ISSN   0001-0782. S2CID   207159787.