Audio inpainting

Audio inpainting (also known as audio interpolation) is an audio restoration task which deals with the reconstruction of missing or corrupted portions of a digital audio signal. [1] Inpainting techniques are employed when parts of the audio have been lost due to various factors such as transmission errors, data corruption or errors during recording. [2]

The goal of audio inpainting is to fill in the gaps (i.e., the missing portions) in the audio signal seamlessly, making the reconstructed portions indistinguishable from the original content and avoiding the introduction of audible distortions or alterations. [3]

Many techniques have been proposed to solve the audio inpainting problem, usually by analyzing the temporal [1] [4] [5] and spectral [3] [2] information surrounding each missing portion of the considered audio signal.

Corrupted spectrogram (top) and its reconstruction after performing audio inpainting (bottom)

Classic methods employ statistical models or digital signal processing algorithms [1] [4] [5] to predict and synthesize the missing or damaged sections. Recent solutions, instead, take advantage of deep learning models, thanks to the growing trend of exploiting data-driven methods in the context of audio restoration. [3] [2] [6]

Depending on the extent of the lost information, the inpainting task can be divided into three categories. Short inpainting refers to the reconstruction of a few milliseconds (approximately less than 10) of missing signal, which occurs in the case of short distortions such as clicks or clipping. [7] In this case, the goal of the reconstruction is to recover the lost information exactly. In long inpainting, instead, with gaps in the order of hundreds of milliseconds or even seconds, this goal becomes unrealistic, since restoration techniques cannot rely on local information. [8] Therefore, besides providing a coherent reconstruction, the algorithms need to generate new information that has to be semantically compatible with the surrounding context (i.e., the audio signal surrounding the gaps). [3] The case of medium-duration gaps lies between short and long inpainting. It refers to the reconstruction of tens of milliseconds of missing data, a scale where the non-stationary character of audio already becomes important. [9]

Definition

Consider a digital audio signal x. A corrupted version of x, which is the audio signal presenting missing gaps to be reconstructed, can be defined as y = m ⊙ x, where m is a binary mask encoding the reliable or missing samples of x, and ⊙ represents the element-wise product. [2] Audio inpainting aims at finding x̂ (i.e., the reconstruction), which is an estimation of x. This is an ill-posed inverse problem, characterized by a non-unique set of solutions. [2] For this reason, similarly to the formulation used for the inpainting problem in other domains, [10] [11] [12] the reconstructed audio signal can be found through an optimization problem that is formally expressed as

x̂ = argmin_{x̃} D(y, m ⊙ x̃) + R(x̃).

In particular, x̂ is the optimal reconstructed audio signal and D(y, m ⊙ x̃) is a distance measure term that computes the reconstruction accuracy between the corrupted audio signal and the estimated one. [10] For example, this term can be expressed as a mean squared error or a similar metric.

Since D is computed only on the reliable samples, there are many solutions that can minimize it. It is thus necessary to add a constraint to the minimization, in order to restrict the results to the valid solutions. [12] [11] This is expressed through the regularization term R(x̃), which is computed on the reconstructed audio signal x̃. This term encodes some kind of a-priori information on the audio data. For example, R(x̃) can express assumptions on the stationarity of the signal or on the sparsity of its representation, or it can be learned from data. [12] [11]
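To make the formulation concrete, the masking model and the data-fidelity term can be sketched in a few lines of NumPy. All names and sizes here are illustrative, not taken from the cited works:

```python
import numpy as np

def corrupt(x, gap_start, gap_len):
    """Apply a binary mask m to x, zeroing a gap (the model y = m * x)."""
    m = np.ones_like(x)
    m[gap_start:gap_start + gap_len] = 0.0
    return m * x, m

def masked_mse(y, x_hat, m):
    """Distance term: mean squared error computed only on the reliable samples."""
    return np.mean((m * (y - x_hat)) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)               # clean signal
y, m = corrupt(x, gap_start=400, gap_len=50)

# A trivial candidate that just copies the reliable samples already has zero
# data-fidelity error: this illustrates why the problem is ill-posed and why
# a regularization term on the reconstruction is needed.
x_hat = y.copy()
print(masked_mse(y, x_hat, m))  # → 0.0
```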

Techniques

There exist various techniques to perform audio inpainting. These can vary significantly, influenced by factors such as the specific application requirements, the length of the gaps and the available data. [3] In the literature, these techniques are broadly divided into model-based techniques (sometimes also referred to as signal processing techniques) [3] and data-driven techniques. [2]

Model-based techniques

Model-based techniques involve the exploitation of mathematical models or assumptions about the underlying structure of the audio signal. These models can be based on prior knowledge of the audio content or statistical properties observed in the data. By leveraging these models, missing or corrupted portions of the audio signal can be inferred or estimated. [1]

Autoregressive models are an example of model-based techniques. [5] [13] These methods interpolate or extrapolate the missing samples from the neighboring values, using mathematical functions to approximate the missing data. In particular, in autoregressive models the missing samples are completed through linear prediction. [14] The autoregressive coefficients necessary for this prediction are learned from the surrounding audio data, specifically from the data adjacent to each gap. [5] [13]
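A minimal sketch of AR-based gap filling, assuming a one-sided (left-context) extrapolation rather than the two-sided or iterative schemes of the cited works; the function names and parameters are illustrative:

```python
import numpy as np

def fit_ar(context, order):
    """Least-squares AR(p) fit: predict s[n] from the p previous samples."""
    p = order
    rows = np.array([context[i:i + p] for i in range(len(context) - p)])
    targets = context[p:]
    coeffs, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return coeffs

def extrapolate(context, coeffs, n_missing):
    """Fill the gap by running the AR recursion forward from the left context."""
    p = len(coeffs)
    buf = list(context[-p:])
    out = []
    for _ in range(n_missing):
        pred = float(np.dot(coeffs, buf[-p:]))
        out.append(pred)
        buf.append(pred)
    return np.array(out)

# Toy example: a pure sinusoid is predicted exactly by a low-order AR model,
# so the extrapolation reproduces the missing 50 samples almost perfectly.
n = np.arange(400)
x = np.sin(2 * np.pi * 0.01 * n)
coeffs = fit_ar(x[:300], order=2)
filled = extrapolate(x[:300], coeffs, n_missing=50)
print(np.max(np.abs(filled - x[300:350])))  # close to 0 for a sinusoid
```

Real audio is only locally stationary, which is why such predictions degrade quickly as the gap length grows.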

Some more recent techniques approach audio inpainting by representing audio signals as sparse linear combinations of a limited number of basis functions (for example, those of the short-time Fourier transform). [1] [15] In this context, the aim is to find the sparse representation of the missing section of the signal that most accurately matches the surrounding, unaffected signal. [1]
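The idea can be illustrated with a deliberately simplified sketch that alternates a sparsity projection in the DFT domain with a data-consistency step. Practical methods such as SPAIN operate frame-wise on the STFT with more refined thresholding, so this is only a toy instance of the principle:

```python
import numpy as np

def sparse_inpaint(y, m, n_iter=200, keep=5):
    """Toy sparsity-based inpainting: alternately keep the `keep` largest
    DFT coefficients (sparsity prior) and re-impose the reliable samples
    (data consistency)."""
    x = y.copy()
    for _ in range(n_iter):
        X = np.fft.rfft(x)
        idx = np.argsort(np.abs(X))[:-keep]   # indices of the small coefficients
        X[idx] = 0.0                          # hard threshold
        x = np.fft.irfft(X, n=len(y))
        x[m == 1] = y[m == 1]                 # reliable samples are known exactly
    return x

# A signal that is exactly sparse in the DFT basis: two on-bin sinusoids.
n = np.arange(512)
clean = np.sin(2 * np.pi * 5 * n / 512) + 0.5 * np.sin(2 * np.pi * 12 * n / 512)
m = np.ones(512)
m[200:240] = 0.0                              # 40-sample gap
y = m * clean
rec = sparse_inpaint(y, m)
print(np.max(np.abs(rec[200:240] - clean[200:240])))  # small for sparse signals
```

The gap content is recovered because, outside the gap, only the true sparse spectrum is consistent with the observed samples.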

The aforementioned methods exhibit optimal performance when applied to filling in relatively short gaps, lasting only a few tens of milliseconds, and thus they can be included in the context of short inpainting. However, these signal-processing techniques tend to struggle when dealing with longer gaps. [2] The reason behind this limitation lies in the violation of the stationarity condition, as the signal often undergoes significant changes after the gap, making it substantially different from the signal preceding the gap. [2]

To overcome these limitations, some approaches also add strong assumptions about the fundamental structure of the gap itself, exploiting sinusoidal modeling [16] or similarity graphs [8] to perform inpainting of longer missing portions of audio signals.

Data-driven techniques

Data-driven techniques rely on the analysis and exploitation of the available audio data. These techniques often employ deep learning algorithms that learn patterns and relationships directly from the provided data. They involve training models on large datasets of audio examples, allowing them to capture the statistical regularities present in the audio signals. Once trained, these models can be used to generate missing portions of the audio signal based on the learned representations, without being restricted by stationarity assumptions. [3] Data-driven techniques also offer the advantage of adaptability and flexibility, as they can learn from diverse audio datasets and potentially handle complex inpainting scenarios. [3]

As of today, such techniques constitute the state of the art of audio inpainting, being able to reconstruct gaps of hundreds of milliseconds or even seconds. This performance is made possible by generative models, which have the capability to generate novel content to fill in the missing portions. For example, generative adversarial networks, which are the state of the art of generative models in many areas, rely on two competing neural networks trained simultaneously in a two-player minimax game: the generator produces new data from samples of a random variable, while the discriminator attempts to distinguish between generated and real data. [17] During training, the generator's objective is to fool the discriminator, while the discriminator learns to better classify real and fake data. [17]
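The two-player objective can be sketched with toy linear networks. This only evaluates the losses of the minimax game (no training loop is shown), and every dimension and weight below is illustrative; real audio GANs use deep convolutional architectures:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    """Toy linear generator: maps latent noise to a data sample."""
    return z @ w

def discriminator(x, v):
    """Toy linear discriminator followed by a sigmoid score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(x @ v)))

w = rng.standard_normal((8, 16)) * 0.1   # generator weights (hypothetical)
v = rng.standard_normal(16) * 0.1        # discriminator weights (hypothetical)

real = rng.standard_normal((32, 16))     # batch of "real" data
z = rng.standard_normal((32, 8))         # latent noise
fake = generator(z, w)

d_real = discriminator(real, v)
d_fake = discriminator(fake, v)

# Minimax objective: the discriminator maximizes it, the generator minimizes it.
d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))
g_loss = -np.mean(np.log(d_fake))        # the common non-saturating G loss
print(d_loss, g_loss)
```

In training, each loss would be backpropagated through its own network, alternating updates between the two players.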

In GAN-based inpainting methods, the generator acts as a context encoder and produces a plausible completion for the gap given only the available information surrounding it. [3] The discriminator is used to train the generator and tests the consistency of the produced inpainted audio. [3]

More recently, diffusion models have also established themselves as the state of the art of generative models in many fields, often beating even GAN-based solutions. For this reason, they have also been used to solve the audio inpainting problem, obtaining valid results. [2] These models generate new data instances by inverting a diffusion process in which data samples are progressively transformed into Gaussian noise. [2]
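The forward (noising) half of this process has a simple closed form, sketched below with an illustrative noise schedule; the learned reverse process, which actually performs the generation, is omitted:

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """Closed-form forward diffusion: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps,
    where alpha_bar shrinks toward 0 as the timestep t grows."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
x0 = np.sin(2 * np.pi * 0.01 * np.arange(1024))   # a clean "audio" signal

# With a decreasing alpha_bar schedule the signal drifts toward pure
# standard Gaussian noise, which is the starting point of the reverse process.
for alpha_bar in (0.99, 0.5, 0.01):
    xt = forward_diffuse(x0, alpha_bar, rng)
    print(alpha_bar, np.std(xt))
```

For inpainting, the reverse (denoising) iteration is additionally conditioned on the reliable samples outside the gap.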

One drawback of generative models is that they typically need a huge amount of training data. This is necessary for the network to generalize well and to produce coherent audio information that also presents some structural complexity. [6] Nonetheless, some works demonstrated that capturing the essence of an audio signal is possible using only a few tens of seconds from a single training sample. [6] [18] [19] This is done by overfitting a generative neural network to a single training audio signal. In this way, researchers were able to perform audio inpainting without exploiting large datasets. [6] [19]

Applications

Audio inpainting finds applications in a wide range of fields, including audio restoration and audio forensics, among others. In these fields, audio inpainting can be used to eliminate noise, glitches, or undesired distortions from an audio recording, thus enhancing its quality and intelligibility. It can also be employed to recover deteriorated old recordings that have been affected by local modifications or have missing audio samples due to scratches on CDs. [2]

Audio inpainting is also closely related to packet loss concealment (PLC). In the PLC problem, it is necessary to compensate for the loss of audio packets in communication networks. While both problems aim at filling missing gaps in an audio signal, PLC has stricter computation-time restrictions, and only the packets preceding a gap are considered reliable (the process is said to be causal). [20] [2]

References

  1. Mokrý, Ondřej; Rajmic, Pavel (2020). "Audio Inpainting: Revisited and Reweighted". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 28: 2906–2918. arXiv:2001.02480. doi:10.1109/TASLP.2020.3030486. S2CID 210064378.
  2. Moliner, Eloi (2024). "Diffusion-Based Audio Inpainting". Journal of the Audio Engineering Society. 72 (3): 100–113. arXiv:2305.15266. doi:10.17743/jaes.2022.0129.
  3. Marafioti, Andres; Majdak, Piotr; Holighaus, Nicki; Perraudin, Nathanael (January 2021). "GACELA: A Generative Adversarial Context Encoder for Long Audio Inpainting of Music". IEEE Journal of Selected Topics in Signal Processing. 15 (1): 120–131. arXiv:2005.05032. Bibcode:2021ISTSP..15..120M. doi:10.1109/JSTSP.2020.3037506. S2CID 218581410.
  4. Adler, Amir; Emiya, Valentin; Jafari, Maria G.; Elad, Michael; Gribonval, Rémi; Plumbley, Mark D. (March 2012). "Audio Inpainting". IEEE Transactions on Audio, Speech, and Language Processing. 20 (3): 922–932. doi:10.1109/TASL.2011.2168211. S2CID 11136245.
  5. Janssen, A.; Veldhuis, R.; Vries, L. (April 1986). "Adaptive interpolation of discrete-time signals that can be modeled as autoregressive processes". IEEE Transactions on Acoustics, Speech, and Signal Processing. 34 (2): 317–330. doi:10.1109/TASSP.1986.1164824. S2CID 17149340.
  6. Greshler, Gal; Shaham, Tamar; Michaeli, Tomer (2021). "Catch-A-Waveform: Learning to Generate Audio from a Single Short Example". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 20916–20928. arXiv:2106.06426.
  7. Applications of Digital Signal Processing to Audio and Acoustics (6th printing ed.). Boston, Mass.: Kluwer. 2003. pp. 133–194. ISBN 978-0-7923-8130-3.
  8. Perraudin, Nathanael; Holighaus, Nicki; Majdak, Piotr; Balazs, Peter (June 2018). "Inpainting of Long Audio Segments With Similarity Graphs". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 26 (6): 1083–1094. arXiv:1607.06667. doi:10.1109/TASLP.2018.2809864. S2CID 3532979.
  9. Marafioti, Andres; Perraudin, Nathanael; Holighaus, Nicki; Majdak, Piotr (December 2019). "A Context Encoder For Audio Inpainting". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 27 (12): 2362–2372. doi:10.1109/TASLP.2019.2947232. S2CID 53102801.
  10. Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (1 July 2020). "Deep Image Prior". International Journal of Computer Vision. 128 (7): 1867–1888. arXiv:1711.10925. doi:10.1007/s11263-020-01303-4. S2CID 4531078.
  11. Pezzoli, Mirco; Perini, Davide; Bernardini, Alberto; Borra, Federico; Antonacci, Fabio; Sarti, Augusto (January 2022). "Deep Prior Approach for Room Impulse Response Reconstruction". Sensors. 22 (7): 2710. Bibcode:2022Senso..22.2710P. doi:10.3390/s22072710. PMC 9003306. PMID 35408325.
  12. Kong, Fantong; Picetti, Francesco; Lipari, Vincenzo; Bestagini, Paolo; Tang, Xiaoming; Tubaro, Stefano (2022). "Deep Prior-Based Unsupervised Reconstruction of Irregularly Sampled Seismic Data". IEEE Geoscience and Remote Sensing Letters. 19: 1–5. Bibcode:2022IGRSL..1944455K. doi:10.1109/LGRS.2020.3044455. hdl:11311/1201461. S2CID 234970208.
  13. Etter, W. (May 1996). "Restoration of a discrete-time signal segment by interpolation based on the left-sided and right-sided autoregressive parameters". IEEE Transactions on Signal Processing. 44 (5): 1124–1135. Bibcode:1996ITSP...44.1124E. doi:10.1109/78.502326.
  14. O'Shaughnessy, D. (February 1988). "Linear predictive coding". IEEE Potentials. 7 (1): 29–32. doi:10.1109/45.1890. S2CID 12786562.
  15. Mokry, Ondrej; Zaviska, Pavel; Rajmic, Pavel; Vesely, Vitezslav (September 2019). "Introducing SPAIN (SParse Audio INpainter)". 2019 27th European Signal Processing Conference (EUSIPCO). pp. 1–5. arXiv:1810.13137. doi:10.23919/EUSIPCO.2019.8902560. ISBN 978-9-0827-9703-9. S2CID 53109833.
  16. Lagrange, Mathieu; Marchand, Sylvain; Rault, Jean-bernard (15 October 2005). "Long Interpolation of Audio Signals Using Linear Prediction in Sinusoidal Modeling". Journal of the Audio Engineering Society. 53 (10): 891–905.
  17. Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Nets". Advances in Neural Information Processing Systems. 27. Curran Associates, Inc.
  18. Tian, Yapeng; Xu, Chenliang; Li, Dingzeyu (2019). "Deep Audio Prior". arXiv:1912.10292 [cs.SD].
  19. Turetzky, Arnon; Michelson, Tzvi; Adi, Yossi; Peleg, Shmuel (18 September 2022). "Deep Audio Waveform Prior". Interspeech 2022: 2938–2942. arXiv:2207.10441. doi:10.21437/Interspeech.2022-10735. S2CID 250920681.
  20. Diener, Lorenz; Sootla, Sten; Branets, Solomiya; Saabas, Ando; Aichner, Robert; Cutler, Ross (18 September 2022). "INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge". Interspeech 2022. pp. 580–584. arXiv:2204.05222. doi:10.21437/Interspeech.2022-10829.