Audio inpainting

Audio inpainting (also known as audio interpolation) is an audio restoration task which deals with the reconstruction of missing or corrupted portions of a digital audio signal. [1] Inpainting techniques are employed when parts of the audio have been lost due to various factors such as transmission errors, data corruption or errors during recording. [2]

The goal of audio inpainting is to fill in the gaps (i.e., the missing portions) in the audio signal seamlessly, making the reconstructed portions indistinguishable from the original content and avoiding the introduction of audible distortions or alterations. [3]

Many techniques have been proposed to solve the audio inpainting problem. Most of them work by analyzing the temporal [1] [4] [5] and spectral [3] [2] information surrounding each missing portion of the audio signal.

Corrupted spectrogram (top) and its reconstruction after performing audio inpainting (bottom)

Classic methods employ statistical models or digital signal processing algorithms [1] [4] [5] to predict and synthesize the missing or damaged sections. Recent solutions, instead, take advantage of deep learning models, thanks to the growing trend of exploiting data-driven methods in the context of audio restoration. [3] [2] [6]

Depending on the extent of the lost information, the inpainting task can be divided into three categories. Short inpainting refers to the reconstruction of a few milliseconds (approximately less than 10) of missing signal, as occurs with short distortions such as clicks or clipping. [7] In this case, the goal of the reconstruction is to recover the lost information exactly. In long inpainting, instead, with gaps on the order of hundreds of milliseconds or even seconds, this goal becomes unrealistic, since restoration techniques cannot rely on local information. [8] Therefore, besides providing a coherent reconstruction, the algorithms need to generate new information that is semantically compatible with the surrounding context (i.e., the audio signal surrounding the gaps). [3] The case of medium-duration gaps lies between short and long inpainting. It refers to the reconstruction of tens of milliseconds of missing data, a scale at which the non-stationary character of audio already becomes important. [9]

Definition

Consider a digital audio signal $\mathbf{x}$. A corrupted version of $\mathbf{x}$, which is the audio signal presenting the gaps to be reconstructed, can be defined as $\tilde{\mathbf{x}} = \mathbf{m} \circ \mathbf{x}$, where $\mathbf{m}$ is a binary mask encoding the reliable or missing samples of $\mathbf{x}$, and $\circ$ represents the element-wise product. [2] Audio inpainting aims at finding $\hat{\mathbf{x}}$ (i.e., the reconstruction), which is an estimate of $\mathbf{x}$. This is an ill-posed inverse problem, which is characterized by a non-unique set of solutions. [2] For this reason, similarly to the formulation used for the inpainting problem in other domains, [10] [11] [12] the reconstructed audio signal can be found through an optimization problem that is formally expressed as

$$\hat{\mathbf{x}}^{*} = \underset{\hat{\mathbf{x}}}{\arg\min} \; L(\mathbf{m} \circ \hat{\mathbf{x}}, \tilde{\mathbf{x}}) + R(\hat{\mathbf{x}}).$$

In particular, $\hat{\mathbf{x}}^{*}$ is the optimal reconstructed audio signal and $L$ is a distance measure term that computes the reconstruction accuracy between the corrupted audio signal and the estimated one. [10] For example, this term can be expressed with a mean squared error or similar metrics.

Since $L$ is computed only on the reliable frames, there are many solutions that can minimize $L(\mathbf{m} \circ \hat{\mathbf{x}}, \tilde{\mathbf{x}})$. It is thus necessary to add a constraint to the minimization, in order to restrict the results only to the valid solutions. [12] [11] This is expressed through the regularization term $R$ that is computed on the reconstructed audio signal $\hat{\mathbf{x}}$. This term encodes some kind of a-priori information on the audio data. For example, $R$ can express assumptions on the stationarity of the signal, on the sparsity of its representation or can be learned from data. [12] [11]
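The role of the two terms can be illustrated with a minimal NumPy sketch. It assumes a mean squared error on the reliable samples as the distance term and, purely for illustration, the energy of the first difference as the regularization term. The function names, the weight `lam` and the toy signal are hypothetical and do not come from the cited works.

```python
import numpy as np

def corrupt(x, m):
    """Apply the binary mask m (1 = reliable, 0 = missing) to the signal x."""
    return m * x

def data_term(x_hat, x_tilde, m):
    """Distance term: mean squared error on the reliable samples only."""
    reliable = m.astype(bool)
    return np.mean((x_hat[reliable] - x_tilde[reliable]) ** 2)

def smoothness_term(x_hat):
    """Illustrative regularizer (an assumption): energy of the first difference."""
    return np.sum(np.diff(x_hat) ** 2)

def objective(x_hat, x_tilde, m, lam=0.1):
    """Distance term plus weighted regularization term."""
    return data_term(x_hat, x_tilde, m) + lam * smoothness_term(x_hat)

# Toy signal with a 20-sample gap.
t = np.arange(200)
x = np.sin(2 * np.pi * t / 50)
m = np.ones_like(x)
m[90:110] = 0
x_tilde = corrupt(x, m)

# Candidate 1: leave zeros in the gap.
zero_fill = x_tilde.copy()

# Candidate 2: linear interpolation across the gap.
missing = m == 0
linear_fill = x_tilde.copy()
linear_fill[missing] = np.interp(t[missing], t[~missing], x[~missing])

print(data_term(zero_fill, x_tilde, m), data_term(linear_fill, x_tilde, m))
print(objective(zero_fill, x_tilde, m), objective(linear_fill, x_tilde, m))
```

Both candidates agree with the observation on every reliable sample, so the distance term alone cannot tell them apart. Only the regularization term expresses a preference between them, which is why some prior is needed to make the problem well-posed.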

Techniques

There exist various techniques to perform audio inpainting. These can vary significantly, influenced by factors such as the specific application requirements, the length of the gaps and the available data. [3] In the literature, these techniques are broadly divided into model-based techniques (sometimes also referred to as signal processing techniques) [3] and data-driven techniques. [2]

Model-based techniques

Model-based techniques involve the exploitation of mathematical models or assumptions about the underlying structure of the audio signal. These models can be based on prior knowledge of the audio content or statistical properties observed in the data. By leveraging these models, missing or corrupted portions of the audio signal can be inferred or estimated. [1]

Autoregressive models are one example of model-based techniques. [5] [13] These methods interpolate or extrapolate the missing samples based on the neighboring values, by using mathematical functions to approximate the missing data. In particular, in autoregressive models the missing samples are completed through linear prediction. [14] The autoregressive coefficients necessary for this prediction are learned from the surrounding audio data, specifically from the data adjacent to each gap. [5] [13]
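The idea of linear prediction can be sketched in a few lines of NumPy. This is a simplified illustration, not the algorithm of the cited works: it fits the coefficients by ordinary least squares on the samples preceding the gap and extrapolates forward only, whereas the cited methods also use the samples following the gap. The function names and the toy signal are hypothetical.

```python
import numpy as np

def fit_ar(context, order):
    """Least-squares AR coefficients a with x[n] ~ sum_k a[k] * x[n-1-k]."""
    rows = [context[n - order:n][::-1] for n in range(order, len(context))]
    A = np.array(rows)
    b = context[order:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

def extrapolate(context, a, n_missing):
    """Predict n_missing samples after the context by recursive linear prediction."""
    order = len(a)
    history = np.concatenate([context, np.zeros(n_missing)])
    for n in range(len(context), len(history)):
        # Most recent sample first, matching the coefficient ordering in fit_ar.
        history[n] = a @ history[n - order:n][::-1]
    return history[len(context):]

# Toy example: a pure sinusoid obeys an exact second-order recursion.
n = np.arange(300)
x = np.sin(2 * np.pi * 0.03 * n)
gap_start, gap_len = 200, 20
context = x[:gap_start]

a = fit_ar(context, order=2)
predicted = extrapolate(context, a, gap_len)
print(np.max(np.abs(predicted - x[gap_start:gap_start + gap_len])))
```

The prediction is essentially exact here because the toy signal is perfectly stationary. On real audio the coefficients describe the signal only locally, so the prediction error grows with the length of the gap.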

Some more recent techniques approach audio inpainting by representing audio signals as sparse linear combinations of a limited number of basis functions (for example, those of the short-time Fourier transform). [1] [15] In this context, the aim is to find the sparse representation of the missing section of the signal that most accurately matches the surrounding, unaffected signal. [1]
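A minimal sketch of the sparsity idea is given below. It is a deliberately simplified stand-in, not one of the cited algorithms: it uses a single global discrete Fourier transform instead of a short-time transform, assumes the number of active coefficients `n_atoms` is known, and alternates between hard thresholding in the frequency domain and re-imposing the reliable samples. All names and the toy signal are hypothetical.

```python
import numpy as np

def sparse_inpaint(x_tilde, m, n_atoms, n_iter=100):
    """Fill the gap by alternating a sparsity step and a consistency step."""
    reliable = m.astype(bool)
    x_hat = x_tilde.copy()
    for _ in range(n_iter):
        # Sparsity step: keep only the n_atoms largest DFT coefficients.
        X = np.fft.fft(x_hat)
        keep = np.argsort(np.abs(X))[-n_atoms:]
        X_sparse = np.zeros_like(X)
        X_sparse[keep] = X[keep]
        x_sparse = np.fft.ifft(X_sparse).real
        # Consistency step: reliable samples are copied from the observation.
        x_hat = np.where(reliable, x_tilde, x_sparse)
    return x_hat

# Toy signal: two sinusoids lying exactly on DFT bins (4 non-zero coefficients).
N = 256
n = np.arange(N)
x = np.cos(2 * np.pi * 5 * n / N) + 0.8 * np.sin(2 * np.pi * 12 * n / N)
m = np.ones(N)
m[100:120] = 0
x_tilde = m * x

x_hat = sparse_inpaint(x_tilde, m, n_atoms=4)
print(np.max(np.abs(x_hat - x)))
```

The toy signal is exactly sparse in the chosen basis, which is why the gap is recovered almost perfectly. Real audio is only approximately sparse, and only within short frames, which is one reason practical methods work on time-frequency representations.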

The aforementioned methods perform best when filling relatively short gaps, lasting only a few tens of milliseconds, and thus they can be included in the context of short inpainting. However, these signal-processing techniques tend to struggle when dealing with longer gaps. [2] The reason behind this limitation lies in the violation of the stationarity condition: the signal often changes significantly across the gap, so that the signal following the gap is substantially different from the signal preceding it. [2]

To overcome these limitations, some approaches also make strong assumptions about the fundamental structure of the gap itself, exploiting sinusoidal modeling [16] or similarity graphs [8] to perform inpainting of longer missing portions of audio signals.

Data-driven techniques

Data-driven techniques rely on the analysis and exploitation of the available audio data. These techniques often employ deep learning algorithms that learn patterns and relationships directly from the provided data. They involve training models on large datasets of audio examples, allowing them to capture the statistical regularities present in the audio signals. Once trained, these models can be used to generate missing portions of the audio signal based on the learned representations, without being restricted by stationarity assumptions. [3] Data-driven techniques also offer the advantage of adaptability and flexibility, as they can learn from diverse audio datasets and potentially handle complex inpainting scenarios. [3]

Such techniques currently constitute the state of the art in audio inpainting, being able to reconstruct gaps of hundreds of milliseconds or even seconds. This performance is made possible by the use of generative models, which are capable of generating novel content to fill in the missing portions. For example, generative adversarial networks, which are the state of the art among generative models in many areas, rely on two competing neural networks trained simultaneously in a two-player minimax game: the generator produces new data from samples of a random variable, while the discriminator attempts to distinguish between generated and real data. [17] During training, the generator's objective is to fool the discriminator, while the discriminator attempts to learn to better classify real and fake data. [17]
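For reference, the two-player game can be written as the value function given in the original formulation of generative adversarial networks. [17] Here $D(\mathbf{x})$ is the discriminator's estimate of the probability that $\mathbf{x}$ is real, $G(\mathbf{z})$ is the generator output for a noise sample $\mathbf{z}$, and $\mathbf{x}$ denotes a generic real data sample rather than the audio signal of the Definition section.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\big[\log D(\mathbf{x})\big]
  + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big]
```

The discriminator maximizes $V$ by assigning high probability to real data and low probability to generated data, while the generator minimizes it by producing samples that the discriminator classifies as real.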

In GAN-based inpainting methods, the generator acts as a context encoder and produces a plausible completion for the gap given only the available information surrounding it. [3] The discriminator is used to train the generator and tests the consistency of the produced inpainted audio. [3]

More recently, diffusion models have also established themselves as the state of the art among generative models in many fields, often outperforming even GAN-based solutions. For this reason they have also been used to solve the audio inpainting problem, with good results. [2] These models generate new data instances by inverting the diffusion process, in which data samples are progressively transformed into Gaussian noise. [2]
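The forward (noising) direction of the diffusion process can be sketched as follows. The sketch assumes a common discrete-time, variance-preserving parameterization with a linear noise schedule; the cited work may use a different formulation, and the schedule values, function name and toy signal are illustrative assumptions. The learned reverse process, which is the part that actually generates audio, is not shown.

```python
import numpy as np

def forward_diffusion(x0, t, alpha_bar, rng):
    """Sample the noised signal at step t directly from the clean signal x0."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# Illustrative linear noise schedule (an assumption, not taken from the source).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal power left at each step

rng = np.random.default_rng(0)
x0 = np.sin(2 * np.pi * np.arange(16000) / 100)  # toy "audio" signal

x_early = forward_diffusion(x0, 10, alpha_bar, rng)    # still close to x0
x_late = forward_diffusion(x0, T - 1, alpha_bar, rng)  # close to pure Gaussian noise

print(np.corrcoef(x0, x_early)[0, 1], np.corrcoef(x0, x_late)[0, 1])
```

After the last step almost none of the original signal survives, so generation can start from pure noise and apply the learned reverse steps. For inpainting, the reliable samples additionally guide these reverse steps so that the result agrees with the known context.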

One drawback of generative models is that they typically need a huge amount of training data. This is necessary for the network to generalize well and to produce coherent audio that also exhibits some structural complexity. [6] Nonetheless, some works have demonstrated that capturing the essence of an audio signal is also possible using only a few tens of seconds from a single training sample. [6] [18] [19] This is done by overfitting a generative neural network to a single training audio signal. In this way, researchers were able to perform audio inpainting without exploiting large datasets. [6] [19]

Applications

Audio inpainting finds applications in a wide range of fields, including audio restoration and audio forensics, among others. In these fields, audio inpainting can be used to eliminate noise, glitches, or undesired distortions from an audio recording, thus enhancing its quality and intelligibility. It can also be employed to recover deteriorated old recordings that have been affected by local modifications or have missing audio samples due to scratches on CDs. [2]

Audio inpainting is also closely related to packet loss concealment (PLC). In the PLC problem, it is necessary to compensate for the loss of audio packets in communication networks. While both problems aim at filling gaps in an audio signal, PLC is subject to stricter computation-time constraints, and only the packets preceding a gap are considered to be reliable (the process is said to be causal). [20] [2]

References

  1. Mokrý, Ondřej; Rajmic, Pavel (2020). "Audio Inpainting: Revisited and Reweighted". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 28: 2906–2918. arXiv:2001.02480. doi:10.1109/TASLP.2020.3030486. S2CID 210064378.
  2. Moliner, Eloi (2023). "Diffusion-Based Audio Inpainting". arXiv:2305.15266 [eess.AS].
  3. Marafioti, Andres; Majdak, Piotr; Holighaus, Nicki; Perraudin, Nathanael (January 2021). "GACELA: A Generative Adversarial Context Encoder for Long Audio Inpainting of Music". IEEE Journal of Selected Topics in Signal Processing. 15 (1): 120–131. arXiv:2005.05032. Bibcode:2021ISTSP..15..120M. doi:10.1109/JSTSP.2020.3037506. S2CID 218581410.
  4. Adler, Amir; Emiya, Valentin; Jafari, Maria G.; Elad, Michael; Gribonval, Rémi; Plumbley, Mark D. (March 2012). "Audio Inpainting". IEEE Transactions on Audio, Speech, and Language Processing. 20 (3): 922–932. doi:10.1109/TASL.2011.2168211. S2CID 11136245.
  5. Janssen, A.; Veldhuis, R.; Vries, L. (April 1986). "Adaptive interpolation of discrete-time signals that can be modeled as autoregressive processes" (PDF). IEEE Transactions on Acoustics, Speech, and Signal Processing. 34 (2): 317–330. doi:10.1109/TASSP.1986.1164824.
  6. Greshler, Gal; Shaham, Tamar; Michaeli, Tomer (2021). "Catch-A-Waveform: Learning to Generate Audio from a Single Short Example". Advances in Neural Information Processing Systems. Curran Associates, Inc. 34: 20916–20928. arXiv:2106.06426.
  7. Applications of digital signal processing to audio and acoustics (6. Pr ed.). Boston, Mass.: Kluwer. 2003. pp. 133–194. ISBN 978-0-7923-8130-3.
  8. Perraudin, Nathanael; Holighaus, Nicki; Majdak, Piotr; Balazs, Peter (June 2018). "Inpainting of Long Audio Segments With Similarity Graphs". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 26 (6): 1083–1094. arXiv:1607.06667. doi:10.1109/TASLP.2018.2809864. S2CID 3532979.
  9. Marafioti, Andres; Perraudin, Nathanael; Holighaus, Nicki; Majdak, Piotr (December 2019). "A Context Encoder For Audio Inpainting". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 27 (12): 2362–2372. doi:10.1109/TASLP.2019.2947232. S2CID 53102801.
  10. Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (1 July 2020). "Deep Image Prior". International Journal of Computer Vision. 128 (7): 1867–1888. arXiv:1711.10925. doi:10.1007/s11263-020-01303-4. S2CID 4531078.
  11. Pezzoli, Mirco; Perini, Davide; Bernardini, Alberto; Borra, Federico; Antonacci, Fabio; Sarti, Augusto (January 2022). "Deep Prior Approach for Room Impulse Response Reconstruction". Sensors. 22 (7): 2710. Bibcode:2022Senso..22.2710P. doi:10.3390/s22072710. PMC 9003306. PMID 35408325.
  12. Kong, Fantong; Picetti, Francesco; Lipari, Vincenzo; Bestagini, Paolo; Tang, Xiaoming; Tubaro, Stefano (2022). "Deep Prior-Based Unsupervised Reconstruction of Irregularly Sampled Seismic Data". IEEE Geoscience and Remote Sensing Letters. 19: 1–5. Bibcode:2022IGRSL..1944455K. doi:10.1109/LGRS.2020.3044455. hdl:11311/1201461. S2CID 234970208.
  13. Etter, W. (May 1996). "Restoration of a discrete-time signal segment by interpolation based on the left-sided and right-sided autoregressive parameters". IEEE Transactions on Signal Processing. 44 (5): 1124–1135. Bibcode:1996ITSP...44.1124E. doi:10.1109/78.502326.
  14. O'Shaughnessy, D. (February 1988). "Linear predictive coding". IEEE Potentials. 7 (1): 29–32. doi:10.1109/45.1890. S2CID 12786562.
  15. Mokry, Ondrej; Zaviska, Pavel; Rajmic, Pavel; Vesely, Vitezslav (September 2019). "Introducing SPAIN (SParse Audio INpainter)". European Signal Processing Conference (EUSIPCO): 1–5. arXiv:1810.13137. doi:10.23919/EUSIPCO.2019.8902560. ISBN 978-9-0827-9703-9. S2CID 53109833.
  16. Lagrange, Mathieu; Marchand, Sylvain; Rault, Jean-bernard (15 October 2005). "Long Interpolation of Audio Signals Using Linear Prediction in Sinusoidal Modeling". Journal of the Audio Engineering Society. 53 (10): 891–905.
  17. Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Nets. Vol. 27. Curran Associates, Inc.
  18. Tian, Yapeng; Xu, Chenliang; Li, Dingzeyu (2019). "Deep Audio Prior". arXiv:1912.10292 [cs.SD].
  19. Turetzky, Arnon; Michelson, Tzvi; Adi, Yossi; Peleg, Shmuel (18 September 2022). "Deep Audio Waveform Prior". Interspeech 2022: 2938–2942. arXiv:2207.10441. doi:10.21437/Interspeech.2022-10735. S2CID 250920681.
  20. Diener, Lorenz; Sootla, Sten; Branets, Solomiya; Saabas, Ando; Aichner, Robert; Cutler, Ross (18 September 2022). "INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge". Interspeech 2022. pp. 580–584. arXiv:2204.05222. doi:10.21437/Interspeech.2022-10829.