Silence compression

Silence compression is an audio processing technique used to encode silent intervals efficiently, reducing the storage or bandwidth needed to transmit audio recordings.

Overview

Silence can be defined as audio segments with negligible sound. Examples of silence are pauses between words or sentences in speech and pauses between notes in music. By compressing the silent intervals, the audio files become smaller and easier to handle, store, and send while still retaining the original sound quality. While techniques vary, silence compression is generally achieved through two crucial steps: detection of the silent intervals and the subsequent compression of those intervals. Applications of silence compression include telecommunications, audio streaming, voice recognition, audio archiving, and media production. [1]

Techniques

1. Trimming

Trimming is a method of silence compression in which the silent intervals are removed altogether. This is done by identifying audio intervals below a certain amplitude threshold, indicating silence, and removing that interval from the audio. A drawback of trimming is that it permanently changes the original audio and can cause noticeable artifacts when the audio is played back. [1]

a. Amplitude Threshold Trimming

Amplitude threshold trimming removes silence by setting an amplitude threshold: any audio segments that fall below this threshold are considered silent and are truncated or removed entirely. Some common amplitude threshold trimming algorithms are:

  • Fixed Threshold: In a fixed threshold approach, a static amplitude level is selected, and any audio segments that fall below this level are removed. A drawback to this approach is that choosing an appropriate fixed threshold can be difficult, because recording conditions and audio sources vary.
  • Dynamic Threshold: In a dynamic threshold approach, an algorithm adjusts the threshold dynamically based on the audio's characteristics, for example by setting the threshold to a fraction of the average amplitude within a given window. This approach adapts better to varying audio sources but requires more processing complexity.
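The two approaches above can be sketched in a few lines of Python. This is a minimal illustration, not a production algorithm: the function names, the per-window mean-amplitude rule, and the parameter values are our own choices.

```python
def trim_fixed(samples, threshold):
    """Fixed threshold: drop every sample whose absolute amplitude
    falls below a single static threshold."""
    return [s for s in samples if abs(s) >= threshold]

def trim_dynamic(samples, window=4, fraction=0.5):
    """Dynamic threshold: per window, set the threshold to a fraction
    of that window's mean absolute amplitude, then drop samples below it."""
    out = []
    for i in range(0, len(samples), window):
        block = samples[i:i + window]
        threshold = fraction * sum(abs(s) for s in block) / len(block)
        out.extend(s for s in block if abs(s) >= threshold)
    return out
```

In practice, real implementations trim whole low-amplitude regions (with hangover around word boundaries) rather than individual samples, which avoids the audible artifacts mentioned above.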

b. Energy-Based Trimming

Energy-based trimming works by analyzing an audio signal's energy levels. The energy level of an audio signal is the magnitude of the signal over a short time interval. A common formula for the audio's energy is E = Σ_{n=1}^{N} x[n]^2, where E is the energy of the signal, N is the number of samples in the audio signal, and x[n] is the amplitude of the nth sample. Once the energy levels are calculated, a threshold is set, and all intervals whose energy falls below the threshold are considered silent and removed. Energy-based trimming can detect silence more accurately than amplitude-based trimming because it considers the overall power of the audio rather than just the instantaneous amplitude of the sound wave. Energy-based trimming is often used for voice/speech files, where only the relevant portions that contain sound need to be stored and transmitted. Some popular energy-based trimming algorithms include the Short-Time Energy (STE) and Zero Crossing Rate (ZCR) methods. [2] Those algorithms are also used in voice activity detection (VAD) to detect speech activity. [1] [3]
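The energy formula and the frame-wise trimming step can be sketched as follows. This is an illustrative sketch of the short-time energy idea, with hypothetical function names and an unspecified threshold left to the caller; a full STE/ZCR detector would also apply windowing and hangover logic.

```python
def short_time_energy(samples, frame_size):
    """Per-frame energy: E = sum of squared sample amplitudes x[n]^2."""
    return [sum(s * s for s in samples[i:i + frame_size])
            for i in range(0, len(samples), frame_size)]

def trim_energy(samples, frame_size, threshold):
    """Keep only the frames whose short-time energy meets the threshold;
    frames below it are treated as silence and removed."""
    out = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        if sum(s * s for s in frame) >= threshold:
            out.extend(frame)
    return out
```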

2. Silence Suppression

Silence suppression is a technique used within the context of Voice over IP (VoIP) and audio streaming to optimize the rate of data transfer. Through the temporary reduction of data sent during silent intervals, audio can be broadcast over the internet in real time more efficiently. [1] [3]

a. Discontinuous Transmission (DTX)

DTX optimizes bandwidth usage during real-time telecommunications by detecting silent intervals and suspending their transmission. By continuously monitoring the audio signal, DTX algorithms can detect silence based on predefined criteria. When silence is detected, the transmitter signals the receiver and stops sending audio data; when speech or sound resumes, audio transmission is reactivated. This technique allows for uninterrupted communication while making highly efficient use of network resources. [1] [3]
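The per-frame decision at the heart of DTX can be sketched as below. This is a simplified illustration, not a standardized codec: the energy criterion, the function name, and the "SID" marker (standing in for the silence insertion descriptor that schemes such as G.729 Annex B send so the receiver can generate comfort noise) are our own simplifications.

```python
def dtx_stream(frames, energy_threshold):
    """For each audio frame, decide whether to transmit it ("SPEECH")
    or suppress it and emit a silence marker ("SID") instead."""
    for frame in frames:
        energy = sum(s * s for s in frame)
        if energy >= energy_threshold:
            yield ("SPEECH", frame)  # active audio: transmit the frame
        else:
            yield ("SID", None)      # silence: suppress the frame's payload
```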

3. Silence Encoding

Silence Encoding is essential for the efficient representation of silent intervals without the removal of silence altogether. This allows for the minimization of data needed to encode and transmit silence while upholding the audio signal's integrity. [4] [5] [6] There are several encoding methods used for this purpose:

a. Run-Length Encoding (RLE)

RLE works to detect repeating identical samples in the audio and encodes those samples in a way that is more space-efficient. Rather than storing each identical sample individually, RLE stores a single sample and keeps count of how many times it repeats. RLE works well in encoding silence as silent intervals often consist of repeated sequences of identical samples. The reduction of identical samples stored subsequently reduces the size of the audio signal. [4] [5]
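A minimal sketch of RLE applied to a sample stream follows; the function names are illustrative, and real audio RLE variants typically operate on quantized values or on runs below a silence threshold rather than on exact equality.

```python
def rle_encode(samples):
    """Collapse each run of identical samples into a (value, count) pair;
    a long silent run of zeros becomes a single pair."""
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(value, count) for value, count in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sample stream."""
    return [value for value, count in runs for _ in range(count)]
```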

b. Huffman Coding

Huffman coding is an entropy encoding method and variable-length code algorithm that assigns shorter binary codes, which require fewer bits to store, to more common values. In the context of silence compression, Huffman coding assigns shorter binary codes to frequently occurring silence patterns, reducing data size. [5] [6]
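The following sketch builds a Huffman code over a stream of symbols (which, for silence compression, could be quantized samples or silence-pattern tokens). It is a textbook construction for illustration only; the function name and heap representation are our own.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table: the more frequent a symbol,
    the shorter its binary code."""
    freq = Counter(symbols)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        # Merge them: prefix one side's codes with "0", the other's with "1".
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]
```

For a stream dominated by a silence token, that token receives a one-bit code while rarer values get longer codes, which is exactly the size reduction described above.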

4. Differential Encoding

Differential encoding makes use of the similarity between consecutive audio samples during silent intervals by storing only the difference between samples. Differential encoding is used to efficiently encode the transitions between sound and silence and is useful for audio samples where silence is interspersed with active sound. [7] [8] [9] Some differential encoding algorithms include:

a. Delta Modulation

Delta modulation quantizes and encodes the differences between consecutive audio samples, in effect encoding the derivative of the signal's amplitude. By storing how the audio signal changes over time rather than the samples themselves, the transition from silence to sound can be captured efficiently. Delta modulation typically uses a one-bit quantization mechanism, where a 1 indicates an increase in amplitude and a 0 indicates a decrease. While this allows for efficient use of bandwidth or storage, it cannot provide high-fidelity encoding of low-amplitude signals. [8]
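A one-bit delta modulator can be sketched as below. This is a minimal illustration with a fixed step size of our choosing; practical designs use adaptive step sizes to reduce the slope-overload and granular noise that cause the low-amplitude fidelity problem noted above.

```python
def delta_modulate(samples, step):
    """Emit one bit per sample: 1 steps the running estimate up by
    `step`, 0 steps it down, tracking the input signal."""
    bits, estimate = [], 0.0
    for s in samples:
        bit = 1 if s >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits

def delta_demodulate(bits, step):
    """Reconstruct the signal by integrating the up/down steps."""
    out, estimate = [], 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out
```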

b. Delta-Sigma Modulation

Delta-sigma modulation is a more advanced variant of delta modulation that allows high-fidelity encoding of low-amplitude signals. It quantizes at a high oversampling rate, allowing slight changes in the audio signal to be encoded precisely. Delta-sigma modulation is used in situations where maintaining high audio fidelity is the priority. [9]

Applications

The reduction of audio size from silence compression has uses in numerous applications:

  1. Telecommunications: The reduction of silent transmissions in telecommunication systems such as VoIP allows for more efficient bandwidth use and reduced data costs.
  2. Audio Streaming: Silence compression minimizes data usage during audio streaming, allowing high-quality audio to be broadcast efficiently over the internet.
  3. Audio Archiving: Silence compression helps to conserve the space needed to store audio while maintaining audio fidelity.


References

  1. Benyassine, A.; Shlomot, E.; Su, H.-Y.; Massaloux, D.; Lamblin, C.; Petit, J.-P. (1997). "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications". IEEE Communications Magazine. 35 (9): 64–73. doi:10.1109/35.620527. Retrieved 2023-11-09.
  2. Sahin, Arda; Unlu, Mehmet Zubeyir (2021-01-20). "Speech file compression by eliminating unvoiced/silence components". Sustainable Engineering and Innovation. 3 (1): 11–14. doi:10.37868/sei.v3i1.119. ISSN 2712-0562.
  3. "On the ITU-T G.729.1 silence compression scheme". IEEE Xplore. Retrieved 2023-11-09.
  4. Elsayed, Hend A. (2014). "Burrows-Wheeler Transform and combination of Move-to-Front coding and Run Length Encoding for lossless audio coding". 2014 9th International Conference on Computer Engineering & Systems (ICCES). pp. 354–359. doi:10.1109/ICCES.2014.7030985. ISBN 978-1-4799-6594-6. Retrieved 2023-11-09.
  5. Patil, Rupali B.; Kulat, K. D. (2017). "Audio compression using dynamic Huffman and RLE coding". 2017 2nd International Conference on Communication and Electronics Systems (ICCES). pp. 160–162. doi:10.1109/CESYS.2017.8321256. ISBN 978-1-5090-5013-0. Retrieved 2023-11-09.
  6. Firmansah, Luthfi; Setiawan, Erwin Budi (2016). "Data audio compression lossless FLAC format to lossy audio MP3 format with Huffman Shift Coding algorithm". 2016 4th International Conference on Information and Communication Technology (ICoICT). pp. 1–5. doi:10.1109/ICoICT.2016.7571951. ISBN 978-1-4673-9879-4. Retrieved 2023-11-09.
  7. Jensen, J.; Heusdens, R. (2003). "A comparison of differential schemes for low-rate sinusoidal audio coding". 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. pp. 205–208. doi:10.1109/ASPAA.2003.1285867. ISBN 0-7803-7850-4. Retrieved 2023-11-09.
  8. Zhu, Y.S.; Leung, S.W.; Wong, C.M. (1996). "A digital audio processing system based on nonuniform sampling delta modulation". IEEE Transactions on Consumer Electronics. 42 (1): 80–86. doi:10.1109/30.485464. Retrieved 2023-11-09.
  9. "Sigma-delta modulation for audio DSP". IEEE Xplore. Retrieved 2023-11-09.