Mel-frequency cepstrum

Last updated November 14, 2023

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC.^[1] They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal spectrum. This frequency warping can allow for better representation of sound, for example, in audio compression that might potentially reduce the transmission bandwidth and the storage requirements of audio signals.

MFCCs are commonly derived as follows:^[2]^[3]

Take the Fourier transform of (a windowed excerpt of) a signal.
Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows or alternatively, cosine overlapping windows.
Take the logs of the powers at each of the mel frequencies.
Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
The MFCCs are the amplitudes of the resulting spectrum.
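
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the Hamming window, the filter count (26), the coefficient count (13) and the small floor added before the logarithm are all assumptions, and the mel mapping used is the $1000\log_2(1+f/1000)$ form given later in this article.

```python
import numpy as np

def hz_to_mel(f):
    # Mel mapping assumed here: 1000 * log2(1 + f / 1000).
    return 1000.0 * np.log2(1.0 + np.asarray(f, dtype=float) / 1000.0)

def mel_to_hz(m):
    # Inverse of the mapping above.
    return 1000.0 * (2.0 ** (np.asarray(m, dtype=float) / 1000.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular overlapping filters, equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_filters + 2 edges: each triangle spans three consecutive edges.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    # Step 1: Fourier transform of a windowed excerpt of the signal.
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Step 2: map the powers onto the mel scale with triangular windows.
    mel_power = mel_filterbank(n_filters, len(frame), sample_rate) @ power
    # Step 3: take logs (a small floor avoids log(0); an assumption).
    log_mel = np.log(mel_power + 1e-10)
    # Step 4: discrete cosine transform (unnormalised DCT-II) of the log powers.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    # Step 5: the MFCCs are the amplitudes of the resulting spectrum.
    return basis @ log_mel

sample_rate = 16000
t = np.arange(400) / sample_rate            # one 25 ms frame
coeffs = mfcc_frame(np.sin(2 * np.pi * 440.0 * t), sample_rate)
```

Here `coeffs` holds 13 values for the single frame; a full front end would apply the same function to every frame of the signal.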

There can be variations on this process, for example: differences in the shape or spacing of the windows used to map the scale,^[4] or addition of dynamics features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients.^[5]
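
As an illustration of the dynamics features, the sketch below computes "delta" and "delta-delta" coefficients as plain frame-to-frame differences. The function name and the choice to repeat the first frame (so that the output keeps the input shape) are assumptions; practical systems often use a regression over several neighbouring frames instead.

```python
import numpy as np

def frame_difference(features):
    """First-order frame-to-frame difference along the time axis.

    features has shape (n_frames, n_coeffs). The first frame is repeated
    so that the output has the same shape as the input.
    """
    features = np.asarray(features, dtype=float)
    padded = np.vstack([features[:1], features])
    return np.diff(padded, axis=0)

# Toy MFCC matrix: 3 frames, 2 coefficients per frame.
mfccs = np.array([[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]])
delta = frame_difference(mfccs)          # first-order differences
delta_delta = frame_difference(delta)    # second-order differences
stacked = np.hstack([mfccs, delta, delta_delta])
```

The stacked array appends the two sets of difference coefficients to each frame's static coefficients.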

The European Telecommunications Standards Institute in the early 2000s defined a standardised MFCC algorithm to be used in mobile phones.^[6]

Applications

MFCCs are commonly used as features in speech recognition systems,^[7] such as systems that automatically recognize numbers spoken into a telephone.

MFCCs are also increasingly finding uses in music information retrieval applications such as genre classification, audio similarity measures, etc.^[8]

MFCC for speaker recognition

Because the frequency bands of the MFC are evenly spaced on the mel scale, which approximates the response of the human auditory system, MFCCs can be used efficiently to characterize speakers. For instance, they can be used to recognize the model of the speaker's cell phone and, further, details of the speaker.^[4]

This type of mobile device recognition is possible because the electronic components in a phone are manufactured with tolerances, so different realizations of the same electronic circuit do not have exactly the same transfer function. The dissimilarities in the transfer function from one realization to another become more prominent if the circuits performing the task come from different manufacturers. Hence, each cell phone introduces a convolutional distortion on the input speech that leaves its unique impact on the recordings from that phone. In the frequency domain, the recorded spectrum is therefore the original spectrum multiplied by the transfer functions specific to the phone, so a particular phone can be identified from the recorded speech using signal processing techniques. Thus, by using MFCCs one can characterize cell phone recordings to identify the brand and model of the phone.^[5]

Consider the recording section of a cell phone as a linear time-invariant (LTI) filter with impulse response $h(n)$, where the recorded speech signal $y(n)$ is the output of the filter in response to the input $x(n)$.

Hence, $y(n)=x(n)*h(n)$ (convolution)

As speech is not a stationary signal, it is divided into overlapping frames within which the signal is assumed to be stationary. So, the $p^{th}$ short-term segment (frame) of the recorded input speech is:

$y_{p}w(n)=[x(n)w(pW-n)]*h(n)$,

where $w(n)$ is a window function of length $W$.
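
The framing step can be illustrated with a short sketch that cuts a signal into overlapping, windowed frames. The function name, the Hamming window and the hop size (half a frame) are assumptions made for the example; the filtering by $h(n)$ is not modelled here.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames and apply a window to each frame.

    Assumes len(x) >= frame_len; trailing samples that do not fill a
    whole frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    starts = hop * np.arange(n_frames)
    return np.stack([x[s:s + frame_len] * window for s in starts])

# At a 16 kHz sampling rate, 320 samples correspond to a 20 ms frame.
signal = np.ones(1000)
frames = frame_signal(signal, frame_len=320, hop=160)
```

With these values the 1000-sample signal yields five frames of 320 samples, each overlapping its neighbour by half.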

Hence, as stated above, the footprint that the mobile phone leaves on the recorded speech is the convolutional distortion, which helps to identify the recording phone.

The embedded identity of the cell phone requires conversion to a more identifiable form; hence, taking the short-time Fourier transform:

$Y_{p}w(f)=X_{p}w(f)H(f)$

$H(f)$ can be considered as a concatenated transfer function that produced the input speech, and the recorded speech $Y_{p}w(f)$ can be perceived as the original speech from the cell phone.

So, the equivalent transfer function of the vocal tract and the cell phone recorder is considered as the original source of the recorded speech. Therefore,

$X_{p}w(f)=Xe_{p}w(f)X_{v}(f),\quad H'(f)=H(f)X_{v}(f)$,

where $Xe_{p}w(f)$ is the excitation function, $X_{v}(f)$ is the vocal tract transfer function for speech in the $p^{th}$ frame and $H'(f)$ is the equivalent transfer function that characterizes the cell phone.

$Y_{p}w(f)=Xe_{p}w(f)H'(f)$

This approach can be useful for speaker recognition as the device identification and the speaker identification are very much connected.

Emphasis is placed on the envelope of the spectrum, which is multiplied by a filter bank (a suitable cepstrum with a mel-scale filter bank). After smoothing by the filter bank with transfer function $U(f)$, the log operation on the output energies gives:

$\log[|Y_{p}w(f)|]=\log[|U(f)||Xe_{p}w(f)||H'(f)|]$

Writing $H_{w}(f)=U(f)H'(f)$:

$\log[|Y_{p}w(f)|]=\log[|Xe_{p}w(f)|]+\log[|H_{w}(f)|]$

MFCC is successful because of this nonlinear transformation with additive property.
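
The additive property can be checked numerically. The sketch below convolves a random stand-in for a speech frame with a short, hypothetical device impulse response, then verifies that the spectra multiply and that their log magnitudes add. The signal, the impulse response values and the variable names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)              # stand-in for a speech frame x(n)
h = np.array([1.0, 0.5, 0.25, 0.125])     # hypothetical device impulse response h(n)

y = np.convolve(x, h)                     # y(n) = x(n) * h(n)

# Zero-pad every transform to len(y) so that the linear convolution
# corresponds exactly to a product of DFTs.
n = len(y)
Y = np.fft.rfft(y, n)
X = np.fft.rfft(x, n)
H = np.fft.rfft(h, n)

# Convolution in time is multiplication in frequency ...
product_matches = np.allclose(Y, X * H)

# ... and the logarithm turns that product into a sum.
log_Y = np.log(np.abs(Y))
log_sum = np.log(np.abs(X)) + np.log(np.abs(H))
sum_matches = np.allclose(log_Y, log_sum)
```

Both checks evaluate to `True`: the device contribution appears as a separate additive term in the log spectrum, which is what makes it separable in the cepstral domain.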

Transforming back to the time domain:

$c_{y}(j)=c_{e}(j)+c_{w}(j)$

where $c_{y}(j)$, $c_{e}(j)$ and $c_{w}(j)$ are the cepstra of the recorded speech, of the excitation, and of the weighted equivalent impulse response of the cell phone recorder that characterizes the cell phone, respectively, while $j$ indexes the filters in the filter bank.

More precisely, the device-specific information in the recorded speech is converted to an additive form suitable for identification.

$c_{y}(j)$ can be further processed for identification of the recording phone.

Often-used frame length: 20 ms.

Commonly used window functions: Hamming and Hanning windows.

The mel scale is a commonly used frequency scale that is linear up to 1000 Hz and logarithmic above it.

The central frequencies of the filters on the mel scale are computed as:

$f_{mel}=1000\,\frac{\log(1+f/1000)}{\log 2}$,

with logarithms to base 10.
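
A small sketch of this mapping and its inverse follows. The function names are assumptions; the inverse is obtained by solving the formula above for $f$, noting that the ratio of logarithms equals $\log_2(1+f/1000)$.

```python
import math

def hz_to_mel(f_hz):
    # f_mel = 1000 * log10(1 + f / 1000) / log10(2)
    return 1000.0 * math.log10(1.0 + f_hz / 1000.0) / math.log10(2.0)

def mel_to_hz(mel):
    # Inverse mapping: f = 1000 * (2 ** (f_mel / 1000) - 1)
    return 1000.0 * (2.0 ** (mel / 1000.0) - 1.0)

mel_at_1k = hz_to_mel(1000.0)   # 1000 Hz maps to 1000 mel
mel_at_3k = hz_to_mel(3000.0)   # 3000 Hz maps to 2000 mel
mel_at_500 = hz_to_mel(500.0)   # about 585 mel
```

Under this formula 500 Hz maps to roughly 585 mel rather than 500, so the mapping is only approximately linear below 1000 Hz.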

Basic procedure for MFCC calculation:

Logarithmic filter bank outputs are produced and multiplied by 20 to obtain spectral envelopes in decibels.
MFCCs are obtained by taking Discrete Cosine Transform (DCT) of the spectral envelope.
Cepstrum coefficients are obtained as:

$c_{i}=\sum_{n=1}^{N_{f}}S_{n}\cos\left[i(n-0.5)\frac{\pi}{N_{f}}\right]$, $i=1,2,\ldots,L$,

where $c_{i}=c_{y}(i)$ is the $i$th MFCC coefficient, $N_{f}$ is the number of triangular filters in the filter bank, $S_{n}$ is the log energy output of the $n$th filter and $L$ is the number of MFCC coefficients to be calculated.
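
The summation can be evaluated directly with NumPy, as in the sketch below. The function name and the toy inputs are assumptions; the indices follow the formula, with $n$ running from 1 to $N_f$ and $i$ from 1 to $L$.

```python
import numpy as np

def cepstral_coefficients(log_energies, n_coeffs):
    """Evaluate c_i = sum_n S_n * cos(i * (n - 0.5) * pi / N_f) for i = 1..L."""
    s = np.asarray(log_energies, dtype=float)    # S_n for n = 1..N_f
    n_f = len(s)
    n = np.arange(1, n_f + 1)                    # filter index n
    i = np.arange(1, n_coeffs + 1)[:, None]      # coefficient index i
    return (s * np.cos(i * (n - 0.5) * np.pi / n_f)).sum(axis=1)

# A flat spectral envelope carries no shape information: every c_i is ~0.
flat = cepstral_coefficients(np.ones(20), n_coeffs=12)

# An envelope shaped like the first cosine basis function excites only c_1.
n = np.arange(1, 21)
shaped = cepstral_coefficients(np.cos((n - 0.5) * np.pi / 20), n_coeffs=12)
```

For the second input, `shaped[0]` equals $N_f/2 = 10$ and the remaining coefficients are zero up to rounding, reflecting the orthogonality of the cosine basis.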

Inversion

An MFCC can be approximately inverted to audio in four steps: (a1) inverse DCT to obtain a mel log-power [dB] spectrogram, (a2) mapping to power to obtain a mel power spectrogram, (b1) rescaling to obtain short-time Fourier transform magnitudes, and finally (b2) phase reconstruction and audio synthesis using Griffin-Lim. Each step corresponds to one step in MFCC calculation.^[9]
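
Steps (a1) and (a2) can be sketched with NumPy alone. The orthonormal DCT-II convention, the $10\log_{10}$ decibel scaling and the function names are assumptions made for this example; steps (b1) and (b2) are omitted.

```python
import numpy as np

def dct_matrix(n_coeffs, n_mels):
    """Orthonormal DCT-II basis; each row is one cepstral basis function."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_mels)[None, :]
    basis = np.sqrt(2.0 / n_mels) * np.cos(np.pi * k * (n + 0.5) / n_mels)
    basis[0] /= np.sqrt(2.0)
    return basis

def mfcc_to_mel_power(mfcc, n_mels):
    """Approximate inverse of (power -> dB -> DCT) for an (n_coeffs, n_frames) array."""
    basis = dct_matrix(mfcc.shape[0], n_mels)
    mel_db = basis.T @ mfcc            # (a1) inverse DCT -> mel log-power in dB
    return 10.0 ** (mel_db / 10.0)     # (a2) dB -> mel power

# Round trip on toy data: exact when no coefficients are discarded.
rng = np.random.default_rng(0)
mel_power = rng.uniform(0.1, 10.0, size=(8, 5))          # 8 mel bands, 5 frames
mfcc = dct_matrix(8, 8) @ (10.0 * np.log10(mel_power))
recovered = mfcc_to_mel_power(mfcc, n_mels=8)
```

When fewer coefficients than mel bands are kept, the same code returns a smoothed mel spectrogram, which is one reason the inversion is only approximate.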

Noise sensitivity

MFCC values are not very robust in the presence of additive noise, and so it is common to normalise their values in speech recognition systems to lessen the influence of noise. Some researchers propose modifications to the basic MFCC algorithm to improve robustness, such as by raising the log-mel-amplitudes to a suitable power (around 2 or 3) before taking the discrete cosine transform (DCT), which reduces the influence of low-energy components.^[10]
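
The text does not say which normalisation is meant. One widely used choice, assumed here for illustration, is cepstral mean and variance normalisation, which shifts and scales each coefficient over the frames of an utterance. The function name and the `eps` guard are assumptions.

```python
import numpy as np

def cepstral_mean_variance_normalize(mfccs, eps=1e-10):
    """Normalise each coefficient to zero mean and unit variance over time.

    mfccs has shape (n_frames, n_coeffs); eps guards against division by zero.
    """
    mfccs = np.asarray(mfccs, dtype=float)
    mean = mfccs.mean(axis=0, keepdims=True)
    std = mfccs.std(axis=0, keepdims=True)
    return (mfccs - mean) / (std + eps)

rng = np.random.default_rng(1)
features = rng.standard_normal((50, 13))            # toy MFCC matrix
normalized = cepstral_mean_variance_normalize(features)

# A constant offset on every frame is removed by the normalisation.
shifted = cepstral_mean_variance_normalize(features + 5.0)
```

In this sketch `normalized` and `shifted` agree up to rounding, showing that a constant bias in the cepstral domain does not survive the normalisation.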

History

Paul Mermelstein^[11]^[12] is typically credited with the development of the MFC. Mermelstein credits Bridle and Brown^[13] for the idea:

Bridle and Brown used a set of 19 weighted spectrum-shape coefficients given by the cosine transform of the outputs of a set of nonuniformly spaced bandpass filters. The filter spacing is chosen to be logarithmic above 1 kHz and the filter bandwidths are increased there as well. We will, therefore, call these the mel-based cepstral parameters.^[11]

Sometimes both early originators are cited.^[14]

Many authors, including Davis and Mermelstein,^[12] have commented that the spectral basis functions of the cosine transform in the MFC are very similar to the principal components of the log spectra, which were applied to speech representation and recognition much earlier by Pols and his colleagues.^[15]^[16]

References

[1] Min Xu; et al. (2004). "HMM-based audio keyword generation" (PDF). In Kiyoharu Aizawa; Yuichi Nakamura; Shin'ichi Satoh (eds.). Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia. Springer. ISBN 978-3-540-23985-7. Archived from the original (PDF) on 2007-05-10.
[2] Sahidullah, Md.; Saha, Goutam (May 2012). "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition". Speech Communication. 54 (4): 543–565. doi:10.1016/j.specom.2011.11.004. S2CID 14985832.
[3] Abdulsatar, Assim Ara; Davydov, V V; Yushkova, V V; Glinushkin, A P; Rud, V Yu (2019-12-01). "Age and gender recognition from speech signals". Journal of Physics: Conference Series. 1410 (1): 012073. Bibcode:2019JPhCS1410a2073A. doi:10.1088/1742-6596/1410/1/012073. ISSN 1742-6588. S2CID 213065622.
[4] Fang Zheng, Guoliang Zhang and Zhanjiang Song (2001), "Comparison of Different Implementations of MFCC," J. Computer Science & Technology, 16(6): 582–589.
[5] S. Furui (1986), "Speaker-independent isolated word recognition based on emphasized spectral dynamics"
[6] European Telecommunications Standards Institute (2003), Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms. Technical standard ES 201 108, v1.1.3.
[7] T. Ganchev, N. Fakotakis, and G. Kokkinakis (2005), "Comparative evaluation of various MFCC implementations on the speaker verification task," in 10th International Conference on Speech and Computer (SPECOM 2005), Vol. 1, pp. 191–194.
[8] Meinard Müller (2007). Information Retrieval for Music and Motion. Springer. p. 65. ISBN 978-3-540-74047-6.
[9] "librosa.feature.inverse.mfcc_to_audio — librosa 0.10.0 documentation". librosa.org.
[10] V. Tyagi and C. Wellekens (2005), "On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition," in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE International Conference on, vol. 1, pp. 529–532.
[11] P. Mermelstein (1976), "Distance measures for speech recognition, psychological and instrumental," in Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed., pp. 374–388. Academic, New York.
[12] S.B. Davis, and P. Mermelstein (1980), "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," in IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), pp. 357–366.
[13] J. S. Bridle and M. D. Brown (1974), "An Experimental Automatic Word-Recognition System", JSRU Report No. 1003, Joint Speech Research Unit, Ruislip, England.
[14] Nelson Morgan; Hervé Bourlard & Hynek Hermansky (2004). "Automatic Speech Recognition: An Auditory Perspective". In Steven Greenberg & William A. Ainsworth (eds.). Speech Processing in the Auditory System. Springer. p. 315. ISBN 978-0-387-00590-4.
[15] L. C. W. Pols (1966), "Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words," Doctoral dissertation, Free University, Amsterdam, the Netherlands.
[16] R. Plomp, L. C. W. Pols, and J. P. van de Geer (1967). "Dimensional analysis of vowel spectra." J. Acoustical Society of America, 41(3):707–712.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
