# Whittle likelihood


In statistics, the Whittle likelihood is an approximation to the likelihood function of a stationary Gaussian time series. It is named after the mathematician and statistician Peter Whittle, who introduced it in his PhD thesis in 1951. [1] It is commonly used in time series analysis and signal processing for parameter estimation and signal detection.


## Context

In a stationary Gaussian time series model, the likelihood function is (as usual in Gaussian models) a function of the associated mean and covariance parameters. With a large number (${\displaystyle N}$) of observations, the (${\displaystyle N\times N}$) covariance matrix becomes very large, making computations costly in practice. However, due to stationarity, the covariance matrix has a rather simple structure, and by using an approximation, computations may be simplified considerably (from ${\displaystyle O(N^{2})}$ to ${\displaystyle O(N\log(N))}$). [2] The idea effectively boils down to assuming a heteroscedastic zero-mean Gaussian model in the Fourier domain; the model formulation is based on the time series' discrete Fourier transform and its power spectral density. [3] [4] [5]


## Definition

Let ${\displaystyle X_{1},\ldots ,X_{N}}$ be a stationary Gaussian time series with (one-sided) power spectral density ${\displaystyle S_{1}(f)}$, where ${\displaystyle N}$ is even and samples are taken at constant sampling intervals ${\displaystyle \Delta _{t}}$. Let ${\displaystyle {\tilde {X}}_{1},\ldots ,{\tilde {X}}_{N/2+1}}$ be the (complex-valued) discrete Fourier transform (DFT) of the time series. Then for the Whittle likelihood one effectively assumes independent zero-mean Gaussian distributions for all ${\displaystyle {\tilde {X}}_{j}}$ with variances for the real and imaginary parts given by

${\displaystyle \operatorname {Var} \left(\operatorname {Re} ({\tilde {X}}_{j})\right)=\operatorname {Var} \left(\operatorname {Im} ({\tilde {X}}_{j})\right)=S_{1}(f_{j})}$

where ${\displaystyle f_{j}={\frac {j}{N\,\Delta _{t}}}}$ is the ${\displaystyle j}$th Fourier frequency. This approximate model immediately leads to the (logarithmic) likelihood function

${\displaystyle \log \left(P(x_{1},\ldots ,x_{N})\right)\propto -\sum _{j}\left(\log \left(S_{1}(f_{j})\right)+{\frac {|{\tilde {x}}_{j}|^{2}}{{\frac {N}{2\,\Delta _{t}}}S_{1}(f_{j})}}\right)}$

where ${\displaystyle |\cdot |}$ denotes the absolute value, with ${\displaystyle |{\tilde {x}}_{j}|^{2}=\left(\operatorname {Re} ({\tilde {x}}_{j})\right)^{2}+\left(\operatorname {Im} ({\tilde {x}}_{j})\right)^{2}}$.
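The log-likelihood above can be evaluated directly from a fast Fourier transform. The following is a minimal sketch, not a reference implementation: the function name `whittle_loglik` and its arguments are hypothetical, NumPy's `rfft` is assumed as the DFT convention, and the sum is taken with equal weight over all ${\displaystyle N/2+1}$ non-negative Fourier frequencies (including the real-valued zero-frequency and Nyquist bins, whose treatment the formula leaves unspecified).

```python
import numpy as np

def whittle_loglik(x, psd, dt):
    """Whittle log-likelihood, up to an additive constant.

    x   : real-valued samples of even length N
    psd : callable returning the one-sided PSD S_1(f) for an array of frequencies
    dt  : constant sampling interval
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n % 2:
        raise ValueError("the number of samples N must be even")
    xf = np.fft.rfft(x)              # N/2 + 1 complex DFT coefficients
    f = np.fft.rfftfreq(n, d=dt)     # Fourier frequencies j / (N * dt)
    s1 = psd(f)
    # -sum_j ( log S_1(f_j) + |x_j|^2 / (N / (2 dt) * S_1(f_j)) )
    return -np.sum(np.log(s1) + np.abs(xf) ** 2 / (n / (2.0 * dt) * s1))
```

For example, `whittle_loglik(x, lambda f: np.full_like(f, 2.0), dt=1.0)` evaluates the expression for a flat spectrum. The cost is dominated by the FFT, which is where the ${\displaystyle O(N\log(N))}$ scaling comes from.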

## Special case of a known noise spectrum

If the noise spectrum is assumed to be known a priori, and noise properties are not to be inferred from the data, the likelihood function may be simplified further by ignoring constant terms, leading to the sum-of-squares expression

${\displaystyle \log \left(P(x_{1},\ldots ,x_{N})\right)\;\propto \;-\sum _{j}{\frac {|{\tilde {x}}_{j}|^{2}}{{\frac {N}{2\,\Delta _{t}}}S_{1}(f_{j})}}}$

This expression is also the basis for the common matched filter.
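To illustrate how the sum-of-squares expression leads to a matched-filter-type computation, consider the simple (assumed) model in which the data are a known template scaled by an unknown real amplitude ${\displaystyle A}$, plus noise with a known spectrum. Minimising the weighted sum of squares of the residual over ${\displaystyle A}$ gives a ratio of two noise-weighted inner products. The sketch below uses hypothetical names (`fit_amplitude`, `template`) and NumPy's `rfft`; the common factor ${\displaystyle N/(2\,\Delta _{t})}$ cancels in the ratio.

```python
import numpy as np

def fit_amplitude(x, template, psd, dt):
    """Amplitude A minimising sum_j |x_j - A h_j|^2 / S_1(f_j) in the Fourier domain."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(template, dtype=float)
    f = np.fft.rfftfreq(x.size, d=dt)
    s1 = psd(f)                                   # known one-sided noise PSD
    xf, hf = np.fft.rfft(x), np.fft.rfft(h)
    num = np.sum((xf * np.conj(hf)).real / s1)    # noise-weighted correlation of data and template
    den = np.sum(np.abs(hf) ** 2 / s1)            # noise-weighted norm of the template
    return num / den

# Noise-free check: data that are exactly 3 times the template
t = np.arange(64)
h = np.sin(2.0 * np.pi * 5.0 * t / 64.0)
a_hat = fit_amplitude(3.0 * h, h, lambda f: 1.0 + f ** 2, dt=1.0)
```

In the noise-free check, `a_hat` recovers the scaling factor up to rounding error; with noisy data it is the weighted least-squares estimate under the assumed spectrum.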


## Accuracy of approximation

In general, the Whittle likelihood is only an approximation; it is exact only if the spectrum is constant, i.e., in the trivial case of white noise. The efficiency of the Whittle approximation always depends on the particular circumstances. [7] [8]
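The white-noise case can be checked numerically. The sketch below rests on two assumptions that are not part of the text above: white noise of variance ${\displaystyle \sigma ^{2}}$ is taken to have the one-sided spectrum ${\displaystyle S_{1}(f)=2\sigma ^{2}\Delta _{t}}$, and the two real-valued DFT bins (zero frequency and Nyquist) are given half weight in the sum. Under these conventions the Whittle and the exact Gaussian log-likelihoods differ only by a constant that depends on neither the data nor ${\displaystyle \sigma ^{2}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dt = 256, 0.5
x = rng.normal(size=n)               # simulated white-noise data
xf = np.fft.rfft(x)

# Assumed weighting: the real-valued DC and Nyquist bins count half
w = np.ones(n // 2 + 1)
w[0] = w[-1] = 0.5

def exact_loglik(sigma2):
    """Exact log-likelihood of i.i.d. N(0, sigma2) samples."""
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum(x ** 2) / (2.0 * sigma2)

def whittle_loglik(sigma2):
    """Whittle log-likelihood for a flat one-sided PSD 2 * sigma2 * dt."""
    s1 = 2.0 * sigma2 * dt
    return -np.sum(w * (np.log(s1) + np.abs(xf) ** 2 / (n / (2.0 * dt) * s1)))

# The difference is the same for every variance tried
diffs = [exact_loglik(v) - whittle_loglik(v) for v in (0.5, 1.0, 2.0)]
```

By Parseval's theorem the quadratic terms agree exactly, so each entry of `diffs` equals ${\displaystyle (N/2)\log(\Delta _{t}/\pi )}$ up to rounding.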


Note that, due to the linearity of the Fourier transform, Gaussianity in the Fourier domain implies Gaussianity in the time domain and vice versa. What makes the Whittle likelihood only approximately accurate is related to the sampling theorem: Fourier-transforming only a finite number of data points has an effect that also manifests itself as spectral leakage in related problems (and which may be ameliorated using the same methods, namely windowing). In the present case, the implicit periodicity assumption implies correlation between the first and last samples (${\displaystyle x_{1}}$ and ${\displaystyle x_{N}}$), which are effectively treated as "neighbouring" samples (like ${\displaystyle x_{1}}$ and ${\displaystyle x_{2}}$).
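The wrap-around effect can be made concrete with a small covariance matrix. The sketch below assumes, purely for illustration, an autocorrelation of the form ${\displaystyle \rho ^{|k|}}$ at lag ${\displaystyle k}$. The covariance of a stationary series is a Toeplitz matrix, whereas independence of the DFT coefficients corresponds to a circulant matrix in which lags wrap around the ends of the series; only the latter is diagonalised by the DFT.

```python
import numpy as np

n, rho = 8, 0.6
idx = np.arange(n)
lags = np.abs(np.subtract.outer(idx, idx))

toeplitz_cov = rho ** lags                          # stationary covariance (illustrative model)
circulant_cov = rho ** np.minimum(lags, n - lags)   # same lags, wrapped around periodically

dft = np.fft.fft(np.eye(n))                         # DFT matrix

def offdiag_max(cov):
    """Largest off-diagonal magnitude of the covariance of the DFT coefficients."""
    m = dft @ cov @ dft.conj().T / n
    return np.max(np.abs(m - np.diag(np.diag(m))))

first_last_true = toeplitz_cov[0, n - 1]            # rho ** (n - 1): nearly uncorrelated
first_last_periodic = circulant_cov[0, n - 1]       # rho: treated like neighbours
```

Here `offdiag_max(circulant_cov)` vanishes up to rounding, while `offdiag_max(toeplitz_cov)` does not, and the first and last samples have correlation `rho` under the periodic model instead of `rho ** (n - 1)`.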

## Applications

### Parameter estimation

Whittle's likelihood is commonly used to estimate signal parameters for signals that are buried in non-white noise. The noise spectrum then may be assumed known, [9] or it may be inferred along with the signal parameters. [4] [6]

### Signal detection

Signal detection is commonly performed using the matched filter, which is based on the Whittle likelihood for the case of a known noise power spectral density. [10] [11] The matched filter effectively does a maximum-likelihood fit of the signal to the noisy data and uses the resulting likelihood ratio as the detection statistic. [12]

The matched filter may be generalized to an analogous procedure based on a Student-t distribution by also considering uncertainty (e.g. estimation uncertainty) in the noise spectrum. On the technical side, this entails repeated or iterative matched-filtering. [12]

### Spectrum estimation

The Whittle likelihood is also applicable for estimation of the noise spectrum, either alone or in conjunction with signal parameters. [13] [14]
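As a sketch of spectrum estimation, the example below fits a one-parameter spectral model by maximising the Whittle likelihood over a grid. Everything specific here is an assumption made for illustration: the data are simulated from a first-order autoregressive process with known innovation variance, the one-sided spectrum is taken as ${\displaystyle 2\Delta _{t}\sigma ^{2}/|1-\phi e^{-2\pi if\Delta _{t}}|^{2}}$, and a plain grid search stands in for a proper optimiser or a Bayesian treatment.

```python
import numpy as np

def ar1_psd(f, phi, sigma2, dt):
    """One-sided PSD of the AR(1) process x_t = phi * x_{t-1} + eps_t (assumed convention)."""
    return 2.0 * dt * sigma2 / np.abs(1.0 - phi * np.exp(-2j * np.pi * f * dt)) ** 2

def whittle_loglik(x, s1, dt):
    """Whittle log-likelihood (up to a constant) for PSD values s1 at the rfft frequencies."""
    n = x.size
    xf = np.fft.rfft(x)
    return -np.sum(np.log(s1) + np.abs(xf) ** 2 / (n / (2.0 * dt) * s1))

# Simulate an AR(1) series with unit innovation variance
rng = np.random.default_rng(1)
n, dt, phi_true = 4096, 1.0, 0.7
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0] / np.sqrt(1.0 - phi_true ** 2)    # start from the stationary distribution
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + eps[t]

# Grid search for the coefficient that maximises the Whittle likelihood
f = np.fft.rfftfreq(n, d=dt)
grid = np.linspace(-0.95, 0.95, 191)
loglik = [whittle_loglik(x, ar1_psd(f, p, 1.0, dt), dt) for p in grid]
phi_hat = grid[int(np.argmax(loglik))]
```

With a series of this length, `phi_hat` should land close to the simulated value of 0.7; signal parameters could be estimated jointly by subtracting a signal model from the data before the transform.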


## References

1. Whittle, P. (1951). Hypothesis testing in time series analysis. Uppsala: Almqvist & Wiksells Boktryckeri AB.
2. Hurvich, C. (2002). "Whittle's approximation to the likelihood function" (PDF). NYU Stern.
3. Calder, M.; Davis, R. A. (1997), "An introduction to Whittle (1953) 'The analysis of multiple stationary time series'", in Kotz, S.; Johnson, N. L., Breakthroughs in Statistics, Springer Series in Statistics, New York: Springer-Verlag, pp. 141–169, doi:10.1007/978-1-4612-0667-5_7, ISBN 978-0-387-94989-5
See also: Calder, M.; Davis, R. A. (1996), "An introduction to Whittle (1953) 'The analysis of multiple stationary time series'", Technical report 1996/41, Department of Statistics, Colorado State University
4. Hannan, E. J. (1994), "The Whittle likelihood and frequency estimation", in Kelly, F. P., Probability, statistics and optimization; a tribute to Peter Whittle, Chichester: Wiley
5. Pawitan, Y. (1998), "Whittle likelihood", in Kotz, S.; Read, C. B.; Banks, D. L., Encyclopedia of Statistical Sciences, Update Volume 2, New York: Wiley & Sons, pp. 708–710, doi:10.1002/0471667196.ess0753, ISBN 978-0471667193
6. Röver, C.; Meyer, R.; Christensen, N. (2011). "Modelling coloured residual noise in gravitational-wave signal processing". Classical and Quantum Gravity. 28 (1): 015010. Bibcode:2011CQGra..28a5010R. doi:10.1088/0264-9381/28/1/015010.
7. Choudhuri, N.; Ghosal, S.; Roy, A. (2004). "Contiguity of the Whittle measure for a Gaussian time series". Biometrika. 91 (1): 211–218. doi:10.1093/biomet/91.1.211.
8. Contreras-Cristán, A.; Gutiérrez-Peña, E.; Walker, S. G. (2006). "A Note on Whittle's Likelihood". Communications in Statistics – Simulation and Computation. 35 (4): 857–875. doi:10.1080/03610910600880203.
9. Finn, L. S. (1992). "Detection, measurement and gravitational radiation". Physical Review D. 46 (12): 5236–5249. Bibcode:1992PhRvD..46.5236F. doi:10.1103/PhysRevD.46.5236.
10. Turin, G. L. (1960). "An introduction to matched filters". IRE Transactions on Information Theory. 6 (3): 311–329. doi:10.1109/TIT.1960.1057571.
11. Wainstein, L. A.; Zubakov, V. D. (1962). Extraction of signals from noise. Englewood Cliffs, NJ: Prentice-Hall.
12. Röver, C. (2011). "Student-t-based filter for robust signal detection". Physical Review D. 84 (12): 122004. Bibcode:2011PhRvD..84l2004R. doi:10.1103/PhysRevD.84.122004.
13. Choudhuri, N.; Ghosal, S.; Roy, A. (2004). "Bayesian estimation of the spectral density of a time series" (PDF). Journal of the American Statistical Association. 99 (468): 1050–1059. doi:10.1198/016214504000000557.
14. Edwards, M. C.; Meyer, R.; Christensen, N. (2015). "Bayesian semiparametric power spectral density estimation in gravitational wave data analysis". Physical Review D. 92 (6): 064011. Bibcode:2015PhRvD..92f4011E. doi:10.1103/PhysRevD.92.064011.