Mixed-data sampling

Econometric models involving data sampled at different frequencies are of general interest. Mixed-data sampling (MIDAS) is an econometric regression framework developed by Eric Ghysels with several co-authors. There is now a substantial literature on MIDAS regressions and their applications, including Ghysels, Santa-Clara and Valkanov (2006),[1] Ghysels, Sinko and Valkanov,[2] Andreou, Ghysels and Kourtellos (2010)[3] and Andreou, Ghysels and Kourtellos (2013).[4]

MIDAS Regressions

A MIDAS regression is a direct forecasting tool that relates future low-frequency data to current and lagged high-frequency indicators, yielding a different forecasting model for each forecast horizon. It can flexibly handle data sampled at different frequencies and provides a direct forecast of the low-frequency variable. Because each individual high-frequency observation enters the regression directly, it avoids both the loss of potentially useful information and the misspecification that temporal aggregation can introduce.

A simple regression example has the independent variable appearing at a higher frequency than the dependent variable:

$$y_t = \beta_0 + \beta_1 B(L^{1/m};\theta)\, x_t^{(m)} + \varepsilon_t^{(m)}$$

where $y_t$ is the dependent variable, $x_t^{(m)}$ is the regressor, $m$ denotes the frequency ratio (for instance, if $y_t$ is yearly then $x_t^{(m)}$ is quarterly), $\varepsilon_t^{(m)}$ is the disturbance, and $B(L^{1/m};\theta)$ is a lag distribution, for instance the Beta function or the Almon lag. For example, $B(L^{1/m};\theta) = \sum_{k=0}^{K} B(k;\theta) L^{k/m}$, where the fractional lag operator satisfies $L^{k/m} x_t^{(m)} = x_{t-k/m}^{(m)}$.
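As a concrete illustration, the following sketch fits such a regression with an exponential Almon lag (one common choice of lag distribution, alongside the Beta lag) by nonlinear least squares. It is a minimal sketch on simulated data, assuming quarterly $y_t$ and twelve monthly lags of the regressor; the function names and parameter values are hypothetical and not taken from any MIDAS package.

```python
import numpy as np
from scipy.optimize import least_squares

def exp_almon_weights(theta1, theta2, n_lags):
    """Exponential Almon lag weights, normalized to sum to one."""
    j = np.arange(n_lags)
    w = np.exp(theta1 * j + theta2 * j**2)
    return w / w.sum()

def midas_residuals(params, y, X_hf):
    """Residuals of y_t = b0 + b1 * sum_k w_k(theta) * x_{t-k/m}."""
    b0, b1, theta1, theta2 = params
    w = exp_almon_weights(theta1, theta2, X_hf.shape[1])
    return y - (b0 + b1 * (X_hf @ w))

# Simulated data: 80 quarterly observations of y, each row of X_hf
# holding the 12 most recent monthly lags of x (newest first).
rng = np.random.default_rng(0)
n_obs, n_lags = 80, 12
X_hf = rng.standard_normal((n_obs, n_lags))
y = 0.5 + 2.0 * (X_hf @ exp_almon_weights(0.1, -0.05, n_lags)) \
    + 0.1 * rng.standard_normal(n_obs)

# Nonlinear least squares over (b0, b1, theta1, theta2).
fit = least_squares(midas_residuals, x0=[0.0, 1.0, 0.0, 0.0], args=(y, X_hf))
b0_hat, b1_hat, t1_hat, t2_hat = fit.x
print("intercept and slope:", b0_hat, b1_hat)
print("fitted lag weights:", exp_almon_weights(t1_hat, t2_hat, n_lags))
```

Note how the two $\theta$ parameters pin down the entire profile of twelve lag weights; this parsimony, relative to estimating twelve unrestricted coefficients, is the central appeal of the MIDAS lag distribution.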

MIDAS regressions can in some cases be viewed as substitutes for the Kalman filter applied in the context of mixed frequency data. Bai, Ghysels and Wright (2013)[5] examine the relationship between MIDAS regressions and Kalman filter state space models applied to mixed frequency data. In general, the latter involve a system of equations, whereas MIDAS regressions involve a single (reduced-form) equation. As a consequence, MIDAS regressions might be less efficient, but also less prone to specification errors. In cases where the MIDAS regression is only an approximation, the approximation errors tend to be small.

Machine Learning MIDAS Regressions

MIDAS can also be used for machine learning time series and panel data nowcasting.[6][7] The machine learning MIDAS regressions involve Legendre polynomials. High-dimensional mixed frequency time series regressions involve certain data structures that, once taken into account, should improve the performance of unrestricted estimators in small samples. These structures are represented by groups covering lagged dependent variables and groups of lags for a single (high-frequency) covariate. To that end, the machine learning MIDAS approach exploits the sparse-group LASSO (sg-LASSO) regularization, which conveniently accommodates such structures.[8] The attractive feature of the sg-LASSO estimator is that it allows one to effectively combine approximately sparse and dense signals.
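The construction below sketches the Legendre-polynomial device on simulated data. It is illustrative only: shifted Legendre polynomials over the lag window collapse each covariate's block of high-frequency lags into a few basis coefficients, and since scikit-learn provides no sparse-group LASSO, a plain LASSO stands in for the sg-LASSO penalty; all names and values are assumptions, not the midasml implementation.

```python
import numpy as np
from numpy.polynomial import legendre
from sklearn.linear_model import Lasso

def legendre_dictionary(n_lags, degree):
    """(n_lags x (degree + 1)) matrix of Legendre polynomials evaluated
    on an equispaced grid over the lag window, rescaled to [-1, 1]."""
    grid = np.linspace(0.0, 1.0, n_lags)
    return legendre.legvander(2.0 * grid - 1.0, degree)

rng = np.random.default_rng(1)
n_obs, n_lags, degree = 200, 12, 3
X_hf = rng.standard_normal((n_obs, n_lags))   # high-frequency lags per obs
P = legendre_dictionary(n_lags, degree)

# Each covariate's n_lags columns collapse to degree + 1 columns; in the
# sg-LASSO these degree + 1 coefficients would form one penalized group.
Z = X_hf @ P

gamma_true = np.array([1.0, -0.5, 0.0, 0.0])  # sparse in the Legendre basis
y = Z @ gamma_true + 0.1 * rng.standard_normal(n_obs)

model = Lasso(alpha=0.05).fit(Z, y)
print("basis coefficients:", model.coef_)
print("implied lag-weight profile:", (P @ model.coef_).round(3))
```

In the sg-LASSO itself, the penalty mixes an elementwise $\ell_1$ norm with a group-level norm over each covariate's basis coefficients, which is what lets the estimator adapt between approximately sparse and dense signals.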

Software packages

Several software packages feature MIDAS regressions and related econometric methods. These include:

- MIDAS Matlab Toolbox [9]
- midasr, R package [10]
- midasml, R package for high-dimensional mixed frequency data [11]
- EViews [12]
- Python [13]
- Julia [14]

References

  1. Ghysels, Eric; Santa-Clara, Pedro; Valkanov, Rossen (2006). "Predicting Volatility: How to Get Most Out of Returns Data Sampled at Different Frequencies". Journal of Econometrics, 131, 59–95.
  2. Ghysels, Eric; Sinko, Arthur; Valkanov, Rossen (2006). "MIDAS Regressions: Further Results and New Directions". Econometric Reviews, 26, 53–90.
  3. Andreou, Elena; Ghysels, Eric; Kourtellos, Andros (2010). "Regression Models with Mixed Sampling Frequencies". Journal of Econometrics, 158, 246–261.
  4. Andreou, Elena; Ghysels, Eric; Kourtellos, Andros (2013). "Should macroeconomic forecasters use daily financial data and how?". Journal of Business and Economic Statistics, 31, 240–251.
  5. Bai, Jennie; Ghysels, Eric; Wright, Jonathan (2013). "State Space Models and MIDAS Regressions". Econometric Reviews, 32, 779–813.
  6. Babii, Andrii; Ghysels, Eric; Striaukas, Jonas (2022). "Machine Learning Time Series Regressions With an Application to Nowcasting". Journal of Business & Economic Statistics, 40(3), 1094–1106. arXiv:2005.14057. doi:10.1080/07350015.2021.1899933.
  7. Babii, Andrii; Ball, Ryan T.; Ghysels, Eric; Striaukas, Jonas (2022). "Machine learning panel data regressions with heavy-tailed dependent data: Theory and application". Journal of Econometrics, 105315. arXiv:2008.03600. doi:10.1016/j.jeconom.2022.07.001.
  8. Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. (2013). "A Sparse-Group LASSO". Journal of Computational and Graphical Statistics, 22(2), 231–245.
  9. "MIDAS Matlab Toolbox", maintained by Hang Qian.
  10. "midasr: Mixed Data Sampling Regression", maintained by Virmantas Kvedaras and Vaidotas Zemlys-Balevicius. 23 February 2021.
  11. "midasml: Estimation and Prediction Methods for High-Dimensional Mixed Frequency Time Series Data", maintained by Jonas Striaukas. 29 April 2022.
  12. "EViews 9.5 MIDAS Forecasting Demonstration".
  13. "MIDAS Python code". GitHub.
  14. "MIDAS Julia". GitHub.
