Surrogate data, sometimes known as analogous data, [1] usually refers to time series data that is produced using well-defined (linear) models like ARMA processes that reproduce various statistical properties like the autocorrelation structure of a measured data set. [2] The resulting surrogate data can then, for example, be used to test for non-linear structure in the empirical data; this is called surrogate data testing.
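As a rough illustration of this first sense, the following sketch (Python with NumPy; the stand-in "measured" series and the autoregressive order p = 2 are arbitrary assumptions, not taken from the cited sources) fits a linear AR model to a series and then drives it with fresh noise to produce a surrogate with a similar autocorrelation structure:

```python
import numpy as np

# Minimal sketch: fit a linear AR(p) model to a measured series and drive it
# with fresh noise to produce a surrogate with a similar autocorrelation
# structure.  The "measured" series below is itself simulated for illustration.
rng = np.random.default_rng(0)
measured = np.zeros(500)
noise = rng.normal(size=500)
for t in range(2, 500):
    measured[t] = 1.0 * measured[t - 1] - 0.5 * measured[t - 2] + noise[t]

p = 2
# Least-squares fit: column k of the design matrix holds the lag-(k+1) values.
X = np.column_stack([measured[p - k - 1:len(measured) - k - 1] for k in range(p)])
y = measured[p:]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_std = np.std(y - X @ coeffs)

surrogate = list(measured[:p])
for _ in range(len(measured) - p):
    past = surrogate[-p:][::-1]                   # most recent p values, newest first
    surrogate.append(float(np.dot(coeffs, past)) + rng.normal(scale=resid_std))
surrogate = np.asarray(surrogate)
```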
Surrogate or analogous data also refers to data used to supplement available data from which a mathematical model is built. Under this definition, it may be generated (i.e., synthetic data) or transformed from another source. [1]
Surrogate data is used in environmental and laboratory settings, where study data from one source is used to estimate the characteristics of another source. [3] For example, it has been used to model population trends in animal species. [4] It can also be used to model biodiversity, as it would be difficult to gather actual data on all species in a given area. [5]
Surrogate data may be used in forecasting. Data from similar series may be pooled to improve forecast accuracy. [6] Use of surrogate data may enable a model to account for patterns not seen in historical data. [7]
Another use of surrogate data is to test models for non-linearity. The term surrogate data testing refers to algorithms used to analyze models in this way. [8] These tests typically involve generating data, whereas surrogate data in general can be produced or gathered in many ways. [1]
One method of surrogate data is to find a source with similar conditions or parameters, and use those data in modeling. [4] Another method is to focus on patterns of the underlying system, and to search for a similar pattern in related data sources (for example, patterns in other related species or environmental areas). [5]
Rather than using existing data from a separate source, surrogate data may be generated through statistical processes, [2] which may involve random data generation [1] using constraints of the model or system. [8]
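One widely used generation scheme of this kind is Fourier phase randomization, which constrains the surrogate to share the amplitude spectrum (and hence the linear autocorrelation) of the original series while destroying any non-linear structure. The sketch below is a minimal illustration in Python/NumPy; the example input series is an arbitrary assumption:

```python
import numpy as np

# Minimal sketch of one constrained-randomization scheme: Fourier phase
# randomization.  The surrogate keeps the amplitude spectrum (and therefore
# the linear autocorrelation) of the original series but randomizes the
# phases, destroying any non-linear structure.
def phase_randomized_surrogate(x, seed=None):
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                      # keep the zero-frequency (mean) term real
    if len(x) % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist term real for even length
    randomized = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(randomized, n=len(x))

rng = np.random.default_rng(1)
original = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.3 * rng.normal(size=1024)
surrogate = phase_randomized_surrogate(original, seed=2)
```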
In mathematics, the Lyapunov exponent or Lyapunov characteristic exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories. Quantitatively, two trajectories in phase space with initial separation vector δZ₀ diverge (provided the divergence can be treated within the linearized approximation) at a rate given by |δZ(t)| ≈ e^(λt) |δZ₀|, where λ is the Lyapunov exponent.
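As a concrete numerical illustration (not tied to the surrounding references), the Lyapunov exponent of a one-dimensional map can be estimated by averaging ln|f'(x_n)| along an orbit; for the logistic map at r = 4 this average approaches ln 2:

```python
import math

# Numerical illustration: estimate the Lyapunov exponent of the logistic map
# x_{n+1} = r * x_n * (1 - x_n) by averaging ln|f'(x_n)| = ln|r * (1 - 2 * x_n)|
# along an orbit after discarding a transient.
r, x = 4.0, 0.2
total, n_steps, n_transient = 0.0, 100_000, 1_000
for i in range(n_steps + n_transient):
    if i >= n_transient:
        total += math.log(abs(r * (1.0 - 2.0 * x)))
    x = r * x * (1.0 - x)
print(total / n_steps)   # approaches ln 2 ≈ 0.693 for r = 4, indicating chaos
```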
In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
J. Doyne Farmer is an American complex systems scientist and entrepreneur with interests in chaos theory, complexity and econophysics. He is Baillie Gifford Professor of Complex Systems Science at the Smith School of Enterprise and the Environment, Oxford University, where he is also director of the Complexity Economics programme at the Institute for New Economic Thinking at the Oxford Martin School. Additionally he is an external professor at the Santa Fe Institute. His current research is on complexity economics, focusing on systemic risk in financial markets and technological progress. During his career he has made important contributions to complex systems, chaos, artificial life, theoretical biology, time series forecasting and econophysics. He co-founded Prediction Company, one of the first companies to do fully automated quantitative trading. While a graduate student he led a group that called itself Eudaemonic Enterprises and built the first wearable digital computer, which was used to beat the game of roulette. He is a founder and the Chief Scientist of Macrocosm Inc, a company devoted to scaling up complexity economics methods and reducing them to practice.
In time series analysis, the Box–Jenkins method, named after the statisticians George Box and Gwilym Jenkins, applies autoregressive moving average (ARMA) or autoregressive integrated moving average (ARIMA) models to find the best fit of a time-series model to past values of a time series.
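A highly simplified stand-in for the Box–Jenkins identification, estimation and diagnostic-checking cycle is sketched below, assuming the Python statsmodels library is available; the simulated series and the candidate orders are arbitrary choices:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hedged sketch: fit candidate ARMA(p, q) models to a series and compare them
# by AIC.  This compresses the Box-Jenkins cycle into a single model-ranking
# step purely for illustration.
rng = np.random.default_rng(0)
e = rng.normal(size=600)
y = np.empty(600)
y[0] = e[0]
for t in range(1, 600):                      # simulate an ARMA(1, 1) process
    y[t] = 0.7 * y[t - 1] + e[t] + 0.4 * e[t - 1]

fits = {}
for p in range(3):
    for q in range(3):
        fits[(p, q)] = ARIMA(y, order=(p, 0, q)).fit()
best_order = min(fits, key=lambda k: fits[k].aic)
print(best_order, fits[best_order].aic)
```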
Ensemble forecasting is a method used in or within numerical weather prediction. Instead of making a single forecast of the most likely weather, a set of forecasts is produced. This set of forecasts aims to give an indication of the range of possible future states of the atmosphere. Ensemble forecasting is a form of Monte Carlo analysis. The multiple simulations are conducted to account for the two usual sources of uncertainty in forecast models: (1) the errors introduced by the use of imperfect initial conditions, amplified by the chaotic nature of the evolution equations of the atmosphere, which is often referred to as sensitive dependence on initial conditions; and (2) errors introduced because of imperfections in the model formulation, such as the approximate mathematical methods to solve the equations. Ideally, the verified future atmospheric state should fall within the predicted ensemble spread, and the amount of spread should be related to the uncertainty (error) of the forecast. In general, this approach can be used to make probabilistic forecasts of any dynamical system, and not just for weather prediction.
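The following toy sketch (Python/NumPy; the chaotic logistic map stands in for an atmospheric model, and all parameters are arbitrary assumptions) illustrates the basic idea of propagating an ensemble of perturbed initial conditions and reading off the spread:

```python
import numpy as np

# Toy illustration: propagate an ensemble of slightly perturbed initial
# conditions through a chaotic map and use the ensemble spread as a rough
# indication of forecast uncertainty.
rng = np.random.default_rng(0)
r, n_members, n_steps = 3.9, 50, 30
truth0 = 0.4
members = truth0 + rng.normal(scale=1e-4, size=n_members)   # perturbed analyses

trajectories = np.empty((n_steps, n_members))
for t in range(n_steps):
    members = r * members * (1.0 - members)
    trajectories[t] = members

spread = trajectories.std(axis=1)
print(spread[:5], spread[-5:])   # spread grows as small errors are amplified
```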
Model selection is the task of selecting a model from among various candidates on the basis of a performance criterion. In the context of machine learning and more generally statistical analysis, this may be the selection of a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice.
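A minimal sketch of model selection by hold-out validation is given below (Python/NumPy; the data-generating function, the candidate polynomial degrees and the 70/30 split are arbitrary assumptions):

```python
import numpy as np

# Hedged sketch: choose a polynomial degree by simple hold-out validation.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 120))
y = np.sin(3 * x) + 0.2 * rng.normal(size=x.size)

train = rng.random(x.size) < 0.7                 # random 70 / 30 split
scores = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[~train])
    scores[degree] = np.mean((pred - y[~train]) ** 2)

best = min(scores, key=scores.get)
print(best, scores[best])                        # degree with the lowest validation error
```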
A multifractal system is a generalization of a fractal system in which a single exponent is not enough to describe its dynamics; instead, a continuous spectrum of exponents is needed.
In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.
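A minimal statistical example is a robust z-score detector, sketched below in Python/NumPy; the injected anomalies and the threshold of 3.5 are arbitrary assumptions:

```python
import numpy as np

# Minimal sketch of a simple statistical anomaly detector: flag points whose
# robust z-score (based on the median and the median absolute deviation)
# exceeds a threshold.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=1.0, size=500)
data[[20, 340]] = [25.0, -4.0]                      # inject two anomalies

median = np.median(data)
mad = np.median(np.abs(data - median))
robust_z = 0.6745 * (data - median) / mad           # 0.6745 scales MAD to sigma
anomalies = np.flatnonzero(np.abs(robust_z) > 3.5)
print(anomalies)                                    # should include indices 20 and 340
```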
An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. The connectivity and weights of hidden neurons are fixed and randomly assigned. The weights of output neurons can be learned so that the network can produce or reproduce specific temporal patterns. The main interest of this network is that although its behavior is non-linear, the only weights that are modified during training are for the synapses that connect the hidden neurons to output neurons. Thus, the error function is quadratic with respect to the parameter vector, and minimizing it reduces to solving a linear system.
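A minimal echo state network sketch in Python/NumPy is given below; the reservoir size, spectral radius, washout length and ridge penalty are arbitrary assumptions, and the task (predicting a sine wave a few steps ahead) is chosen only for illustration:

```python
import numpy as np

# Minimal echo state network sketch: a fixed random reservoir is driven by the
# input, and only the linear readout is trained, here by ridge regression.
rng = np.random.default_rng(0)
n_in, n_res, washout, ridge = 1, 200, 100, 1e-6

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))      # scale spectral radius to 0.9

t = np.arange(2000)
u = np.sin(0.2 * t)[:, None]                         # input signal
y_target = np.sin(0.2 * (t + 5))[:, None]            # task: predict 5 steps ahead

states = np.zeros((len(t), n_res))
x = np.zeros(n_res)
for k in range(len(t)):
    x = np.tanh(W @ x + W_in @ u[k])                 # reservoir update (fixed weights)
    states[k] = x

X, Y = states[washout:], y_target[washout:]
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)   # trained readout
print(np.mean((X @ W_out - Y) ** 2))                 # training error of the readout
```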
Reservoir computing is a framework for computation derived from recurrent neural network theory that maps input signals into higher dimensional computational spaces through the dynamics of a fixed, non-linear system called a reservoir. After the input signal is fed into the reservoir, which is treated as a "black box," a simple readout mechanism is trained to read the state of the reservoir and map it to the desired output. The first key benefit of this framework is that training is performed only at the readout stage, as the reservoir dynamics are fixed. The second is that the computational power of naturally available systems, both classical and quantum mechanical, can be used to reduce the effective computational cost.
HD 11506 is a star in the equatorial constellation of Cetus. It has a yellow hue and can be viewed with a small telescope but is too faint to be visible to the naked eye, having an apparent visual magnitude of 7.51. The distance to this object is 167 light years based on parallax, but it is drifting closer to the Sun with a radial velocity of −7.5 km/s. It has an absolute magnitude of 3.94.
Modified Newtonian dynamics (MOND) is a theory that proposes a modification of Newton's second law to account for observed properties of galaxies. Its primary motivation is to explain galaxy rotation curves without invoking dark matter, and it is one of the best-known theories of this class. However, it has not gained widespread acceptance, with the majority of astrophysicists supporting the Lambda-CDM model as providing the better fit to observations.
Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be supervised, semi-supervised or unsupervised.
Surrogate data testing is a statistical proof by contradiction technique similar to permutation tests and parametric bootstrapping. It is used to detect non-linearity in a time series. The technique involves specifying a null hypothesis describing a linear process and then generating several surrogate data sets consistent with that null hypothesis, typically using Monte Carlo methods. A discriminating statistic is then calculated for the original time series and for each of the surrogate sets. If the value of the statistic is significantly different for the original series than for the surrogates, the null hypothesis is rejected and non-linearity is assumed.
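A minimal sketch of such a test is given below (Python/NumPy); it is not taken from the cited references. The null hypothesis is a linear Gaussian process, surrogates are generated by Fourier phase randomization, the discriminating statistic is a simple time-reversal asymmetry measure, and the chaotic logistic map serves only as an example of a non-linear series:

```python
import numpy as np

# Hedged sketch of a surrogate data test: linear null hypothesis, phase-
# randomized surrogates, and a time-reversal asymmetry statistic with a
# rank-based p-value.
def phase_surrogate(x, rng):
    s = np.fft.rfft(x)
    phases = rng.uniform(0, 2 * np.pi, size=s.shape)
    phases[0] = 0.0
    if len(x) % 2 == 0:
        phases[-1] = 0.0                          # keep the Nyquist term real
    return np.fft.irfft(np.abs(s) * np.exp(1j * phases), n=len(x))

def time_reversal_asymmetry(x):
    return np.mean(x[1:] ** 2 * x[:-1]) - np.mean(x[1:] * x[:-1] ** 2)

rng = np.random.default_rng(0)
x = np.empty(1000)
x[0] = 0.3
for t in range(1, 1000):                          # non-linear example series
    x[t] = 4.0 * x[t - 1] * (1.0 - x[t - 1])

t_obs = time_reversal_asymmetry(x)
t_surr = np.array([time_reversal_asymmetry(phase_surrogate(x, rng)) for _ in range(99)])
p_value = (1 + np.sum(np.abs(t_surr) >= np.abs(t_obs))) / 100
print(t_obs, p_value)        # a small p-value argues against the linear null
```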
Brain connectivity estimators represent patterns of links in the brain. Connectivity can be considered at different levels of the brain's organisation: from neurons, to neural assemblies and brain structures. Brain connectivity involves different concepts such as: neuroanatomical or structural connectivity, functional connectivity and effective connectivity.
Quantum machine learning is the integration of quantum algorithms within machine learning programs.
Adversarial machine learning is the study of attacks on machine learning algorithms, and of defenses against such attacks. A survey from May 2020 reports that practitioners see a dire need for better protection of machine learning systems in industrial applications.
Electricity price forecasting (EPF) is a branch of energy forecasting which focuses on using mathematical, statistical and machine learning models to predict electricity prices in the future. Over the last 30 years electricity price forecasts have become a fundamental input to energy companies’ decision-making mechanisms at the corporate level.
Boson sampling is a restricted model of non-universal quantum computation introduced by Scott Aaronson and Alex Arkhipov, following earlier work by Lidror Troyansky and Naftali Tishby that explored the possible use of boson scattering to evaluate expectation values of permanents of matrices. The model consists of sampling from the probability distribution of identical bosons scattered by a linear interferometer. Although the problem is well defined for any bosonic particles, its photonic version is currently considered the most promising platform for a scalable implementation of a boson sampling device, which makes it a non-universal approach to linear optical quantum computing. Moreover, while not universal, the boson sampling scheme is strongly believed to implement computing tasks that are hard for classical computers while using far fewer physical resources than a full linear-optical quantum computing setup. This advantage makes it an ideal candidate for demonstrating the power of quantum computation in the near term.
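As a hedged illustration of the connection to permanents (assuming ideal single photons, an exact interferometer unitary and a collision-free outcome), the probability of detecting one photon in each of a chosen set of output modes is |Perm(A)|², where A is the submatrix of the unitary selected by the occupied input and output modes; the mode counts and indices below are arbitrary choices:

```python
import itertools
import numpy as np

# Illustrative sketch: compute |Perm(A)|^2 for a collision-free outcome.  The
# naive permanent below is exponential in n, which is the point of the model.
def permanent(a):
    n = a.shape[0]
    return sum(np.prod([a[i, p[i]] for i in range(n)])
               for p in itertools.permutations(range(n)))

# A random unitary via QR decomposition (adequate for a sketch).
rng = np.random.default_rng(0)
m, n = 6, 3                                   # 6 modes, 3 photons
z = rng.normal(size=(m, m)) + 1j * rng.normal(size=(m, m))
u, _ = np.linalg.qr(z)

inputs, outputs = [0, 1, 2], [1, 3, 5]        # occupied input / output modes
a = u[np.ix_(outputs, inputs)]
print(abs(permanent(a)) ** 2)                 # probability of this outcome
```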
Physics-informed neural networks (PINNs), also referred to as Theory-Trained Neural Networks (TTNs), are a type of universal function approximator that can embed the knowledge of any physical laws that govern a given data set in the learning process, where those laws can be described by partial differential equations (PDEs). They overcome the low data availability of some biological and engineering systems that makes most state-of-the-art machine learning techniques lack robustness, rendering them ineffective in these scenarios. The prior knowledge of general physical laws acts in the training of neural networks (NNs) as a regularization agent that limits the space of admissible solutions, increasing the correctness of the function approximation. In this way, embedding this prior information into a neural network enhances the information content of the available data, helping the learning algorithm capture the right solution and generalize well even with a small number of training examples.
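A minimal sketch of the idea, assuming PyTorch is available, is given below; it trains a small network to satisfy the toy ODE u'(x) + u(x) = 0 with u(0) = 1 (exact solution e^(−x)), using the physics residual as the regularizing term. It is an illustration under these assumptions, not a reference implementation:

```python
import torch

# Hedged sketch of a physics-informed loss: the ODE residual u'(x) + u(x) and
# the initial condition u(0) = 1 are both driven to zero during training.
torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.linspace(0.0, 2.0, 64).reshape(-1, 1).requires_grad_(True)
x0 = torch.zeros(1, 1)

for step in range(3000):
    opt.zero_grad()
    u = net(x)
    du = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u),
                             create_graph=True)[0]
    loss = ((du + u) ** 2).mean() + ((net(x0) - 1.0) ** 2).mean()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item())   # should approach exp(-1) ≈ 0.368
```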
In environmental and laboratory assessment, surrogate data refers to data from studies of test organisms or a test substance that are used to estimate the characteristics of, or effects on, another organism or substance.