Symbolic regression

[Figure: an expression tree, as it can be used in symbolic regression to represent a function.]

Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity.

No particular model is provided as a starting point for symbolic regression. Instead, initial expressions are formed by randomly combining mathematical building blocks such as mathematical operators, analytic functions, constants, and state variables. Usually, a subset of these primitives is specified by the person operating the system, but that is not a requirement of the technique. The symbolic regression problem for mathematical functions has been tackled with a variety of methods, including recombining equations, most commonly using genetic programming, [1] as well as more recent methods utilizing Bayesian methods [2] and neural networks. [3] Another non-classical alternative to SR is the Universal Functions Originator (UFO), which has a different mechanism, search space, and building strategy. [4] Further methods such as Exact Learning attempt to transform the fitting problem into a moments problem in a natural function space, usually built around generalizations of the Meijer G-function. [5]
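
As an illustration, the following is a minimal sketch of this initialization step: candidate expressions are grown as random trees over a small primitive set. The particular operators, variables, and depth limit chosen here are arbitrary, not taken from any specific SR system.

```python
import math
import random

# Illustrative primitive set: binary operators, unary functions, and
# terminals (state variables and random constants).
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}
VARIABLES = ["x1", "x2"]

def random_tree(depth):
    """Grow a random expression tree, represented as nested tuples."""
    if depth == 0 or random.random() < 0.3:
        # Terminal: either a state variable or a random constant.
        return random.choice(VARIABLES + [round(random.uniform(-2, 2), 2)])
    if random.random() < 0.5:
        op = random.choice(list(UNARY))
        return (op, random_tree(depth - 1))
    op = random.choice(list(BINARY))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, env):
    """Evaluate a tree for a variable assignment such as {'x1': 0.5, 'x2': 1.0}."""
    if isinstance(tree, str):
        return env[tree]
    if isinstance(tree, float):
        return tree
    if len(tree) == 2:
        return UNARY[tree[0]](evaluate(tree[1], env))
    return BINARY[tree[0]](evaluate(tree[1], env), evaluate(tree[2], env))

# An initial population of random candidate expressions.
population = [random_tree(depth=3) for _ in range(10)]
```

In a genetic-programming SR system, such trees would then be evolved by crossover and mutation, with a fitness function scoring each candidate against the dataset.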

By not requiring a priori specification of a model, symbolic regression is not affected by human bias or by unknown gaps in domain knowledge. It attempts to uncover the intrinsic relationships of the dataset by letting the patterns in the data itself reveal the appropriate models, rather than imposing a model structure that is deemed mathematically tractable from a human perspective. The fitness function that drives the evolution of the models takes into account not only error metrics (to ensure the models accurately predict the data), but also special complexity measures, [6] thus ensuring that the resulting models reveal the data's underlying structure in a way that is understandable from a human perspective. This facilitates reasoning and improves the odds of gaining insight into the data-generating system, as well as improving generalisability and extrapolation behaviour by preventing overfitting. Accuracy and simplicity may be left as two separate objectives of the regression—in which case the optimum solutions form a Pareto front—or they may be combined into a single objective by means of a model selection principle such as minimum description length.
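
For instance, here is a minimal sketch of both options, assuming each candidate has already been assigned an error value and a complexity value (a simple node count stands in for the more refined complexity measures used in practice):

```python
def pareto_front(models):
    """Keep the models that no other model beats on both error and complexity."""
    return [m for m in models
            if not any(o["error"] < m["error"] and o["complexity"] <= m["complexity"]
                       or o["error"] <= m["error"] and o["complexity"] < m["complexity"]
                       for o in models)]

def combined_score(model, lam=0.01):
    """Collapse both objectives into one number, in the spirit of a
    description-length-style penalty: error plus weighted complexity."""
    return model["error"] + lam * model["complexity"]

models = [{"expr": "x1 + x2",            "error": 0.10, "complexity": 3},
          {"expr": "x1 * 1.0 + x2",      "error": 0.10, "complexity": 5},
          {"expr": "sin(x1) * x2 + x1",  "error": 0.02, "complexity": 7},
          {"expr": "x1",                 "error": 0.25, "complexity": 1}]

front = pareto_front(models)            # the accuracy-simplicity trade-off curve
best = min(models, key=combined_score)  # a single winner under one objective
```

Here the second model is dominated (same error as the first but more complex) and drops off the Pareto front, while the combined score singles out one model whose balance of error and complexity depends on the chosen weight.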

It has been proven that symbolic regression is an NP-hard problem, in the sense that one cannot always find the best possible mathematical expression to fit a given dataset in polynomial time. [7] Nevertheless, if the sought-for equation is not too complex it is possible to solve the symbolic regression problem exactly by generating every possible function (built from some predefined set of operators) and evaluating each of them on the dataset in question. [8]
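
A brute-force sketch of that exhaustive strategy (an illustration of the idea, not the algorithm of the cited work): enumerate every expression up to a small depth over an assumed operator set and keep the best fit.

```python
import itertools
import math

def expressions(depth):
    """Yield every expression tree up to `depth` over a fixed primitive set."""
    if depth == 0:
        yield "x"
        return
    yield from expressions(depth - 1)
    for sub in expressions(depth - 1):
        yield ("sin", sub)
    for a, b in itertools.product(list(expressions(depth - 1)), repeat=2):
        yield ("+", a, b)
        yield ("*", a, b)

def evaluate(tree, x):
    if tree == "x":
        return x
    if tree[0] == "sin":
        return math.sin(evaluate(tree[1], x))
    a, b = evaluate(tree[1], x), evaluate(tree[2], x)
    return a + b if tree[0] == "+" else a * b

# Toy dataset generated from the target function sin(x) + x.
data = [(x / 10, math.sin(x / 10) + x / 10) for x in range(20)]

best = min(expressions(depth=2),
           key=lambda t: sum((evaluate(t, x) - y) ** 2 for x, y in data))
# best == ('+', 'x', ('sin', 'x')), i.e. x + sin(x): the generating equation
```

The number of candidate expressions grows explosively with depth and operator count, which is why exhaustive search is only feasible when the sought-for equation is small.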

Difference from classical regression

While conventional regression techniques seek to optimize the parameters for a pre-specified model structure, symbolic regression avoids imposing prior assumptions, and instead infers the model from the data. In other words, it attempts to discover both model structures and model parameters.

This approach has the disadvantage of a much larger space to search: not only is the search space in symbolic regression infinite, but there is an infinite number of models which will perfectly fit a finite data set (provided that the model complexity is not artificially limited). This means that a symbolic regression algorithm will possibly take longer to find an appropriate model and parametrization than traditional regression techniques would. This can be attenuated by limiting the set of building blocks provided to the algorithm, based on existing knowledge of the system that produced the data; but in the end, using symbolic regression is a decision that has to be balanced with how much is known about the underlying system.
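
As a concrete sketch of this contrast, using the open-source gplearn library as one SR implementation among many (the toy dataset and parameter values here are illustrative): classical regression estimates the coefficients of a formula fixed in advance, whereas the symbolic regressor receives only a restricted set of building blocks and searches for the formula itself.

```python
import numpy as np
from scipy.optimize import curve_fit
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
y = X[:, 0] ** 2 - X[:, 1] + 0.1 * rng.normal(size=200)

# Classical regression: the structure a*x0^2 + b*x1 + c is imposed by hand,
# and only the parameters a, b, c are estimated.
popt, _ = curve_fit(lambda x, a, b, c: a * x[0] ** 2 + b * x[1] + c, X.T, y)

# Symbolic regression: only the primitive set is given; the model structure
# itself is evolved. Limiting the building blocks shrinks the search space.
sr = SymbolicRegressor(population_size=1000, generations=20,
                       function_set=('add', 'sub', 'mul'),
                       parsimony_coefficient=0.01, random_state=0)
sr.fit(X, y)
print(sr._program)  # e.g. sub(mul(X0, X0), X1)
```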

Nevertheless, this characteristic of symbolic regression also has advantages: because the evolutionary algorithm requires diversity in order to effectively explore the search space, the result is likely to be a selection of high-scoring models (and their corresponding sets of parameters). Examining this collection can provide better insight into the underlying process and allows the user to identify an approximation that better fits their needs in terms of accuracy and simplicity.

Benchmarking

SRBench

In 2021, SRBench [9] was proposed as a large benchmark for symbolic regression. At its inception, SRBench featured 14 symbolic regression methods, 7 other ML methods, and 252 datasets from PMLB. The benchmark is intended to be a living project: it encourages the submission of improvements, new datasets, and new methods, to keep track of the state of the art in SR.

SRBench Competition 2022

In 2022, SRBench announced the competition Interpretable Symbolic Regression for Data Science, which was held at the GECCO conference in Boston, MA. The competition pitted nine leading symbolic regression algorithms against each other on a novel set of data problems and considered different evaluation criteria. The competition was organized in two tracks, a synthetic track and a real-world data track. [10]

Synthetic Track

In the synthetic track, methods were compared according to five properties: re-discovery of exact expressions; feature selection; resistance to local optima; extrapolation; and sensitivity to noise. Rankings of the methods were:

  1. QLattice
  2. PySR (Python Symbolic Regression)
  3. uDSR (Deep Symbolic Optimization)

Real-world Track

In the real-world track, methods were trained to build interpretable predictive models for 14-day forecast counts of COVID-19 cases, hospitalizations, and deaths in New York State. These models were reviewed by a subject expert, assigned trust ratings, and evaluated for accuracy and simplicity. The ranking of the methods was:

  1. uDSR (Deep Symbolic Optimization)
  2. QLattice
  3. geneticengine (Genetic Engine)

Non-standard methods

Most symbolic regression algorithms prevent combinatorial explosion by implementing evolutionary algorithms that iteratively improve the best-fit expression over many generations. Recently, researchers have proposed algorithms utilizing other tactics in AI.

Silviu-Marian Udrescu and Max Tegmark developed the "AI Feynman" algorithm, [11] [12] which attempts symbolic regression by training a neural network to represent the mystery function, then running tests against the neural network to attempt to break up the problem into smaller parts. For example, if f(x1, x2) = g(x1) + h(x2), tests against the neural network can recognize the separation and proceed to solve for g and h separately, with different variables as inputs. This is an example of divide and conquer, which reduces the size of the problem to be more manageable. AI Feynman also transforms the inputs and outputs of the mystery function in order to produce a new function which can be solved with other techniques, and performs dimensional analysis to reduce the number of independent variables involved. The algorithm was able to "discover" 100 equations from The Feynman Lectures on Physics, while a leading software package using evolutionary algorithms, Eureqa, solved only 71. AI Feynman, in contrast to classic symbolic regression methods, requires a very large dataset in order to first train the neural network, and is naturally biased towards equations that are common in elementary physics.
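
A toy version of such a separability test, with an ordinary function standing in for the trained neural-network surrogate: if f(x1, x2) = g(x1) + h(x2), then f(a, b) + f(c, d) = f(a, d) + f(c, b) for all points, so checking that identity numerically at random points flags additive separability. (This is a sketch of the general idea, not the exact test used by AI Feynman.)

```python
import math
import random

def is_additively_separable(f, trials=100, tol=1e-8):
    """Numerically test whether f(x1, x2) decomposes as g(x1) + h(x2).
    For a separable f, f(a,b) + f(c,d) - f(a,d) - f(c,b) vanishes identically."""
    for _ in range(trials):
        a, b, c, d = (random.uniform(-2, 2) for _ in range(4))
        if abs(f(a, b) + f(c, d) - f(a, d) - f(c, b)) > tol:
            return False
    return True

print(is_additively_separable(lambda x1, x2: x1 ** 2 + math.sin(x2)))  # True
print(is_additively_separable(lambda x1, x2: x1 * x2))                 # False
```

A positive test lets the solver recurse on two smaller single-variable problems, which is what makes the divide-and-conquer strategy effective.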

Software

End-user software

See also


References

  1. Michael Schmidt; Hod Lipson (2009). "Distilling free-form natural laws from experimental data". Science. 324 (5923). American Association for the Advancement of Science: 81–85. Bibcode:2009Sci...324...81S. CiteSeerX 10.1.1.308.2245. doi:10.1126/science.1165893. PMID 19342586. S2CID 7366016.
  2. Ying Jin; Weilin Fu; Jian Kang; Jiadong Guo; Jian Guo (2019). "Bayesian Symbolic Regression". arXiv:1910.08892 [stat.ME].
  3. Silviu-Marian Udrescu; Max Tegmark (2020). "AI Feynman: A physics-inspired method for symbolic regression". Science Advances. 6 (16). American Association for the Advancement of Science: eaay2631. Bibcode:2020SciA....6.2631U. doi:10.1126/sciadv.aay2631. PMC 7159912. PMID 32426452.
  4. Ali R. Al-Roomi; Mohamed E. El-Hawary (2020). "Universal Functions Originator". Applied Soft Computing. 94. Elsevier B.V.: 106417. doi:10.1016/j.asoc.2020.106417. ISSN 1568-4946. S2CID 219743405.
  5. Benedict W. J. Irwin (2021). "A Natural Representation of Functions for Exact Learning" (PDF) (Preprint). doi:10.21203/rs.3.rs-149856/v1. S2CID 234014141.
  6. Ekaterina J. Vladislavleva; Guido F. Smits; Dick Den Hertog (2009). "Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming" (PDF). IEEE Transactions on Evolutionary Computation. 13 (2): 333–349. doi:10.1109/tevc.2008.926486. S2CID 12072764.
  7. Virgolin, Marco; Pissis, Solon P. (2022). "Symbolic Regression is NP-hard". Transactions on Machine Learning Research.
  8. Bartlett, Deaglan; Desmond, Harry; Ferreira, Pedro (2023). "Exhaustive Symbolic Regression". IEEE Transactions on Evolutionary Computation. 28 (4): 1. arXiv:2211.11461. doi:10.1109/TEVC.2023.3280250. S2CID 253735380.
  9. La Cava, William; Orzechowski, Patryk; Burlacu, Bogdan; de Franca, Fabricio; Virgolin, Marco; Jin, Ying; Kommenda, Michael; Moore, Jason (2021). "Contemporary Symbolic Regression Methods and their Relative Performance". Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. 1. arXiv:2107.14351.
  10. Michael Kommenda; William La Cava; Maimuna Majumder; Fabricio Olivetti de França; Marco Virgolin. "SRBench Competition 2022: Interpretable Symbolic Regression for Data Science".
  11. Udrescu, Silviu-Marian; Tegmark, Max (2020-04-17). "AI Feynman: A physics-inspired method for symbolic regression". Science Advances. 6 (16): eaay2631. arXiv:1905.11481. Bibcode:2020SciA....6.2631U. doi:10.1126/sciadv.aay2631. ISSN 2375-2548. PMC 7159912. PMID 32426452.
  12. Udrescu, Silviu-Marian; Tan, Andrew; Feng, Jiahai; Neto, Orisvaldo; Wu, Tailin; Tegmark, Max (2020-12-16). "AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity". arXiv:2006.10782 [cs.LG].
  13. "Feyn is a Python module for running the QLattice". June 22, 2022.
  14. Kevin René Broløs; Meera Vieira Machado; Chris Cave; Jaan Kasak; Valdemar Stentoft-Hansen; Victor Galindo Batanero; Tom Jelen; Casper Wilstrup (2021-04-12). "An Approach to Symbolic Regression Using Feyn". arXiv:2104.05417 [cs.LG].
  15. Zhang, Hengzhe; Zhou, Aimin; Zhang, Hu (August 2022). "An Evolutionary Forest for Regression". IEEE Transactions on Evolutionary Computation. 26 (4): 735–749. doi:10.1109/TEVC.2021.3136667. ISSN 1089-778X.
  16. Zhang, Hengzhe; Zhou, Aimin; Chen, Qi; Xue, Bing; Zhang, Mengjie (2023). "SR-Forest: A Genetic Programming based Heterogeneous Ensemble Learning Method". IEEE Transactions on Evolutionary Computation. 28 (5): 1484–1498. doi:10.1109/TEVC.2023.3243172. ISSN 1089-778X.
  17. "Deep symbolic optimization". GitHub. June 22, 2022.
  18. "Differentiable Cartesian Genetic Programming, v1.6 Documentation". June 10, 2022.
  19. Izzo, Dario; Biscani, Francesco; Mereta, Alessio (2016). "Differentiable genetic programming". Proceedings of the European Conference on Genetic Programming. arXiv:1611.04766.
  20. "High-Performance Symbolic Regression in Python". GitHub. 18 August 2022.
  21. "'Machine Scientists' Distill the Laws of Physics From Raw Data". Quanta Magazine. May 10, 2022.

Further reading