XGBoost

Developer(s) The XGBoost Contributors
Initial release March 27, 2014 (2014-03-27)
Stable release
2.0.3 [1] / 19 December 2023
Repository
Written in C++
Operating system Linux, macOS, Microsoft Windows
Type Machine learning
License Apache License 2.0
Website xgboost.ai

XGBoost [2] (eXtreme Gradient Boosting) is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, [3] R, [4] Julia, [5] Perl, [6] and Scala. It works on Linux, Microsoft Windows, [7] and macOS. [8] From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask. [9] [10]
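The library exposes both a native interface and language wrappers. As an illustration only (synthetic data, arbitrary parameter choices), a minimal sketch of the native Python workflow of building a DMatrix, training a booster, and predicting looks like this:

```python
# Minimal illustrative sketch of the native Python API (synthetic data).
import numpy as np
import xgboost as xgb

# Synthetic binary-classification problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# DMatrix is XGBoost's internal data container.
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",  # logistic loss for binary classification
    "max_depth": 3,
    "eta": 0.1,                      # learning rate
}

# Train a gradient-boosted tree ensemble for 100 boosting rounds.
booster = xgb.train(params, dtrain, num_boost_round=100)

# Predicted probabilities for the training data.
print(booster.predict(dtrain)[:5])
```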

XGBoost gained much popularity and attention in the mid-2010s as the algorithm of choice for many winning teams of machine learning competitions. [11]

History

XGBoost initially started as a research project by Tianqi Chen [12] as part of the Distributed (Deep) Machine Learning Community (DMLC) group. It began as a terminal application which could be configured using a libsvm configuration file, and it became well known in machine learning competition circles after its use in the winning solution of the Higgs Machine Learning Challenge. Soon after, Python and R packages were built, and XGBoost now has package implementations for Java, Scala, Julia, Perl, and other languages. This brought the library to more developers and contributed to its popularity in the Kaggle community, where it has been used in a large number of competitions. [11]

It was soon integrated with a number of other packages, making it easier to use in their respective communities. It has now been integrated with scikit-learn for Python users and with the caret package for R users. It can also be integrated into data-flow frameworks such as Apache Spark, Apache Hadoop, and Apache Flink using the abstracted Rabit [13] and XGBoost4J. [14] XGBoost is also available on OpenCL for FPGAs. [15] An efficient, scalable implementation of XGBoost has been published by Tianqi Chen and Carlos Guestrin. [16]
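As an example of the scikit-learn integration mentioned above, the sketch below (synthetic data, arbitrary hyperparameters) uses the XGBClassifier wrapper, which follows the standard scikit-learn estimator interface and therefore works with utilities such as cross-validation:

```python
# Illustrative sketch of the scikit-learn wrapper (not a tuned model).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = XGBClassifier(
    n_estimators=200,       # number of boosting rounds
    max_depth=4,
    learning_rate=0.1,
    eval_metric="logloss",
)

# Because XGBClassifier implements fit/predict, scikit-learn utilities
# such as cross_val_score work directly on it.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```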

While the XGBoost model often achieves higher accuracy than a single decision tree, it sacrifices the intrinsic interpretability of decision trees. For example, following the path that a decision tree takes to make its decision is trivial and self-explanatory, but following the paths of hundreds or thousands of trees is much harder.

Features

Salient features of XGBoost which make it different from other gradient boosting algorithms include: [17] [18] [19] [16]

  - Clever penalization of trees
  - A proportional shrinking of leaf nodes
  - Newton Boosting
  - Extra randomization parameter
  - Implementation on single, distributed systems and out-of-core computation
  - Automatic feature selection
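As an illustration (not an exhaustive mapping), several of these features surface as training parameters in the Python package; the values below are arbitrary examples rather than recommendations:

```python
# Sketch of parameters related to the features above (arbitrary example values).
import numpy as np
import xgboost as xgb

# Tiny synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

params = {
    "objective": "reg:squarederror",
    "lambda": 1.0,            # L2 regularization of leaf weights (tree penalization)
    "alpha": 0.0,             # L1 regularization of leaf weights
    "gamma": 0.1,             # minimum loss reduction required to make a split
    "eta": 0.1,               # learning rate (shrinkage of leaf contributions)
    "subsample": 0.8,         # row subsampling per boosting round (randomization)
    "colsample_bytree": 0.8,  # column subsampling per tree (randomization)
    "tree_method": "hist",    # histogram-based split finding for scalability
}

booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=50)
```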

The algorithm

Unlike gradient boosting, which works as gradient descent in function space, XGBoost works as Newton–Raphson in function space: a second-order Taylor approximation of the loss function is used to make the connection to the Newton–Raphson method.
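Concretely, at boosting step $m$ the loss is expanded to second order around the current model (a sketch of the standard derivation, writing $g_m$ and $h_m$ for the first and second derivatives of the loss with respect to the model's prediction):

$$L\bigl(y_i,\, \hat{f}_{(m-1)}(x_i) + f_m(x_i)\bigr) \;\approx\; L\bigl(y_i,\, \hat{f}_{(m-1)}(x_i)\bigr) + g_m(x_i)\, f_m(x_i) + \tfrac{1}{2}\, h_m(x_i)\, f_m(x_i)^2 .$$

Minimizing this quadratic with respect to $f_m(x_i)$ gives the Newton step $-g_m(x_i)/h_m(x_i)$, which is exactly the target that each new base learner is fitted to in the algorithm below.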

A generic unregularized XGBoost algorithm is:

Input: training set $\{(x_i, y_i)\}_{i=1}^N$, a differentiable loss function $L(y, F(x))$, a number of weak learners $M$ and a learning rate $\alpha$.

Algorithm:

  1. Initialize model with a constant value:
     $\hat{f}_{(0)}(x) = \underset{\theta}{\arg\min} \sum_{i=1}^N L(y_i, \theta)$
  2. For m = 1 to M:
    1. Compute the 'gradients' and 'hessians':
       $\hat{g}_m(x_i) = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = \hat{f}_{(m-1)}(x)}$
       $\hat{h}_m(x_i) = \left[\frac{\partial^2 L(y_i, f(x_i))}{\partial f(x_i)^2}\right]_{f(x) = \hat{f}_{(m-1)}(x)}$
    2. Fit a base learner (or weak learner, e.g. tree) using the training set $\left\{ \left( x_i, -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} \right) \right\}_{i=1}^N$ by solving the optimization problem below:
       $\hat{\phi}_m = \underset{\phi \in \mathbf{\Phi}}{\arg\min} \sum_{i=1}^N \frac{1}{2} \hat{h}_m(x_i) \left[ \phi(x_i) - \left( -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} \right) \right]^2$
       $\hat{f}_m(x) = \alpha \hat{\phi}_m(x)$
    3. Update the model:
       $\hat{f}_{(m)}(x) = \hat{f}_{(m-1)}(x) + \hat{f}_m(x)$
  3. Output $\hat{f}(x) = \hat{f}_{(M)}(x) = \sum_{m=0}^{M} \hat{f}_m(x)$
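For illustration, the following is a minimal, deliberately simplified sketch of this generic Newton-boosting loop for binary classification with logistic loss, using scikit-learn regression trees as base learners. It is not the actual XGBoost implementation, which adds regularization, sparsity handling, and many systems-level optimizations:

```python
# Minimal sketch of the generic (unregularized) Newton boosting loop above,
# for binary classification with logistic loss. NOT the real XGBoost code.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost(X, y, M=50, learning_rate=0.1, max_depth=3):
    # Step 1: initialize the model with a constant value (the log-odds of y).
    p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p / (1 - p))
    F = np.full(len(y), f0)
    trees = []

    for m in range(M):
        # Step 2.1: gradients and hessians of the logistic loss
        # with respect to the current predictions F.
        prob = 1.0 / (1.0 + np.exp(-F))
        g = prob - y                                   # first derivative
        h = np.clip(prob * (1 - prob), 1e-6, None)     # second derivative

        # Step 2.2: fit a base learner to the Newton step -g/h,
        # weighting each example by its hessian.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, -g / h, sample_weight=h)

        # Step 2.3: update the model with a shrunken step.
        F += learning_rate * tree.predict(X)
        trees.append(tree)

    return f0, trees

def predict_logits(f0, trees, X, learning_rate=0.1):
    # Sum the constant initialization and all shrunken tree contributions.
    F = np.full(X.shape[0], f0)
    for tree in trees:
        F += learning_rate * tree.predict(X)
    return F
```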

Awards

  - John Chambers Award [20]
  - High Energy Physics meets Machine Learning award (HEP meets ML) [21]

See also


References

  1. "Release 2.0.3". 19 December 2023. Retrieved 19 December 2023.
  2. "GitHub project webpage". GitHub . June 2022. Archived from the original on 2021-04-01. Retrieved 2016-04-05.
  3. "Python Package Index PYPI: xgboost". Archived from the original on 2017-08-23. Retrieved 2016-08-01.
  4. "CRAN package xgboost". Archived from the original on 2018-10-26. Retrieved 2016-08-01.
  5. "Julia package listing xgboost". Archived from the original on 2016-08-18. Retrieved 2016-08-01.
  6. "CPAN module AI::XGBoost". Archived from the original on 2020-03-28. Retrieved 2020-02-09.
  7. "Installing XGBoost for Anaconda in Windows". IBM . Archived from the original on 2018-05-08. Retrieved 2016-08-01.
  8. "Installing XGBoost on Mac OSX". IBM . Archived from the original on 2018-05-08. Retrieved 2016-08-01.
  9. "Dask Homepage". Archived from the original on 2022-09-14. Retrieved 2021-07-15.
  10. "Distributed XGBoost with Dask — xgboost 1.5.0-dev documentation". xgboost.readthedocs.io. Archived from the original on 2022-06-04. Retrieved 2021-07-15.
  11. "XGBoost - ML winning solutions (incomplete list)". GitHub. Archived from the original on 2017-08-24. Retrieved 2016-08-01.
  12. "Story and Lessons behind the evolution of XGBoost". Archived from the original on 2016-08-07. Retrieved 2016-08-01.
  13. "Rabit - Reliable Allreduce and Broadcast Interface". GitHub . Archived from the original on 2018-06-11. Retrieved 2016-08-01.
  14. "XGBoost4J". Archived from the original on 2018-05-08. Retrieved 2016-08-01.
  15. "XGBoost on FPGAs". GitHub . Archived from the original on 2020-09-13. Retrieved 2019-08-01.
  16. Chen, Tianqi; Guestrin, Carlos (2016). "XGBoost: A Scalable Tree Boosting System". In Krishnapuram, Balaji; Shah, Mohak; Smola, Alexander J.; Aggarwal, Charu C.; Shen, Dou; Rastogi, Rajeev (eds.). Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM. pp. 785–794. arXiv:1603.02754. doi:10.1145/2939672.2939785. ISBN 9781450342322. S2CID 4650265.
  17. Gandhi, Rohith (2019-05-24). "Gradient Boosting and XGBoost". Medium. Archived from the original on 2020-03-28. Retrieved 2020-01-04.
  18. "Boosting algorithm: XGBoost". Towards Data Science. 2017-05-14. Archived from the original on 2022-04-06. Retrieved 2020-01-04.
  19. "Tree Boosting With XGBoost – Why Does XGBoost Win "Every" Machine Learning Competition?". Synced. 2017-10-22. Archived from the original on 2020-03-28. Retrieved 2020-01-04.
  20. "John Chambers Award Previous Winners". Archived from the original on 2017-07-31. Retrieved 2016-08-01.
  21. "HEP meets ML Award". Archived from the original on 2018-05-08. Retrieved 2016-08-01.