XGBoost

Developer(s): The XGBoost Contributors
Initial release: March 27, 2014
Stable release: 2.0.3 [1] / 19 December 2023
Repository
Written in: C++
Operating system: Linux, macOS, Microsoft Windows
Type: Machine learning
License: Apache License 2.0
Website: xgboost.ai

XGBoost [2] (eXtreme Gradient Boosting) is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, [3] R, [4] Julia, [5] Perl, [6] and Scala. It works on Linux, Microsoft Windows, [7] and macOS. [8] From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask. [9] [10]
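
A minimal sketch of the library's core Python workflow, using the package's DMatrix/train interface (the toy data and parameter values are illustrative assumptions, not recommendations):

    import numpy as np
    import xgboost as xgb

    # Toy binary-classification data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Native interface: wrap the data in a DMatrix, then call train()
    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
    booster = xgb.train(params, dtrain, num_boost_round=100)

    print(booster.predict(xgb.DMatrix(X[:5])))  # predicted probabilities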

XGBoost gained much popularity and attention in the mid-2010s as the algorithm of choice for many winning teams of machine learning competitions. [11]

History

XGBoost initially started as a research project by Tianqi Chen [12] as part of the Distributed (Deep) Machine Learning Community (DMLC) group. It began as a terminal application which could be configured using a libsvm configuration file, and became well known in ML competition circles after its use in the winning solution of the Higgs Machine Learning Challenge. Soon after, the Python and R packages were built, and XGBoost now has package implementations for Java, Scala, Julia, Perl, and other languages. This brought the library to more developers and contributed to its popularity among the Kaggle community, where it has been used for a large number of competitions. [11]

It was soon integrated with a number of other packages, making it easier to use in their respective communities. It has been integrated with scikit-learn for Python users and with the caret package for R users. It can also be integrated into data-flow frameworks like Apache Spark, Apache Hadoop, and Apache Flink using the abstracted Rabit [13] and XGBoost4J. [14] XGBoost is also available on OpenCL for FPGAs. [15] An efficient, scalable implementation of XGBoost has been published by Tianqi Chen and Carlos Guestrin. [16]
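
For instance, the scikit-learn integration exposes XGBoost through the familiar estimator interface. A minimal sketch (the dataset and parameters are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The wrapper follows scikit-learn's fit/predict/score convention
    clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on the held-out split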

While the XGBoost model often achieves higher accuracy than a single decision tree, it sacrifices the intrinsic interpretability of decision trees. For example, following the path that a decision tree takes to make its decision is trivial and self-explanatory, but following the paths of hundreds or thousands of trees is much harder.
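
The text dump of a trained booster illustrates the point: any single tree is readable, but a trained model typically contains hundreds. A small sketch using the Python package's get_dump method (the toy data is an illustrative assumption):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] > 0).astype(int)

    booster = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                        xgb.DMatrix(X, label=y), num_boost_round=300)

    dump = booster.get_dump()  # one text representation per tree
    print(len(dump))           # 300 trees to trace, not one
    print(dump[0])             # the first tree alone is easy to follow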

Features

Salient features of XGBoost which make it different from other gradient boosting algorithms include: [17] [18] [19] [16]

  - Clever penalization of trees
  - Proportional shrinking of leaf nodes
  - Newton boosting
  - An extra randomization parameter
  - Implementation on single machines, distributed systems, and out-of-core computation
  - Automatic feature selection
  - Theoretically justified weighted quantile sketching for efficient computation
  - Parallel tree structure boosting with sparsity awareness
  - An efficient, cacheable block structure for decision tree training
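
Several of these features surface directly as training parameters, as the sketch below shows. The parameter names are documented XGBoost options; the values are illustrative assumptions, not tuned recommendations:

    import xgboost as xgb

    # lambda/alpha/gamma penalize tree complexity, eta shrinks each step,
    # and subsample/colsample_bytree add extra randomization.
    params = {
        "objective": "reg:squarederror",
        "lambda": 1.0,            # L2 penalty on leaf weights
        "alpha": 0.1,             # L1 penalty on leaf weights
        "gamma": 0.5,             # minimum loss reduction to make a split
        "eta": 0.1,               # learning-rate shrinkage
        "subsample": 0.8,         # row subsampling per tree
        "colsample_bytree": 0.8,  # column subsampling per tree
        "tree_method": "hist",    # histogram/quantile-sketch split finding
    }
    # Passed to xgb.train() as in the earlier example.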

The algorithm

XGBoost works as Newton–Raphson in function space, unlike gradient boosting, which works as gradient descent in function space; a second-order Taylor approximation of the loss function is used to make the connection to the Newton–Raphson method.
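
Concretely, at each step the loss is expanded to second order around the current model $\hat{f}_{(m-1)}$ (a standard derivation; $g$ and $h$ match the gradients and hessians defined in the algorithm below):

$$L\left(y,\ \hat{f}_{(m-1)}(x) + \phi(x)\right) \approx L\left(y,\ \hat{f}_{(m-1)}(x)\right) + g(x)\,\phi(x) + \tfrac{1}{2}\,h(x)\,\phi(x)^{2}$$

Minimizing this quadratic in $\phi$ yields the weighted least-squares problem of step 2.2.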

A generic unregularized XGBoost algorithm is:

Input: training set $\{(x_i, y_i)\}_{i=1}^{N}$, a differentiable loss function $L(y, F(x))$, a number of weak learners $M$ and a learning rate $\alpha$.

Algorithm:

  1. Initialize model with a constant value:
     $$\hat{f}_{(0)}(x) = \underset{\theta}{\arg\min} \sum_{i=1}^{N} L(y_i, \theta)$$
  2. For $m = 1$ to $M$:
    1. Compute the 'gradients' and 'hessians':
       $$\hat{g}_m(x_i) = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = \hat{f}_{(m-1)}(x)}, \qquad \hat{h}_m(x_i) = \left[ \frac{\partial^2 L(y_i, f(x_i))}{\partial f(x_i)^2} \right]_{f(x) = \hat{f}_{(m-1)}(x)}$$
    2. Fit a base learner (or weak learner, e.g. tree) using the training set $\left\{ \left( x_i, -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} \right) \right\}_{i=1}^{N}$ by solving the optimization problem below:
       $$\hat{\phi}_m = \underset{\phi \in \mathbf{\Phi}}{\arg\min} \sum_{i=1}^{N} \frac{1}{2} \hat{h}_m(x_i) \left[ \phi(x_i) - \left( -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} \right) \right]^2,$$
       $$\hat{f}_m(x) = \alpha \hat{\phi}_m(x)$$
    3. Update the model:
       $$\hat{f}_{(m)}(x) = \hat{f}_{(m-1)}(x) + \hat{f}_m(x)$$
  3. Output:
     $$\hat{f}(x) = \hat{f}_{(M)}(x) = \sum_{m=0}^{M} \hat{f}_m(x)$$
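
The recursion above translates almost directly into code. The following is a minimal sketch for squared-error loss $L(y, f) = \tfrac{1}{2}(y - f)^2$, so that $g = f - y$ and $h = 1$, with scikit-learn regression trees standing in for the base learners; the function names are illustrative and this is not XGBoost's actual implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def newton_boost(X, y, M=50, alpha=0.1, max_depth=3):
        """Sketch of unregularized Newton boosting for squared-error loss."""
        f0 = y.mean()                        # step 1: constant model minimizing the loss
        pred = np.full(len(y), f0)
        trees = []
        for _ in range(M):                   # step 2: boosting iterations
            g = pred - y                     # 2.1: gradients at the current model
            h = np.ones(len(y))              # 2.1: hessians (constant for squared error)
            # 2.2: weighted least-squares fit of a tree to -g/h with weights h
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, -g / h, sample_weight=h)
            pred += alpha * tree.predict(X)  # 2.3: update the model
            trees.append(tree)
        return f0, trees

    def predict(f0, trees, X, alpha=0.1):
        """Step 3: sum of the constant model and all fitted learners
        (alpha must match the value used in training)."""
        return f0 + alpha * sum(t.predict(X) for t in trees)

    # Usage on toy data
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
    f0, trees = newton_boost(X, y)
    print(predict(f0, trees, X[:5]))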

Awards

  - John Chambers Award (2016) [20]
  - High Energy Physics meets Machine Learning award (HEP meets ML) (2016) [21]

See also

  - LightGBM
  - CatBoost
  - scikit-learn


References

  1. "Release 2.0.3". 19 December 2023. Retrieved 19 December 2023.
  2. "GitHub project webpage". GitHub . June 2022. Archived from the original on 2021-04-01. Retrieved 2016-04-05.
  3. "Python Package Index PYPI: xgboost". Archived from the original on 2017-08-23. Retrieved 2016-08-01.
  4. "CRAN package xgboost". Archived from the original on 2018-10-26. Retrieved 2016-08-01.
  5. "Julia package listing xgboost". Archived from the original on 2016-08-18. Retrieved 2016-08-01.
  6. "CPAN module AI::XGBoost". Archived from the original on 2020-03-28. Retrieved 2020-02-09.
  7. "Installing XGBoost for Anaconda in Windows". IBM . Archived from the original on 2018-05-08. Retrieved 2016-08-01.
  8. "Installing XGBoost on Mac OSX". IBM . Archived from the original on 2018-05-08. Retrieved 2016-08-01.
  9. "Dask Homepage". Archived from the original on 2022-09-14. Retrieved 2021-07-15.
  10. "Distributed XGBoost with Dask — xgboost 1.5.0-dev documentation". xgboost.readthedocs.io. Archived from the original on 2022-06-04. Retrieved 2021-07-15.
  11. 1 2 "XGBoost - ML winning solutions (incomplete list)". GitHub . Archived from the original on 2017-08-24. Retrieved 2016-08-01.
  12. "Story and Lessons behind the evolution of XGBoost". Archived from the original on 2016-08-07. Retrieved 2016-08-01.
  13. "Rabit - Reliable Allreduce and Broadcast Interface". GitHub . Archived from the original on 2018-06-11. Retrieved 2016-08-01.
  14. "XGBoost4J". Archived from the original on 2018-05-08. Retrieved 2016-08-01.
  15. "XGBoost on FPGAs". GitHub . Archived from the original on 2020-09-13. Retrieved 2019-08-01.
  16. 1 2 Chen, Tianqi; Guestrin, Carlos (2016). "XGBoost: A Scalable Tree Boosting System". In Krishnapuram, Balaji; Shah, Mohak; Smola, Alexander J.; Aggarwal, Charu C.; Shen, Dou; Rastogi, Rajeev (eds.). Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM. pp. 785–794. arXiv: 1603.02754 . doi:10.1145/2939672.2939785. ISBN   9781450342322. S2CID   4650265.
  17. Gandhi, Rohith (2019-05-24). "Gradient Boosting and XGBoost". Medium. Archived from the original on 2020-03-28. Retrieved 2020-01-04.
  18. "Boosting algorithm: XGBoost". Towards Data Science. 2017-05-14. Archived from the original on 2022-04-06. Retrieved 2020-01-04.
  19. "Tree Boosting With XGBoost – Why Does XGBoost Win "Every" Machine Learning Competition?". Synced. 2017-10-22. Archived from the original on 2020-03-28. Retrieved 2020-01-04.
  20. "John Chambers Award Previous Winners". Archived from the original on 2017-07-31. Retrieved 2016-08-01.
  21. "HEP meets ML Award". Archived from the original on 2018-05-08. Retrieved 2016-08-01.