LightGBM

Original author(s): Guolin Ke [1] / Microsoft Research
Developer(s): Microsoft and LightGBM contributors [2]
Initial release: 2016
Stable release: v4.3.0 / January 15, 2024 [3]
Repository: github.com/microsoft/LightGBM
Written in: C++, Python, R, C
Operating system: Windows, macOS, Linux
Type: Machine learning, gradient-boosting framework
License: MIT License
Website: lightgbm.readthedocs.io

LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. [4] [5] It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

Overview

The LightGBM framework supports different algorithms, including GBT, GBDT, GBRT, GBM, MART [6] [7] and RF. [8] LightGBM has many of XGBoost's advantages, including sparse optimization, parallel training, multiple loss functions, regularization, bagging, and early stopping. A major difference between the two lies in the construction of trees. LightGBM does not grow a tree level-wise (row by row) as most other implementations do. [9] Instead it grows trees leaf-wise: at each step it splits the leaf it believes will yield the largest decrease in loss. [10] In addition, LightGBM does not use the widely adopted pre-sorted decision tree learning algorithm, which searches for the best split point on sorted feature values, [11] as XGBoost and other implementations do. Instead, LightGBM implements a highly optimized histogram-based decision tree learning algorithm, which yields substantial advantages in both training speed and memory consumption. [12] The LightGBM algorithm utilizes two novel techniques, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), which allow the algorithm to run faster while maintaining a high level of accuracy. [13]
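
As a brief illustration, the following sketch (assuming the LightGBM Python package and synthetic data; all variable names are illustrative) trains a binary classifier through the package's native API. The num_leaves parameter caps leaf-wise tree growth, and max_bin sets the number of histogram bins per feature:

    # A minimal training sketch using the LightGBM Python package
    # ("pip install lightgbm"); the data here are synthetic.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy binary target

    train_set = lgb.Dataset(X, label=y)
    params = {
        "objective": "binary",
        "num_leaves": 31,        # caps leaf-wise tree growth
        "max_bin": 255,          # histogram bins per feature
        "learning_rate": 0.1,
    }
    model = lgb.train(params, train_set, num_boost_round=100)
    probabilities = model.predict(X)            # predicted class probabilities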

LightGBM works on Linux, Windows, and macOS and supports C++, Python, [14] R, and C#. [15] The source code is licensed under the MIT License and is available on GitHub. [16]

Gradient-based one-side sampling

Gradient-based one-side sampling (GOSS) is a method that leverages the fact that data instances have no native weight in GBDT. Since data instances with different gradients play different roles in the computation of information gain, instances with larger gradients contribute more to the information gain. To retain the accuracy of the information gain estimate, GOSS keeps all instances with large gradients and randomly samples only a fraction of the instances with small gradients, compensating by up-weighting the sampled small-gradient instances when the information gain is computed. [13]
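
The sampling step can be sketched as follows (a simplified NumPy illustration of the procedure described in the paper, not LightGBM's internal C++ implementation; a and b denote the large- and small-gradient sampling ratios):

    # Schematic sketch of GOSS sampling; illustration only.
    import numpy as np

    def goss_sample(gradients, a=0.2, b=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n = len(gradients)
        order = np.argsort(-np.abs(gradients))   # sort by |gradient|, descending
        top = order[: int(a * n)]                # keep every large-gradient instance
        rest = order[int(a * n):]
        sampled = rng.choice(rest, size=int(b * n), replace=False)
        idx = np.concatenate([top, sampled])
        weights = np.ones(len(idx))
        weights[len(top):] = (1.0 - a) / b       # up-weight sampled small-gradient instances
        return idx, weights                      # weights enter the information-gain computation

    idx, weights = goss_sample(np.random.default_rng(1).normal(size=1000))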

Exclusive feature bundling

Exclusive feature bundling (EFB) is a near-lossless method to reduce the number of effective features. In a sparse feature space, many features are nearly exclusive, meaning they rarely take nonzero values simultaneously; one-hot encoded features are a typical example. EFB bundles such features together, reducing dimensionality and improving efficiency while maintaining a high level of accuracy. The single feature formed by merging a set of exclusive features is called an exclusive feature bundle. [13]
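
The merging step can be sketched as follows (a simplified illustration assuming strictly exclusive, already histogram-binned features; not LightGBM's internal implementation). Each feature's nonzero bins are shifted by an offset so that they occupy a distinct range in the bundled feature:

    # Simplified sketch of merging mutually exclusive binned features
    # into a single bundled feature by offsetting their bin values.
    import numpy as np

    def merge_bundle(binned_features, bins_per_feature):
        # binned_features: integer arrays where 0 means "feature absent"
        bundled = np.zeros(len(binned_features[0]), dtype=np.int64)
        offset = 0
        for feature, n_bins in zip(binned_features, bins_per_feature):
            nonzero = feature > 0
            bundled[nonzero] = feature[nonzero] + offset  # distinct bin range per feature
            offset += n_bins
        return bundled

    # Two one-hot columns that are never nonzero on the same row,
    # so the bundle loses no information:
    f1 = np.array([1, 0, 0, 1])
    f2 = np.array([0, 1, 1, 0])
    print(merge_bundle([f1, f2], [1, 1]))  # [1 2 2 1]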

See also

Boosting (machine learning)
Comparison of multi-paradigm programming languages
OpenCV
Orange (software)
Cython
Data binning
Gradient boosting
scikit-learn
Feature engineering
TensorFlow
Comparison of deep learning software
XGBoost
Open edX
Outline of machine learning
ML.NET
Dask (software)
scikit-multiflow
Neural Network Intelligence
CatBoost

References

  1. "Guolin Ke". GitHub .
  2. "microsoft/LightGBM". GitHub. 7 July 2022.
  3. "Releases · microsoft/LightGBM". GitHub.
  4. Brownlee, Jason (March 31, 2020). "Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost".
  5. Kopitar, Leon; Kocbek, Primoz; Cilar, Leona; Sheikh, Aziz; Stiglic, Gregor (July 20, 2020). "Early detection of type 2 diabetes mellitus using machine learning-based prediction models". Scientific Reports. 10 (1): 11981. Bibcode:2020NatSR..1011981K. doi:10.1038/s41598-020-68771-z. PMC 7371679. PMID 32686721.
  6. "Understanding LightGBM Parameters (and How to Tune Them)". neptune.ai. May 6, 2020.
  7. "An Overview of LightGBM". avanwyk. May 16, 2018.
  8. "Parameters — LightGBM 3.0.0.99 documentation". lightgbm.readthedocs.io.
  9. "The Gradient Boosters IV: LightGBM". Deep & Shallow.
  10. Ye, Andre (September 2020). "XGBoost, LightGBM, and Other Kaggle Competition Favorites". Towards Data Science.
  11. Mehta, Manish; Agrawal, Rakesh; Rissanen, Jorma (1996). "SLIQ: A fast scalable classifier for data mining". International Conference on Extending Database Technology: 18–32. CiteSeerX 10.1.1.89.7734.
  12. "Features — LightGBM 3.1.0.99 documentation". lightgbm.readthedocs.io.
  13. Ke, Guolin; Meng, Qi; Finley, Thomas; Wang, Taifeng; Chen, Wei; Ma, Weidong; Ye, Qiwei; Liu, Tie-Yan (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". Advances in Neural Information Processing Systems. 30.
  14. "lightgbm: LightGBM Python Package". 7 July 2022 via PyPI.
  15. "Microsoft.ML.Trainers.LightGbm Namespace". docs.microsoft.com.
  16. "microsoft/LightGBM". October 6, 2020 via GitHub.
