CatBoost

CatBoost
Original author(s)	Andrey Gulin / Yandex
Developer(s)	Yandex and CatBoost Contributors
Initial release	July 18, 2017
Stable release	1.2.3 / February 23, 2024
Written in	Python, R, C++, Java
Operating system	Linux, macOS, Windows
Type	Machine learning
License	Apache License 2.0
Website	catboost.ai

Last updated April 24, 2024

CatBoost^[6] is an open-source software library developed by Yandex. It provides a gradient boosting framework which, among other features, attempts to handle categorical features using a permutation-driven alternative to the classical algorithm.^[7] It runs on Linux, Windows, and macOS, and is available in Python^[8] and R.^[9] Models built using CatBoost can be used for predictions in C++, Java,^[10] C#, Rust, Core ML, ONNX, and PMML. The source code is licensed under the Apache License and is available on GitHub.^[6]
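The permutation-driven idea can be illustrated with a small sketch of ordered target statistics, loosely following the description in the paper cited as [7]: each row's categorical value is replaced by a smoothed target average computed only from rows that precede it in a random permutation, so a row never sees its own label. This is a simplified illustration in plain Python, not CatBoost's implementation; the function name, the parameters `prior`, `prior_weight` and `permutation`, and the toy data are all assumptions made for the example.

```python
import random


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0,
                              permutation=None, seed=0):
    """Encode one categorical column using only rows earlier in a permutation.

    Hypothetical helper for illustration; not part of the CatBoost API.
    """
    n_rows = len(categories)
    if permutation is None:
        # Assumption: a single random permutation of the row indices.
        permutation = list(range(n_rows))
        random.Random(seed).shuffle(permutation)

    target_sums = {}  # running sum of targets seen so far, per category
    row_counts = {}   # running number of rows seen so far, per category
    encoded = [0.0] * n_rows
    for idx in permutation:
        cat = categories[idx]
        s = target_sums.get(cat, 0.0)
        n = row_counts.get(cat, 0)
        # The row is encoded before its own target joins the running totals.
        encoded[idx] = (s + prior_weight * prior) / (n + prior_weight)
        target_sums[cat] = s + targets[idx]
        row_counts[cat] = n + 1
    return encoded


# Toy data (made up): a colour feature and a binary label.
colors = ["red", "blue", "red", "red", "blue"]
clicked = [1, 0, 1, 0, 1]
prior = sum(clicked) / len(clicked)  # global mean used as the smoothing prior

# Using the identity permutation makes the result easy to follow by hand.
encoded = ordered_target_statistics(colors, clicked, prior,
                                    permutation=[0, 1, 2, 3, 4])
print(encoded)
```

The first row of each category receives the prior alone, and later rows blend the prior with the labels seen so far. A plain per-category target mean would instead leak each row's own label into its encoding.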

Features

CatBoost has gained popularity compared to other gradient boosting algorithms primarily due to the following features:^[15]

Native handling of categorical features^[16]
Fast GPU training^[17]
Visualizations and tools for model and feature analysis
Oblivious (symmetric) trees for faster execution
Ordered boosting to overcome overfitting^[7]
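As a rough illustration of why oblivious (symmetric) trees evaluate quickly: every node on a given level applies the same split, so a tree of depth d is fully described by d (feature, threshold) pairs and 2^d leaf values, and prediction reduces to building a d-bit leaf index. The sketch below assumes numeric features and one particular bit order; the function name and the data are hypothetical, and this is not CatBoost's internal code.

```python
def oblivious_tree_predict(row, splits, leaf_values):
    """Evaluate one oblivious tree on one row.

    splits      -- one (feature_index, threshold) pair per tree level
    leaf_values -- 2 ** len(splits) values, indexed by the comparison bits
    """
    leaf = 0
    for feature_index, threshold in splits:
        # Every node on this level asks the same question, so one comparison
        # per level is enough; its outcome becomes the next bit of the index.
        bit = 1 if row[feature_index] > threshold else 0
        leaf = (leaf << 1) | bit
    return leaf_values[leaf]


# A depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [-1.0, -0.5, 0.5, 1.0]

print(oblivious_tree_predict([0.7, 3.0], splits, leaf_values))
```

Because there is no data-dependent branching on which node to visit next, the same loop can be applied to many rows at once, which is one reason the structure suits vectorised CPU and GPU execution.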

History

In 2009, Andrey Gulin developed MatrixNet, a proprietary gradient boosting library that was used at Yandex to rank search results. Since 2009, MatrixNet has been used in various Yandex projects, including recommendation systems and weather prediction.

In 2014–2015, Andrey Gulin and a team of researchers started a new project called Tensornet, aimed at solving the problem of "how to work with categorical data". It resulted in several proprietary gradient boosting libraries with different approaches to handling categorical data.

In 2016, the Machine Learning Infrastructure team led by Anna Dorogush started working on gradient boosting at Yandex, including MatrixNet and Tensornet. They implemented and open-sourced the next version of the gradient boosting library, called CatBoost, which supports categorical and text data, GPU training, and model analysis and visualisation tools.

CatBoost was open-sourced in July 2017 and is under active development at Yandex and in the open-source community.

Application

JetBrains uses CatBoost for code completion^[18]
Cloudflare uses CatBoost for bot detection^[19]
Careem uses CatBoost to predict the future destinations of rides^[20]

Related Research Articles

OpenCV is a library of programming functions mainly for real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage, then Itseez. The library is cross-platform and licensed as free and open-source software under Apache License 2. Starting in 2011, OpenCV features GPU acceleration for real-time operations.

Orange (software): Open-source data analysis software

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative qualitative data analysis and interactive data visualization.

scikit-learn is a free and open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.

Torch is an open-source machine learning library, a scientific computing framework, and a scripting language based on Lua. It provides LuaJIT interfaces to deep learning algorithms implemented in C. It was created by the Idiap Research Institute at EPFL. Torch development moved in 2017 to PyTorch, a port of the library to Python.

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.

Comparison of deep learning software is a list article that compares notable software frameworks, libraries and computer programs for deep learning.

XGBoost is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Microsoft Windows, and macOS. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask.

Keras is an open-source library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.

Caffe is a deep learning framework, originally developed at University of California, Berkeley. It is open source, under a BSD license. It is written in C++, with a Python interface.

PyTorch is a machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is recognized as one of the two most popular machine learning libraries alongside TensorFlow, offering free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface.

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP/Message Passing Interface (MPI), and OpenCL.

ML.NET: Machine learning library

ML.NET is a free software machine learning library for the C# and F# programming languages. It also supports Python models when used together with NimbusML. The preview release of ML.NET included transforms for feature engineering like n-gram creation, and learners to handle binary classification, multi-class classification, and regression tasks. Additional ML tasks like anomaly detection and recommendation systems have since been added, and other approaches like deep learning will be included in future versions.

In computer vision, SqueezeNet is the name of a deep neural network for image classification that was released in 2016. SqueezeNet was developed by researchers at DeepScale, University of California, Berkeley, and Stanford University. In designing SqueezeNet, the authors' goal was to create a smaller neural network with fewer parameters while achieving competitive accuracy.

Dask is an open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.

Kubeflow is an open-source platform for machine learning and MLOps on Kubernetes introduced by Google. The different stages in a typical machine learning lifecycle are represented with different software components in Kubeflow, including model development (Kubeflow Notebooks), model training (Kubeflow Pipelines, Kubeflow Training Operator), model serving (KServe), and automated machine learning (Katib).

DeepSpeed is an open source deep learning optimization library for PyTorch. The library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. DeepSpeed is optimized for low latency, high throughput training. It includes the Zero Redundancy Optimizer (ZeRO) for training models with 1 trillion or more parameters. Features include mixed precision training, single-GPU, multi-GPU, and multi-node training as well as custom model parallelism. The DeepSpeed source code is licensed under MIT License and available on GitHub.

NNI is a free and open-source AutoML toolkit developed by Microsoft. It is used to automate feature engineering, model compression, neural architecture search, and hyper-parameter tuning.

LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

CuPy is an open source library for GPU-accelerated computing with Python programming language, providing support for multi-dimensional arrays, sparse matrices, and a variety of numerical algorithms implemented on top of them. CuPy shares the same API set as NumPy and SciPy, allowing it to be a drop-in replacement to run NumPy/SciPy code on GPU. CuPy supports Nvidia CUDA GPU platform, and AMD ROCm GPU platform starting in v9.0.

References

[1] "Andrey Gulin - People - Research at Yandex". research.yandex.com.
[2] "catboost/catboost". GitHub.
[3] "Yandex open sources CatBoost, a gradient boosting machine learning library". TechCrunch. Retrieved 2020-08-30.
[4] Yegulalp, Serdar (2017-07-18). "Yandex open sources CatBoost machine learning library". InfoWorld. Retrieved 2020-08-30.
[5] "Releases · catboost/catboost". GitHub. Retrieved 2024-03-14.
[6] "catboost/catboost". August 30, 2020 – via GitHub.
[7] Prokhorenkova, Liudmila; Gusev, Gleb; Vorobev, Aleksandr; Dorogush, Anna Veronika; Gulin, Andrey (2019-01-20). "CatBoost: unbiased boosting with categorical features". arXiv:1706.09516 [cs.LG].
[8] "Python Package Index PYPI: catboost". Retrieved 2020-08-20.
[9] "Conda-forge package catboost-r". Retrieved 2020-08-30.
[10] "Maven Repository: ai.catboost » catboost-prediction". mvnrepository.com. Retrieved 2020-08-30.
[11] InfoWorld staff (27 September 2017). "Bossie Awards 2017: The best machine learning tools". InfoWorld.
[12] "State of Data Science and Machine Learning 2020".
[13] "State of Data Science and Machine Learning 2021".
[14] "PyPI Stats catboost". PyPI Stats.
[15] Joseph, Manu (2020-02-29). "The Gradient Boosters V: CatBoost". Deep & Shallow. Retrieved 2020-08-30.
[16] Dorogush, Anna Veronika; Ershov, Vasily; Gulin, Andrey (2018-10-24). "CatBoost: gradient boosting with categorical features support". arXiv:1810.11363 [cs.LG].
[17] "CatBoost Enables Fast Gradient Boosting on Decision Trees Using GPUs". NVIDIA Developer Blog. 2018-12-13. Retrieved 2020-08-30.
[18] "Code Completion, Episode 4: Model Training". JetBrains Developer Blog. 2021-08-20.
[19] "Stop the Bots: Practical Lessons in Machine Learning". The Cloudflare Blog. 2019-02-20.
[20] "How Careem's Destination Prediction Service speeds up your ride". Careem. 2019-02-19.
