HOME

TheInfoList



OR:

CatBoost is an
open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized sof ...
software library In computer science, a library is a collection of non-volatile resources used by computer programs, often for software development. These may include configuration data, documentation, help data, message templates, pre-written code and subr ...
developed by
Yandex Yandex LLC (russian: link=no, Яндекс, p=ˈjandəks) is a Russian multinational technology company providing Internet-related products and services, including an Internet search engine, information services, e-commerce, transportation, maps ...
. It provides a
gradient boosting Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision t ...
framework which among other features attempts to solve for Categorical features using a permutation driven alternative compared to the classical algorithm. It works on
Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which ...
,
Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for serv ...
,
macOS macOS (; previously OS X and originally Mac OS X) is a Unix operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac computers. Within the market of desktop and lapt ...
, and is available in
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
, R, and models built using catboost can be used for predictions in
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
,
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
, C#,
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH ...
,
Core ML iOS 11 is the iOS version history, eleventh major release of the iOS mobile operating system developed by Apple Inc., being the successor to iOS 10. It was announced at the company's Apple Worldwide Developers Conference, Worldwide Developers C ...
,
ONNX The Open Neural Network Exchange (ONNX) [] is an Open-source software, open-source artificial intelligence ecosystem of technology companies and research organizations that establish open standards for representing machine learning algorithms and ...
, and
PMML The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format conceived by Dr. Robert Lee Grossman, then the director of the National Center for Data Mining at the University of Illinois at Chicago. PMML provi ...
. The source code is licensed under Apache License and available on GitHub. ''
InfoWorld ''InfoWorld'' (abbreviated IW) is an information technology media business. Founded in 1978, it began as a monthly magazine. In 2007, it transitioned to a web-only publication. Its parent company today is International Data Group, and its siste ...
'' magazine awarded the library "The best machine learning tools" in 2017. along with
TensorFlow TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. "It is machine learning ...
,
Pytorch PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open ...
, XGBoost and 8 other libraries.
Kaggle Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with othe ...
listed CatBoost as one of the most frequently used Machine Learning (ML) frameworks in the world. It was listed as the top-8 most frequently used ML framework in the 2020 survey and as the top-7 most frequently used ML framework in the 2021 survey. As of April 2022, CatBoost is installed about 100000 times per day from
PyPI The Python Package Index, abbreviated as PyPI () and also known as the Cheese Shop (a reference to the ''Monty Python's Flying Circus'' sketch " Cheese Shop"), is the official third-party software repository for Python. It is analogous to the CP ...
repository


Features

CatBoost has gained popularity compared to other gradient boosting algorithms primarily due to the following features * Native handling for categorical features * Fast GPU training * Visualizations and tools for model and feature analysis * Using Oblivious Trees or Symmetric Trees for faster execution * Ordered Boosting to overcome overfitting


History

In 2009 Andrey Gulin, developed
MatrixNet MatrixNet is a proprietary machine learning algorithm developed by Yandex and used widely throughout the company products. The algorithm is based on gradient boosting Gradient boosting is a machine learning technique used in regression and class ...
, a proprietary gradient boosting library that was used in Yandex to rank search results. Since 2009 MatrixNet has been used in different projects in Yandex, including recommendation systems and weather prediction. In 2014-2015 Andrey Gulin with a team of researchers has started a new project called Tensornet that was aimed at solving the problem of "how to work with
categorical data In statistics, a categorical variable (also called qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or ...
". It resulted in several proprietary Gradient Boosting libraries with different approaches to handling categorical data. In 2016 Machine Learning Infrastructure team led by Anna Dorogush started working on Gradient Boosting in Yandex, including Matrixnet and Tensornet. They implemented and open-sourced the next version of Gradient Boosting library called CatBoost, which has support of categorical and text data, GPU training, model analysis, visualisation tools. CatBoost was open-sourced in July 2017 and is under active development in Yandex and the open-source community.


Application

*
JetBrains JetBrains s.r.o. (formerly IntelliJ Software s.r.o.) is a Czech software development company which makes tools for software developers and project managers. , the company has offices in Prague; Munich; Berlin; Boston, Massachusetts; Amsterda ...
uses CatBoost for Code Completion *
Cloudflare Cloudflare, Inc. is an American content delivery network and DDoS mitigation company, founded in 2009. It primarily acts as a reverse proxy between a website's visitor and the Cloudflare customer's hosting provider. Its headquarters are in San ...
uses CatBoost for bot detection *
Careem Careem is a Dubai-based super app with operations in over 100 cities, covering 12 countries across the Middle East, Africa, and South Asia regions. The company, which was valued at over billion in 2018, became a wholly-owned subsidiary of Uber ...
uses CatBoost to predict future destinations of the rides


See also

*
Machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
*
Gradient boosting Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision t ...
* XGBoost *
LightGBM LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classifi ...
*
scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector m ...


References

{{Reflist


External links


CatBoost

GitHub - catboost/catboost

CatBoost - Yandex Technology
Applications of artificial intelligence Free and open-source software Open-source artificial intelligence Software using the Apache license Python (programming language) scientific libraries Applied machine learning Data mining and machine learning software Free data analysis software Free software programmed in C++ 2017 software Yandex software