scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.
Overview
The scikit-learn project started as scikits.learn, a Google Summer of Code project by French data scientist David Cournapeau. The name of the project derives from its role as a "scientific toolkit for machine learning", originally developed and distributed as a third-party extension to SciPy. The original codebase was later rewritten by other developers. Contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the French Institute for Research in Computer Science and Automation in Saclay, France, took leadership of the project and released the first public version of the library on February 1, 2010. In November 2012, scikit-learn and scikit-image were both described as "well-maintained and popular". In 2019, it was noted that scikit-learn is one of the most popular machine learning libraries on GitHub.
Features
* Large catalogue of well-established machine learning algorithms and data pre-processing methods (i.e. feature engineering)
* Utility methods for common data-science tasks, such as splitting data into train and test sets, cross-validation and grid search
* Consistent way of running machine learning models ( and ), which libraries can implement
* Declarative way of structuring a data science process (the ), including data pre-processing and model fitting
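The features listed above can be sketched in a few lines. The following is a minimal illustration, not a canonical recipe: the dataset (iris), the choice of logistic regression, the step names "scale" and "model", and the parameter grid are all assumptions made for this example. The helpers used (train_test_split, cross_val_score, GridSearchCV, Pipeline, StandardScaler) are part of scikit-learn's public API.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain data pre-processing and model fitting into a single estimator.
# The step names "scale" and "model" are arbitrary labels chosen here.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# 5-fold cross-validation of the whole pipeline on the training set.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Grid search over the regularisation strength of the "model" step.
search = GridSearchCV(pipeline, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Accuracy of the best configuration on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(cv_scores.mean(), search.best_params_, test_accuracy)
```

Because the scaler sits inside the pipeline, it is refit on each training fold during cross-validation and grid search, so no statistics from the validation folds leak into pre-processing.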
Examples
Fitting a random forest classifier:
>>> from sklearn.ensemble import RandomForestClassifier
>>> classifier = RandomForestClassifier(random_state=0)
>>> X = [[1, 2, 3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> classifier.fit(X, y)
RandomForestClassifier(random_state=0)
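Once fitted, a classifier can be used to predict classes for new samples. The sketch below is self-contained and uses toy data in the spirit of the example above; the values in new_samples are hypothetical and chosen only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data: 2 samples, 3 features, one class label per sample.
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X, y)

# Hypothetical unseen samples with the same number of features.
new_samples = [[4, 5, 6], [14, 15, 16]]

# One predicted class label per sample.
predictions = classifier.predict(new_samples)

# One row per sample, one column per class; each row sums to 1.
probabilities = classifier.predict_proba(new_samples)

print(predictions)
print(probabilities)
```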
Implementation
scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in Cython to improve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR. In such cases, extending these methods with Python may not be possible.
scikit-learn integrates well with many other Python libraries, such as Matplotlib and plotly for plotting, NumPy for array vectorization, Pandas dataframes, SciPy, and many more.
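As a rough illustration of the points above, the sketch below fits two support vector estimators on the same data, once passed as a NumPy array and once as a SciPy sparse matrix. The choice of SVC and LinearSVC (which scikit-learn's documentation describes as based on LIBSVM and LIBLINEAR respectively), the synthetic data and the parameter values are assumptions made for this example.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC, LinearSVC

# Deterministic synthetic data: 20 samples, 3 features.
X = np.linspace(-1.0, 1.0, 60).reshape(20, 3)
# Label a sample 1 when its features sum to a positive value.
y = (X.sum(axis=1) > 0).astype(int)

# Kernel SVM, fitted directly on a NumPy array.
svc = SVC(kernel="rbf")
svc.fit(X, y)

# Linear SVM, fitted on the same data stored as a SciPy sparse matrix.
X_sparse = sparse.csr_matrix(X)
linear_svc = LinearSVC(dual=False)
linear_svc.fit(X_sparse, y)

# Predictions come back as NumPy arrays in both cases.
svc_predictions = svc.predict(X)
linear_predictions = linear_svc.predict(X_sparse)

print(type(svc_predictions), svc_predictions.shape)
print(linear_svc.coef_.shape)
```

The fitted attributes, such as coef_ on the linear model, are also plain NumPy arrays, which is what makes results easy to hand to plotting or data-frame libraries.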
Version history
scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as a part of his thesis work. In 2010, INRIA, the French Institute for Research in Computer Science and Automation, got involved and the first public release (v0.1 beta) was published in late January 2010.
* August 2013. scikit-learn 0.14
* July 2014. scikit-learn 0.15.0
* March 2015. scikit-learn 0.16.0
* November 2015. scikit-learn 0.17.0
* September 2016. scikit-learn 0.18.0
* July 2017. scikit-learn 0.19.0
* September 2018. scikit-learn 0.20.0
* May 2019. scikit-learn 0.21.0
* December 2019. scikit-learn 0.22
* May 2020. scikit-learn 0.23.0
* January 2021. scikit-learn 0.24
* September 2021. scikit-learn 1.0.0
* October 2021. scikit-learn 1.0.1
* December 2021. scikit-learn 1.0.2
* May 2022. scikit-learn 1.1.0
* May 2022. scikit-learn 1.1.1
* August 2022. scikit-learn 1.1.2
* October 2022. scikit-learn 1.1.3
* December 2022. scikit-learn 1.2.0
* January 2023. scikit-learn 1.2.1
* March 2023. scikit-learn 1.2.2
Awards
* 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize
* 2022 Open Science Award for Open Source Research Software
scikit-learn alternatives
* mlpy
* SpaCy
* NLTK
* Orange
* PyTorch
* TensorFlow
* JAX
* Infer.NET
* List of numerical analysis software