scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. scikit-learn is a NumFOCUS fiscally sponsored project.
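
As a rough illustration of how the library's estimators operate on NumPy arrays, the following minimal sketch fits one classifier (a random forest) and one clustering algorithm (k-means) to the same data. The data set is synthetic and invented for this example, and the parameter values (`n_estimators=20`, `n_clusters=2`, the random seeds) are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Two well-separated synthetic blobs (made-up data, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Supervised learning: a random forest is fitted on labeled NumPy arrays.
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)

# Unsupervised learning: k-means groups the same points without using labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_

print("training accuracy:", train_accuracy)
print("cluster sizes:", np.bincount(cluster_labels))
```

Both estimators follow the same pattern: construct the object with its hyperparameters, then call `fit` on array data.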


Overview

The scikit-learn project started as scikits.learn, a Google Summer of Code project by French data scientist David Cournapeau. The name of the project stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the French Institute for Research in Computer Science and Automation in Saclay, France, took leadership of the project and released the first public version of the library on February 1, 2010. In November 2012, scikit-learn and scikit-image were described as two of the "well-maintained and popular" scikits. In 2019, it was noted that scikit-learn is one of the most popular machine learning libraries on GitHub.


Implementation

scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in Cython to improve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR. In such cases, extending these methods with Python may not be possible. scikit-learn integrates well with many other Python libraries, such as Matplotlib and Plotly for plotting, NumPy for array vectorization, pandas dataframes, SciPy, and many more.
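
The wrapped implementations described above are reached through ordinary estimator classes. The sketch below is illustrative only: it uses synthetic data invented for the example, and it assumes a recent scikit-learn release in which LIBLINEAR is not the default logistic regression solver and therefore has to be requested with `solver="liblinear"`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Synthetic, linearly separable two-class data (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(40, 3)),
    rng.normal(loc=2.0, scale=0.5, size=(40, 3)),
])
y = np.array([0] * 40 + [1] * 40)

# Each estimator delegates the numerical work to a wrapped C/C++ library.
models = {
    "SVC (LIBSVM)": SVC(kernel="rbf"),
    "LinearSVC (LIBLINEAR)": LinearSVC(),
    "LogisticRegression (LIBLINEAR solver)": LogisticRegression(solver="liblinear"),
}

# Fit every model on the same arrays and record its training accuracy.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Because the optimization runs inside the wrapped library, customizing these models from Python is mostly limited to the constructor parameters the wrappers expose.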


Version history

scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as a part of his thesis work. In 2010, INRIA, the French Institute for Research in Computer Science and Automation, got involved and the first public release (v0.1 beta) was published in late January 2010.

* August 2013. scikit-learn 0.14
* July 2014. scikit-learn 0.15.0
* March 2015. scikit-learn 0.16.0
* November 2015. scikit-learn 0.17.0
* September 2016. scikit-learn 0.18.0
* July 2017. scikit-learn 0.19.0
* September 2018. scikit-learn 0.20.0
* May 2019. scikit-learn 0.21.0
* December 2019. scikit-learn 0.22.0
* May 2020. scikit-learn 0.23.0
* January 2021. scikit-learn 0.24
* September 2021. scikit-learn 1.0


See also

* mlpy
* spaCy
* NLTK
* Orange
* PyTorch
* TensorFlow
* Infer.NET
* List of numerical analysis software

