Pandas (software)
   HOME

TheInfoList



OR:

pandas is a
software library In computer science, a library is a collection of non-volatile resources used by computer programs, often for software development. These may include configuration data, documentation, help data, message templates, pre-written code and sub ...
written for the
Python programming language Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically-typed and garbage-collected. It supports multiple programming p ...
for data manipulation and
analysis Analysis ( : analyses) is the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it. The technique has been applied in the study of mathematics and logic since before Aristotle (3 ...
. In particular, it offers
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, ...
s and operations for manipulating numerical tables and
time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Ex ...
. It is
free software Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, n ...
released under the three-clause BSD license. The name is derived from the term " panel data", an
econometrics Econometrics is the application of statistical methods to economic data in order to give empirical content to economic relationships. M. Hashem Pesaran (1987). "Econometrics," '' The New Palgrave: A Dictionary of Economics'', v. 2, p. 8 p. ...
term for
data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the ...
s that include observations over multiple time periods for the same individuals. Its name is a play on the phrase "Python data analysis" itself.
Wes McKinney Wes McKinney is an American software developer and businessman. He is the creator and "Benevolent Dictator for Life" (BDFL) of the open-source pandas package for data analysis in the Python programming language, and has also authored three version ...
started building what would become pandas at
AQR Capital AQR Capital Management (Applied Quantitative Research) is a global investment management firm based in Greenwich, Connecticut, United States. The firm, which was founded in 1998 by Cliff Asness, David Kabiller, John Liew, and Robert Krail, offers ...
while he was a researcher there from 2007 to 2010.


Library features

* Many inbuilt methods available for fast data manipulation made possible with vectorisation * DataFrame
object Object may refer to: General meanings * Object (philosophy), a thing, being, or concept ** Object (abstract), an object which does not exist at any particular time or place ** Physical object, an identifiable collection of matter * Goal, an ...
for multivariate data manipulation with integrated indexing. * Series object for univariate data manipulation with integrated indexing * Tools for reading and writing data between in-memory
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, ...
s and different
file format A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free. Some file format ...
s. * Data alignment and integrated handling of missing data. * Reshaping and pivoting of data sets. * Label-based slicing, fancy indexing, and subsetting of large data sets. * Data structure column insertion and deletion. * Group by engine allowing split-apply-combine operations on data sets. * Data set merging and joining. * Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure. * Time series-functionality: Date range generation and frequency conversions, moving window
statistics Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, indust ...
, moving window
linear regression In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is cal ...
s, date shifting and lagging. *Provides data filtration. The library is highly optimized for performance, with critical code paths written in
Cython Cython () is a programming language that aims to be a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax. Cython is a compiled ...
or C.


DataFrames

Pandas is mainly used for
data analysis Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enc ...
and associated manipulation of tabular data in DataFrames. Pandas allows importing data from various file formats such as
comma-separated values A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separ ...
,
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other se ...
, Parquet, SQL
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases ...
tables or queries, and
Microsoft Excel Microsoft Excel is a spreadsheet developed by Microsoft for Microsoft Windows, Windows, macOS, Android (operating system), Android and iOS. It features calculation or computation capabilities, graphing tools, pivot tables, and a macro (comp ...
. Pandas allows various data manipulation operations such as merging, reshaping, selecting, as well as data cleaning, and
data wrangling Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one " raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes ...
features. The development of pandas introduced into Python many comparable features of working with DataFrames that were established in the
R programming language R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners, bioinforma ...
. The pandas library is built upon another library NumPy, which is oriented to efficiently working with arrays instead of the features of working on DataFrames.


History

Developer
Wes McKinney Wes McKinney is an American software developer and businessman. He is the creator and "Benevolent Dictator for Life" (BDFL) of the open-source pandas package for data analysis in the Python programming language, and has also authored three version ...
started working on pandas in 2008 while at
AQR Capital Management AQR Capital Management (Applied Quantitative Research) is a global investment management firm based in Greenwich, Connecticut, United States. The firm, which was founded in 1998 by Cliff Asness, David Kabiller, John Liew, and Robert Krail, offers ...
out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR he was able to convince management to allow him to
open source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized so ...
the library. Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library. In 2015, pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.


Timeline:

* 2008: Development of ''pandas'' started * 2009: ''pandas'' becomes open source * 2012: First edition of ''Python for Data Analysis'' is published * 2015: ''pandas'' becomes a NumFOCUS sponsored project * 2018: First in-person core developer sprint


See also

*
matplotlib Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPy ...
* NumPy * Dask *
SciPy SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, ...
*
R (programming language) R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners, bioinform ...
*
scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector ...
*
statsmodels Statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available f ...
*
List of numerical analysis software Listed here are notable end-user computer applications intended for use with numerical or data analysis: Numerical-software packages General-purpose computer algebra systems Interface-oriented Language-oriented Historically signific ...


References


Further reading

* * * * *


External links

{{Use dmy dates, date=August 2019 Free statistical software Python (programming language) scientific libraries Software using the BSD license