Pandas (software)


Pandas
Original author(s) Wes McKinney
Developer(s) Community
Initial release 11 January 2008 [citation needed]
Stable release 2.2.1 [1] / 23 February 2024
Preview release 2.0rc1 / 15 March 2023
Written in Python, Cython, C
Operating system Cross-platform
Type Technical computing
License New BSD License
Website pandas.pydata.org

Pandas (styled as pandas) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. [2] The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, [3] as well as a play on the phrase "Python data analysis". [4] :5 Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010. [5]


Pandas brought to Python many of the features for working with DataFrames that were already established in the R programming language. The library is built on top of another library, NumPy.

History

Developer Wes McKinney started working on Pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he convinced management to allow him to open-source the library.

Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library.

In 2015, Pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States. [6]

Data Model

Pandas is built around data structures called Series and DataFrames. Data for these collections can be imported from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel. [7]
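As a minimal sketch of importing data, the following reads the same two records from CSV text and from JSON text. The sample data and the names csv_text, json_text, df_csv, and df_json are illustrative assumptions, and in-memory strings stand in for real files.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory so the example needs no files.
csv_text = "name,score\nalice,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same records expressed as JSON (a list of objects).
json_text = '[{"name": "alice", "score": 90}, {"name": "bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls return a DataFrame with the columns name and score.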

A Series is a 1-dimensional data structure built on top of NumPy's array. [8] :97 Unlike in NumPy, each data point has an associated label. The collection of these labels is called an index. [4] :112 Series can be used arithmetically, as in the statement series_3 = series_1 + series_2: this will align data points with corresponding index values in series_1 and series_2, then add them together to produce new values in series_3. [4] :114

A DataFrame is a 2-dimensional data structure of rows and columns, similar to a spreadsheet, and analogous to a Python dictionary mapping column names (keys) to Series (values), with each Series sharing an index. [4] :115 DataFrames can be concatenated together or "merged" on columns or indices in a manner similar to joins in SQL. [4] :177–182 Pandas implements a subset of relational algebra, and supports one-to-one, many-to-one, and many-to-many joins. [8] :147–148

Older versions of Pandas also supported the less common Panel and Panel4D, which are 3-dimensional and 4-dimensional data structures respectively; [8] :141 both have since been removed from the library.
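The index alignment, dictionary-like construction, and SQL-style merging described above can be sketched as follows. The labels, values, and column names (key, left_val, right_val) are made up for illustration.

```python
import pandas as pd

series_1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
series_2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on index labels rather than on position.
# Labels present in only one operand ("a" and "d") produce NaN.
series_3 = series_1 + series_2

# A DataFrame built like a dictionary of column name -> Series;
# the columns share the union of the two indices.
df = pd.DataFrame({"x": series_1, "y": series_2})

# An inner join on a shared column, similar to a SQL JOIN.
left = pd.DataFrame({"key": ["k1", "k2", "k3"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["k2", "k3", "k4"], "right_val": [20, 30, 40]})
merged = pd.merge(left, right, on="key", how="inner")

print(series_3)
print(df)
print(merged)
```

Only the keys k2 and k3 appear in both frames, so the inner join keeps two rows.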

Users can transform or summarize data by applying arbitrary functions. [4] :132 Since Pandas is built on top of NumPy, NumPy functions such as its element-wise universal functions generally work on Series and DataFrames as well. [8] :115 Pandas also includes built-in operations for arithmetic, string manipulation, and summary statistics such as mean, median, and standard deviation. [4] :139,211 These built-in functions are designed to handle missing data, usually represented by the floating-point value NaN. [4] :142–143
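A short sketch of these operations, using made-up values: a NumPy function applied to a Series, an arbitrary function applied element by element, a built-in string method, and a summary statistic that skips missing data by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, np.nan, 9.0])

# A NumPy universal function returns a Series; NaN stays NaN.
roots = np.sqrt(s)

# An arbitrary Python function applied to each element.
doubled = s.apply(lambda v: v * 2)

# Built-in statistics skip NaN unless told otherwise.
mean_skipping = s.mean()             # average of 1, 4 and 9
mean_strict = s.mean(skipna=False)   # NaN, because one value is missing

# Built-in string manipulation through the .str accessor.
upper = pd.Series(["a", "b"]).str.upper()

print(roots, doubled, mean_skipping, mean_strict, upper, sep="\n")
```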

Subsets of data can be selected by column name, index, or Boolean expressions. For example, df[df['col1'] > 5] will return all rows in the DataFrame df for which the value of the column col1 exceeds 5. [4] :126–128 Data can be grouped together by a column value, as in df['col1'].groupby(df['col2']), or by a function which is applied to the index. For example, df.groupby(lambda i: i % 2) groups data by whether the index is even. [4] :253–259
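The selection and grouping expressions above can be tried on a small DataFrame. The values are invented for illustration, and a sum is used as an example aggregation.

```python
import pandas as pd

df = pd.DataFrame({"col1": [3, 6, 9, 12], "col2": ["x", "y", "x", "y"]})

# Boolean selection: keep rows where col1 exceeds 5.
selected = df[df["col1"] > 5]

# Group col1 by the values of col2, then sum each group.
by_col2 = df["col1"].groupby(df["col2"]).sum()

# Group by a function of the index: 0 for even labels, 1 for odd labels.
by_parity = df.groupby(lambda i: i % 2)["col1"].sum()

print(selected)
print(by_col2)
print(by_parity)
```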

Pandas includes support for time series, such as the ability to interpolate values [4] :316–317 and filter using a range of timestamps (e.g. data['1/1/2023':'2/2/2023'] will return all entries dated from January 1 through February 2, 2023). [4] :295 Pandas represents missing time series data using a special NaT (Not a Timestamp) object, instead of the NaN value it uses elsewhere. [4] :292
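A minimal sketch of these time series features, assuming a small daily Series with invented values: interpolation of gaps, slicing by a range of date strings, and NaT for a missing timestamp.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=5, freq="D")
data = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=dates)

# Linear interpolation fills each gap from its neighbours.
filled = data.interpolate()

# Slicing with date strings includes both endpoints.
window = data["2023-01-02":"2023-01-04"]

# A missing timestamp is stored as NaT rather than NaN.
stamps = pd.to_datetime(pd.Series(["2023-01-01", None]))

print(filled)
print(window)
print(stamps)
```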

Indices

By default, a Pandas index is a series of integers ascending from 0, similar to the indices of Python lists. However, indices can use any NumPy data type, including floating point, timestamps, or strings. [4] :112
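For illustration, the same three values can be given a default integer index or an explicit index of strings, floats, or timestamps. The variable names below are arbitrary.

```python
import pandas as pd

values = [10, 20, 30]

# Without an explicit index, labels are the integers 0, 1, 2.
default = pd.Series(values)

# Explicit indices of other types.
by_string = pd.Series(values, index=["a", "b", "c"])
by_float = pd.Series(values, index=[0.5, 1.5, 2.5])
by_time = pd.Series(
    values,
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
)

print(default.index)
print(by_string.index)
print(by_float.index)
print(by_time.index)
```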

Pandas' syntax for mapping index values to relevant data is the same syntax Python uses to map dictionary keys to values. For example, if s is a Series, s['a'] will return the data point at index a. Unlike dictionary keys, index values are not guaranteed to be unique. If a Series uses the index value a for multiple data points, then s['a'] will instead return a new Series containing all matching values. [4] :136 A DataFrame's column names are stored and implemented identically to an index. As such, a DataFrame can be thought of as having two indices: one column-based and one row-based. Because column names are stored as an index, they are also not required to be unique. [8] :103–105
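The effect of repeated labels can be seen in a small sketch with invented data: a unique label yields a single value, while a repeated label yields every match.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "a", "b"])

single = s["b"]     # unique label: a single value
multiple = s["a"]   # repeated label: a Series with both matches

# Column names behave the same way, so they may repeat as well.
df = pd.DataFrame([[1, 2, 3]], columns=["x", "x", "y"])
dup_cols = df["x"]  # a DataFrame holding both "x" columns

print(single)
print(multiple)
print(s.index.is_unique)
print(dup_cols)
```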

If data is a Series, then data['a'] returns all values with the index value of a. However, if data is a DataFrame, then data['a'] returns all values in the column(s) named a. To avoid this ambiguity, Pandas supports the syntax data.loc['a'] as an alternative way to filter using the index. Pandas also supports the syntax data.iloc[n], which always takes an integer n and returns the nth value, counting from 0. This allows a user to act as though the index is an array-like sequence of integers, regardless of how it's actually defined. [8] :110–113
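The difference between plain brackets, loc, and iloc can be sketched as follows. The frame, its labels, and the deliberately reversed integer index are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

column_a = df["a"]    # plain brackets on a DataFrame select a column
row_x = df.loc["x"]   # loc selects by index label (a row)
row_0 = df.iloc[0]    # iloc selects by position, counting from 0

# With an integer index in reverse order, label and position differ.
s = pd.Series([10, 20, 30], index=[2, 1, 0])
by_label = s.loc[0]      # the value whose label is 0
by_position = s.iloc[0]  # the first value in the Series

print(column_a, row_x, row_0, by_label, by_position, sep="\n")
```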

Pandas supports hierarchical indices with multiple values per data point. An index with this structure, called a "MultiIndex", allows a single DataFrame to represent multiple dimensions, similar to a pivot table in Microsoft Excel. [4] :147–148 Each level of a MultiIndex can be given a unique name. [8] :133 In practice, data with more than 2 dimensions is often represented using DataFrames with hierarchical indices, instead of the higher-dimensional Panel and Panel4D data structures. [8] :128
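As a sketch of a hierarchical index, the following builds a two-level MultiIndex with named levels and pivots one level into columns. The level names year and quarter and the sales figures are hypothetical.

```python
import pandas as pd

# Two named levels: every combination of year and quarter.
index = pd.MultiIndex.from_product(
    [["2022", "2023"], ["Q1", "Q2"]],
    names=["year", "quarter"],
)
sales = pd.Series([100, 110, 120, 130], index=index)

# Selecting on the outer level returns the inner level for that year.
sales_2023 = sales.loc["2023"]

# Moving the "quarter" level into columns gives a pivot-table layout.
table = sales.unstack("quarter")

print(sales)
print(sales_2023)
print(table)
```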

Criticisms

Pandas has been criticized for its inefficiency. Pandas can require 5 to 10 times as much memory as the size of the underlying data, and the entire dataset must be loaded in RAM. The library does not optimize query plans or support parallel computing across multiple cores. Wes McKinney, the creator of Pandas, has recommended Apache Arrow as an alternative to address these performance concerns and other limitations. [9]

References

  1. "Pandas 2.2.1". 23 February 2024.
  2. "License – Package overview – pandas 1.0.0 documentation". pandas. 28 January 2020. Retrieved 30 January 2020.
  3. Wes McKinney (2011). "pandas: a Foundational Python Library for Data Analysis and Statistics" (PDF). Retrieved 2 August 2018.
  4. McKinney, Wes (2014). Python for Data Analysis (First ed.). O'Reilly. ISBN 978-1-449-31979-3.
  5. Kopf, Dan. "Meet the man behind the most important tool in data science". Quartz. Retrieved 17 November 2020.
  6. "NumFOCUS – pandas: a fiscally sponsored project". NumFOCUS. Retrieved 3 April 2018.
  7. "IO tools (Text, CSV, HDF5, …) — pandas 1.4.1 documentation".
  8. VanderPlas, Jake (2016). Python Data Science Handbook: Essential Tools for Working with Data (First ed.). O'Reilly. ISBN 978-1-491-91205-8.
  9. McKinney, Wes (21 September 2017). "Apache Arrow and the "10 Things I Hate About pandas"". wesmckinney.com. Retrieved 21 December 2023.
