Wes McKinney

Wes McKinney is an American software developer and businessman. He is the creator and "Benevolent Dictator for Life" (BDFL) of the open-source pandas package for data analysis in the Python programming language, and has authored three editions of the reference book Python for Data Analysis. [1] [2] He is also the creator of Apache Arrow, a cross-language development platform for in-memory data, and Ibis, a unified Python dataframe API. He was the founder and CEO of the technology startup Datapad and later worked as a software engineer at Two Sigma Investments. He founded Ursa Labs, [3] which became part of Voltron Data in 2021. [4] In 2022, Voltron Data announced that it had raised $110 million. [5]

Early life and education

McKinney graduated from MIT with a B.S. in Mathematics in 2007. [1] In 2010, he began a PhD program in Statistics at Duke University, but went on leave in 2011. [6]

Career

From 2007 to 2010, McKinney researched global macro and credit trading strategies at AQR Capital Management. During his time at AQR Capital, he learned Python and started building what would become pandas. [1] McKinney made the pandas project public in 2009. [6]

McKinney left AQR in 2010 to start a PhD in Statistics at Duke University. He went on leave from Duke in the summer of 2011 to devote more time to developing pandas, [6] work that culminated in his writing Python for Data Analysis in 2012.

In 2012, he co-founded Lambda Foundry Inc. [7]

McKinney co-founded Datapad with Chang She in January 2013, with McKinney as CEO. Datapad developed a data visualization product, built on the Python stack, for enterprise customers. Datapad was acquired by Cloudera in September 2014. [8] [9] McKinney joined the engineering team at Cloudera following the acquisition. There he worked on Ibis, an open-source project incubated within Cloudera Labs that aimed to apply Python to big data problems. [10] In 2016, McKinney joined the investment fund Two Sigma Investments to work on Apache Arrow. In 2018, he launched Ursa Labs. [3] In 2023, he joined Posit (formerly RStudio) as a Principal Architect. [11]

Media coverage

McKinney has been interviewed by VentureBeat and others. [12] [13] [14] He frequently gives talks to the Python community. [15] [16]

References

  1. McKinney, Wes (2013). Python for Data Analysis (1st ed.). Sebastopol, Calif.: O'Reilly. ISBN 978-1449319793.
  2. McKinney, Wes (2017). Python for Data Analysis (2nd ed.). Sebastopol, Calif.: O'Reilly. ISBN 978-1491957660.
  3. "Announcing Ursa Labs: An innovation lab for open source data science". 19 April 2018.
  4. McKinney, Wes (2021-08-05). "Wes McKinney - Joining Forces for an Arrow-Native Future". wesmckinney.com. Retrieved 2024-02-28.
  5. Miller, Ron (2022-02-17). "Voltron Data grabs $110M to build startup based on Apache Arrow project". TechCrunch. Retrieved 2024-02-28.
  6. Kopf, Dan. "Meet the man behind the most important tool in data science". Quartz. 8 December 2017. Retrieved 24 October 2019.
  7. "wesmckinney.com". Retrieved 26 July 2023.
  8. "Data startup DataPad gets acquired, says it will shut down on Friday". VentureBeat. 29 September 2014. Retrieved 2016-01-10.
  9. "Cloudera Bought Datapad". GigaOm. 30 September 2014. Retrieved 10 January 2016.
  10. "Ibis on Impala: Python at Scale for Data Science - Cloudera Engineering Blog". Cloudera Engineering Blog. Retrieved 2016-01-10. [W]e are excited to announce a new open source project, called Ibis, that will deliver the great Python experience and ecosystem, only at any data and node scale.
  11. "Welcome, Wes!". Posit. 2023-11-06. Retrieved 2024-01-26.
  12. "DataPad emerges to let everyone at your company create and play with charts". VentureBeat. 20 May 2014. Retrieved 2016-01-10.
  13. "Meet Quantopian's Newest Advisor: Wes McKinney". Quantopian Blog. Retrieved 2016-01-10.
  14. "Big data's 4 big Vs: It's our Data Summit highlights - Web Summit Blog". Web Summit Blog. Retrieved 2016-01-10.
  15. "LFPUG: Python in the enterprise + Pandas | Enthought Blog". blog.enthought.com. Retrieved 2016-01-10.
  16. "Big Data Conference - Wes McKinney". O'Reilly Media. Retrieved 10 January 2016.