List of important publications in data science

Last updated May 22, 2024

This is a list of important publications in data science, generally organized by order of use in a data analysis workflow.

General inclusion criteria:

Papers from notable practitioners or notable professors, either with a Wikipedia page or a reference to their notability
Common knowledge all data professionals should know
Highly cited applied statistics and machine learning publications
Discussion-facilitating papers on the field of data science as a whole (for example, "Attention Is All You Need" could arguably be added here as a landmark paper,^[1] but it is specific to generative artificial intelligence rather than relevant to all data practitioners)

Some reasons why a particular publication might be regarded as important:

Topic creator – A publication that created a new topic
Breakthrough – A publication that changed scientific knowledge significantly
Influence – A publication which has significantly influenced the world or has had a massive impact on the teaching of data science

When possible, a reference is used to validate the inclusion of the publication in this list.

History

Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)

Author: Leo Breiman

Publication data:^[2]

Online version: https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.pdf

Description: Describes two cultures of statistical modeling: one assumes the data are generated by a parsimonious stochastic model, while the other uses algorithmic models and treats the mechanism that generated the data as unknown. Breiman argues that although statistics has traditionally favored the stochastic model, there is value in expanding the methods that statisticians can use to study phenomena.

Importance: Influenced the philosophies of statisticians just before the increased use of machine learning and deep learning methods. A 20-year retrospective on this article states that "Breiman's words are perhaps more relevant than ever".^[3] Notable statisticians at the time wrote commentaries on the publication. Although critical of the publication overall, David Cox wrote that it "contains enough truth and exposes enough weaknesses to be thought-provoking."^[2] Bradley Efron called it a "stimulating paper".^[2] Emanuel Parzen commented that "Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling".^[2]

50 Years of Data Science

Author: David Donoho

Publication data:^[4]

Online version: https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734

Description: Retrospective discussion paper on the history and origins of data science, with commentaries from a number of notable statisticians.

Importance: This has been described as "the first in the field to present such a comprehensive and in-depth survey and overview",^[5] and it helps to define a field that has many definitions.

The Composable Data Management System Manifesto

Author: Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R Valluri, Mohamed Zait, Jacques Nadeau

Publication data:^[6]

Online version: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf

Description: A vision paper advocating a paradigm shift in how data management systems are designed, using standard, composable, interoperable tools rather than siloed software tools.

Importance: A paradigm-shifting view of how future data science software tools should be designed for more efficient workflows, the principles of which "will be especially crucial for addressing fragmentation, improving interoperability, and promoting user-centricity as data ecosystems grow increasingly complex".^[7]

Data collection and organization

Tidy Data

Author: Hadley Wickham

Publication data:^[8]

Online version: https://www.jstatsoft.org/article/view/v059i10/ https://vita.had.co.nz/papers/tidy-data.pdf

Description: Describes a framework for data cleaning that is summarized in the quote, "each variable is a column, each observation is a row, and each type of observational unit is a table".^[8] This provides a standard data structure around which data analysis tools can be consistently built.

Importance: Cited over 1,500 times, this effort toward tidy data has been described by David Donoho as having "more impact on today’s practice of data analysis than many highly regarded theoretical statistics articles".^[4] In the context of data visualization, the tidy format is said to support "efficient exploration and prototyping because variables can be assigned different roles in the plot without modifying anything about the original dataset".^[9]
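The tidy layout quoted above can be illustrated with a small reshaping sketch. The example below is only an illustration, not code from the paper: the table contents, the column names, and the helper `to_tidy` are hypothetical, and it uses only the Python standard library. It turns a "wide" table, where each treatment has its own column, into a "long" table where each row is a single observation.

```python
# Hypothetical "wide" table: one row per person, one column per treatment.
# The values are made up for illustration.
wide_rows = [
    {"person": "A", "treatment_a": 16, "treatment_b": 11},
    {"person": "B", "treatment_a": 3, "treatment_b": 1},
]


def to_tidy(rows, id_column, value_columns, variable_name, value_name):
    """Reshape wide rows so that each output row holds exactly one observation.

    `to_tidy` is a hypothetical helper written for this sketch; it mirrors
    what "melt"-style functions in data analysis libraries typically do.
    """
    tidy = []
    for row in rows:
        for column in value_columns:
            tidy.append(
                {
                    id_column: row[id_column],      # who was observed
                    variable_name: column,          # which variable was measured
                    value_name: row[column],        # the measured value
                }
            )
    return tidy


tidy_rows = to_tidy(
    wide_rows,
    id_column="person",
    value_columns=["treatment_a", "treatment_b"],
    variable_name="treatment",
    value_name="result",
)

for row in tidy_rows:
    print(row)
```

In the tidy form, "treatment" and "result" are ordinary columns, so a plotting or modeling tool can assign them different roles without the dataset being restructured again.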

Data Organization in Spreadsheets

Author: Karl W. Broman and Kara H. Woo

Publication data:^[10]

Online version: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

Description: This article offers practical recommendations for organizing data in spreadsheets, such as Microsoft Excel and Google Sheets, to reduce errors and ease later analysis, given the limitations and quirks of spreadsheet software.

Importance: Influences the teaching of both data and non-data practitioners to create more analysis-friendly spreadsheets, and has been described as outlining "spreadsheet best practices".^[11]

Data visualizations

Quantitative Graphics in Statistics: A Brief History

Author: James R. Beniger and Dorothy L. Robyn

Publication data:^[12]

Online version: https://www.jstor.org/stable/2683467

Description: Outlines the history and evolution of quantitative graphics in statistics, covering spatial organization (17th and 18th centuries), discrete comparison (18th and 19th centuries), continuous distribution (19th century), and multivariate distribution and correlation (late 19th and 20th centuries).

Importance: Helps data practitioners who are still learning put into perspective how recent the graphics in common use are. A later publication, "Graphical Methods in Statistics" by Stephen Fienberg (1979), states that it "owes much to the work of Beniger and Robyn".^[13]

Tooling

Hidden Technical Debt in Machine Learning Systems

Author: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison

Publication data:^[14]

Online version: https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

Description: This paper argues that it is "dangerous to think of [complex machine learning] quick wins as coming for free" and gives an overview of risk factors to account for when implementing a machine learning system.

Importance: All authors worked for Google. The article has been cited over 1,000 times^[15] and has helped practitioners who might otherwise implement a machine learning tool quickly without understanding its long-term maintenance.

A few useful things to know about machine learning

Author: Pedro Domingos

Publication data:^[16]

Online version: https://dl.acm.org/doi/10.1145/2347736.2347755 https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

Description: The purpose of this paper is to distill otherwise inaccessible "folk knowledge" needed to implement machine learning projects effectively, because without it "machine learning projects take much longer than necessary or wind up producing less-than-ideal results".^[16]

Importance: Cited over 4,000 times,^[17] this paper has influenced the common set of knowledge for data practitioners using machine learning.^[18]

Teaching data science

The Introductory Statistics Course: A Ptolemaic Curriculum?

Author: George W. Cobb^[19]

Publication data:^[20]

Online version: https://escholarship.org/uc/item/6hb3k0nz

Description: This paper argues that teachers of statistics should rethink how they structure their introductory statistics courses, moving away from the technical machinery based on the normal distribution and towards simpler alternative methods based on permutations done on computers.

Importance: Cited over 300 times,^[21] this paper influenced teachers of statistics in the 21st century to reconsider teaching the mere mechanics of statistics, since computers can be leveraged to do more with less.
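The kind of permutation-based method the description refers to can be sketched in a few lines. The example below is a minimal illustration and not code from Cobb's paper: the function name `permutation_test`, the choice of a two-sided difference in means, the add-one correction, and the data are all assumptions made for this sketch, which uses only the Python standard library.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Approximate two-sided permutation test for a difference in means.

    Hypothetical helper for illustration: it repeatedly shuffles the pooled
    data, splits it into two groups of the original sizes, and counts how
    often the shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction (an assumption of this sketch) so the estimated
    # p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up measurements for two groups.
control = [12.1, 11.8, 12.5, 11.9, 12.3]
treated = [13.0, 13.4, 12.9, 13.6, 13.1]

p_value = permutation_test(control, treated)
print(f"approximate p-value: {p_value:.4f}")
```

The logic needs no reference to the normal distribution: if group labels were irrelevant, shuffling them should often produce differences as large as the one observed, and the p-value estimates how often that happens.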

References

[1] "Meet the $4 Billion AI Superstars That Google Lost". Bloomberg. 13 July 2023 – via www.bloomberg.com.
[2] Breiman, Leo (1 August 2001). "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)". Statistical Science. 16 (3). doi:10.1214/ss/1009213726. ISSN 0883-4237.
[3] Raper, Simon (29 January 2020). "Leo Breiman's "Two Cultures"". Significance - Oxford Academic. doi:10.1111/j.1740-9713.2020.01357.x. Retrieved 21 May 2024.
[4] Donoho, David (2 October 2017). "50 Years of Data Science". Journal of Computational and Graphical Statistics. 26 (4): 745–766. doi:10.1080/10618600.2017.1384734. ISSN 1061-8600.
[5] Cao, Longbing (29 June 2017). "Data Science: A Comprehensive Overview". ACM Computing Surveys. 50 (3): 43:1–43:42. arXiv:2007.03606. doi:10.1145/3076253. ISSN 0360-0300.
[6] Pedreira, Pedro; Erling, Orri; Karanasos, Konstantinos; Schneider, Scott; McKinney, Wes; Valluri, Satya R; Zait, Mohamed; Nadeau, Jacques (1 June 2023). "The Composable Data Management System Manifesto". Proceedings of the VLDB Endowment. 16 (10): 2679–2685. doi:10.14778/3603581.3603604. ISSN 2150-8097.
[7] Somrah, Priyanka (18 April 2024). "Distilling The Composable Data Management System Manifesto". Work-Bench. Retrieved 17 May 2024.
[8] Wickham, Hadley (12 September 2014). "Tidy Data". Journal of Statistical Software. 59 (10): 1–23. doi:10.18637/jss.v059.i10. ISSN 1548-7660.
[9] Waskom, Michael (6 April 2021). "seaborn: statistical data visualization". Journal of Open Source Software. 6 (60): 3021. Bibcode:2021JOSS....6.3021W. doi:10.21105/joss.03021. ISSN 2475-9066.
[10] Broman, Karl W.; Woo, Kara H. (2 January 2018). "Data Organization in Spreadsheets". The American Statistician. 72 (1): 2–10. doi:10.1080/00031305.2017.1375989. ISSN 0003-1305.
[11] Estaki, Mehrbod; Jiang, Lingjing; Bokulich, Nicholas A.; McDonald, Daniel; González, Antonio; Kosciolek, Tomasz; Martino, Cameron; Zhu, Qiyun; Birmingham, Amanda; Vázquez-Baeza, Yoshiki; Dillon, Matthew R.; Bolyen, Evan; Caporaso, J. Gregory; Knight, Rob (2020). "QIIME 2 Enables Comprehensive End-to-End Analysis of Diverse Microbiome Data and Comparative Studies with Publicly Available Data". Current Protocols in Bioinformatics. 70 (1): e100. doi:10.1002/cpbi.100. ISSN 1934-3396. PMC 9285460. PMID 32343490.
[12] Beniger, James R.; Robyn, Dorothy L. (1 February 1978). "Quantitative Graphics in Statistics: A Brief History". The American Statistician. 32 (1): 1–11. doi:10.2307/2683467. JSTOR 2683467 – via JSTOR.
[13] Fienberg, Stephen E. (1979). "Graphical Methods in Statistics". The American Statistician. 33 (4): 165. doi:10.2307/2683729.
[14] Sculley, D.; Holt, Gary; Golovin, Daniel; Davydov, Eugene; Phillips, Todd; Ebner, Dietmar; Chaudhary, Vinay; Young, Michael; Crespo, Jean-Francois; Dennison, Dan (7 December 2015). "Hidden technical debt in Machine learning systems". Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS'15. Cambridge, MA, USA: MIT Press: 2503–2511.
[15] Google Scholar references https://scholar.google.com/scholar?cites=2255096949091421445&as_sdt=800005&sciodt=0,15&hl=en
[16] Domingos, Pedro (1 October 2012). "A few useful things to know about machine learning". Communications of the ACM. 55 (10): 78–87. doi:10.1145/2347736.2347755. ISSN 0001-0782.
[17] Google Scholar references https://scholar.google.com/scholar?cites=4404716649035182981&as_sdt=40005&sciodt=0,10&hl=en&oi=gsb
[18] Burrell, Jenna (1 June 2016). "How the machine 'thinks': Understanding opacity in machine learning algorithms". Big Data & Society. 3 (1): 205395171562251. doi:10.1177/2053951715622512. ISSN 2053-9517.
[19] "Remembering George Cobb (1947–2020) | Amstat News". 1 July 2020. Retrieved 21 April 2024.
[20] Cobb, George W (12 October 2007). "The Introductory Statistics Course: A Ptolemaic Curriculum?". Technology Innovations in Statistics Education. 1 (1). doi:10.5070/t511000028. ISSN 1933-4214.
[21] Google Scholar references https://scholar.google.com/scholar?cites=13882980985899619210&as_sdt=800005&sciodt=0,15&hl=en&oi=gsb

External links

Papers and tech blogs by companies sharing their work on data science and machine learning in production.

