This is a list of important publications in data science, generally organized by order of use in a data analysis workflow.
See the list of important publications in statistics for more research-based and fundamental publications; this list is more applied, business-oriented, and cross-disciplinary.
General article inclusion criteria are:
Some reasons why a particular publication might be regarded as important:
When possible, a reference is used to validate the inclusion of the publication in this list.
Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)
50 Years of Data Science
The Composable Data Management System Manifesto
Tidy Data
Data Organization in Spreadsheets
Quantitative Graphics in Statistics: A Brief History
Hidden Technical Debt in Machine Learning Systems
A Few Useful Things to Know About Machine Learning
The Introductory Statistics Course: A Ptolemaic Curriculum
A data set is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files.
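The row-and-column layout described above can be illustrated with a minimal sketch. The field names (`id`, `height_cm`, `weight_kg`), the values, and the `column` helper are hypothetical and chosen only for illustration.

```python
# A tiny tabular data set: each dict is one record (row),
# and each key is one variable (column). Values are made up.
records = [
    {"id": 1, "height_cm": 12.0, "weight_kg": 0.8},
    {"id": 2, "height_cm": 30.5, "weight_kg": 2.1},
    {"id": 3, "height_cm": 18.2, "weight_kg": 1.4},
]


def column(rows, name):
    """Return the values of one variable across all records."""
    return [row[name] for row in rows]


heights = column(records, "height_cm")  # one column of the table
variables = sorted(records[0].keys())   # the variables the table records
```

Reading down a key gives a variable; reading across a dict gives a record.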
John Wilder Tukey was an American mathematician and statistician, best known for the development of the fast Fourier transform (FFT) algorithm and the box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma all bear his name. He is also credited with coining the term bit and the first published use of the word software.
Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.
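The aggregation step described above can be sketched as follows. This shows only how per-tree outputs are combined, not how the trees are trained; the function names and the sample per-tree outputs are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Classification: return the class predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return mean(tree_predictions)


# Made-up outputs from five classification trees and three regression trees.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_preds = [2.0, 3.0, 4.0]

predicted_class = forest_classify(class_votes)
predicted_value = forest_regress(value_preds)
```

Averaging over many trees is what dampens the overfitting of any single tree.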
Model selection is the task of selecting the best model from among various candidates on the basis of a performance criterion. In the context of machine learning and more generally statistical analysis, this may be the selection of a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice.
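One way to picture this is a minimal sketch that scores candidates on held-out data and prefers the simplest model among those with similar error. It assumes mean squared error as the performance criterion and a hand-assigned complexity number per model. The candidate functions, the data, and the `tolerance` parameter are all hypothetical, and the models are fixed functions rather than fitted ones.

```python
def mean_squared_error(predict, data):
    """Average squared difference between predictions and observed values."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def select_model(candidates, validation, tolerance=1e-6):
    """Pick the lowest-error model; among near-ties, pick the simplest."""
    scored = [
        (mean_squared_error(predict, validation), complexity, name)
        for name, complexity, predict in candidates
    ]
    best_error = min(error for error, _, _ in scored)
    near_best = [s for s in scored if s[0] <= best_error + tolerance]
    return min(near_best, key=lambda s: s[1])[2]


# Made-up validation data generated from y = 2x + 1.
validation = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# (name, complexity, prediction function); complexity is assigned by hand.
candidates = [
    ("constant", 1, lambda x: 4.0),
    ("linear", 2, lambda x: 2.0 * x + 1.0),
    ("cubic", 4, lambda x: 0.0 * x**3 + 2.0 * x + 1.0),
]

chosen = select_model(candidates, validation)
```

Here the linear and cubic candidates fit the validation data equally well, so the simpler linear model is chosen.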
Leo Breiman was a distinguished statistician at the University of California, Berkeley. He was the recipient of numerous honors and awards, and was a member of the United States National Academy of Sciences.
Zoubin Ghahramani FRS is a British-Iranian researcher and Professor of Information Engineering at the University of Cambridge. He holds joint appointments at University College London and the Alan Turing Institute, and has been a Fellow of St John's College, Cambridge since 2009. He was Associate Research Professor at Carnegie Mellon University School of Computer Science from 2003 to 2012. He was also the Chief Scientist of Uber from 2016 until 2020. He joined Google Brain in 2020 as senior research director. He is also Deputy Director of the Leverhulme Centre for the Future of Intelligence.
Donald Jay Geman is an American applied mathematician and a leading researcher in the field of machine learning and pattern recognition. He and his brother, Stuart Geman, are very well known for proposing the Gibbs sampler and for the first proof of the convergence of the simulated annealing algorithm, in an article that became a highly cited reference in engineering. He is a professor at the Johns Hopkins University and simultaneously a visiting professor at École Normale Supérieure de Cachan.
Robert Tibshirani is a professor in the Departments of Statistics and Biomedical Data Science at Stanford University. He was a professor at the University of Toronto from 1985 to 1998. In his work, he develops statistical tools for the analysis of complex datasets, most recently in genomics and proteomics.
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, scientific visualization, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.
Bin Yu is a Chinese-American statistician. She is currently Chancellor's Professor in the Departments of Statistics and of Electrical Engineering & Computer Sciences at the University of California, Berkeley.
Jerome Harold Friedman is an American statistician, consultant and Professor of Statistics at Stanford University, known for his contributions in the field of statistics and data mining.
Hadley Alexander Wickham is a New Zealand statistician known for his work on open-source software for the R statistical programming environment. He is the chief scientist at Posit PBC and an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University. His work includes the data visualisation system ggplot2 and the tidyverse, a collection of R packages for data science based on the concept of tidy data.
James Ralph Beniger was an American historian and sociologist and Professor of Communications and Sociology at the Annenberg School for Communication at the University of Southern California, particularly known for his early work on the history of quantitative graphics in statistics, and his later work on the technological and economic origins of the information society.
Michael Elad is a professor of Computer Science at the Technion - Israel Institute of Technology. His work includes fundamental contributions in the field of sparse representations, and deployment of these ideas to algorithms and applications in signal processing, image processing and machine learning.
Cynthia Diane Rudin is an American computer scientist and statistician specializing in machine learning and known for her work in interpretable machine learning. She is the director of the Interpretable Machine Learning Lab at Duke University, where she is a professor of computer science, electrical and computer engineering, statistical science, and biostatistics and bioinformatics. In 2022, she won the Squirrel AI Award for Artificial Intelligence for the Benefit of Humanity from the Association for the Advancement of Artificial Intelligence (AAAI) for her work on the importance of transparency for AI systems in high-risk domains.
Adele Cutler is a statistician known as one of the developers of archetypal analysis and of the random forest technique for ensemble learning. She is a professor of mathematics and statistics at Utah State University.
Roger D. Peng is an author and professor of Statistics and Data Science at the University of Texas at Austin. Peng received a Bachelor of Science in Applied Mathematics from Yale University in 1999, before going on to study at the University of California, Los Angeles, where he completed a Master of Science in Statistics in 2001 and a PhD in Statistics in 2003. His research has focused on environmental health, specifically air pollution and climate change. Peng is also a software engineer who has authored numerous R packages that apply statistical methods to a variety of topics. He has also created numerous resources including books, online courses, podcasts, blogs, and other articles to aid those learning data analysis.
Jasjeet "Jas" Singh Sekhon is a data scientist, political scientist, and statistician at Yale University. Sekhon is the Eugene Meyer Professor at Yale University, a fellow of the American Statistical Association, and a fellow of the Society for Political Methodology. Sekhon's primary research interests lie in causal inference, machine learning, and their intersection. He has also published research on their application in various fields including voting behavior, online experimentation, epidemiology, and medicine.
The science-wide author databases of standardized citation indicators is a multidimensional ranking of the world’s scientists produced since 2015 by a team of researchers led by John P. A. Ioannidis at Stanford.