Datasaurus dozen

The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed. [1] It was inspired by the smaller Anscombe's quartet, created in 1973.

Data

The following table contains summary statistics for all thirteen data sets.

Property                                                    | Value          | Accuracy
Number of elements                                          | 142            | exact
Mean of x                                                   | 54.26          | to 2 decimal places
Sample standard deviation of x (s_x)                        | 16.76          | to 2 decimal places
Mean of y                                                   | 47.83          | to 2 decimal places
Sample standard deviation of y (s_y)                        | 26.93          | to 2 decimal places
Correlation between x and y                                 | −0.06          | to 3 decimal places
Linear regression line                                      | y = 53 − 0.1x  | to 0 and 1 decimal places, respectively
Coefficient of determination of the linear regression (R²)  | 0.004          | to 3 decimal places
Figure: The thirteen data sets in the Datasaurus Dozen, visualized and summarized.
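
These values can be checked with a few lines of pandas. The sketch below assumes the combined data file released with the paper, a tab-separated file named DatasaurusDozen.tsv with columns dataset, x, and y; the file name and column names are assumptions, so adjust them to the copy you download.

import pandas as pd

# Assumed layout: one row per point, columns "dataset", "x", "y".
df = pd.read_csv("DatasaurusDozen.tsv", sep="\t")

for name, g in df.groupby("dataset"):
    print(f"{name:12s} n={len(g):3d}  "
          f"mean_x={g.x.mean():.2f}  sd_x={g.x.std():.2f}  "   # .std() is the sample SD (ddof=1)
          f"mean_y={g.y.mean():.2f}  sd_y={g.y.std():.2f}  "
          f"corr={g.x.corr(g.y):+.3f}")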

The thirteen data sets were labeled as follows: dino (the original dinosaur), away, h_lines, v_lines, x_shape, star, high_lines, dots, circle, bullseye, slant_up, slant_down, and wide_lines.

Similar to Anscombe's quartet, the Datasaurus dozen further illustrates the importance of looking at a set of data graphically before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic data sets. [2] [3] [4] [5] [1] [6]

Creation

Figure: The dinosaur data set created by Alberto Cairo that inspired the creation of the Datasaurus Dozen.

The first data set, in the shape of a Tyrannosaurus, was constructed in 2016 by Alberto Cairo and inspired the rest of the "Datasaurus" collection. [7] [8] Maarten Lambrechts proposed that this data set also be called the "Anscombosaurus". [7]

This data set was then accompanied by twelve others created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike Anscombe's quartet, for which it is not known how the data were generated, [9] the authors used simulated annealing to construct these data sets, making small, random changes to individual points that were biased toward the desired target shape while leaving the summary statistics essentially unchanged. Each shape took 200,000 iterations of such perturbations to complete. [1]

The pseudocode for this algorithm is as follows:

current_ds ← initial_ds
for x iterations, do:
    test_ds ← perturb(current_ds, temp)
    if similar_enough(test_ds, initial_ds):
        current_ds ← test_ds

function perturb(ds, temp):
    loop:
        test ← move_random_points(ds)
        if fit(test) > fit(ds) or temp > random():
            return test

where initial_ds is the seed data set (the dinosaur), current_ds is the working copy that gradually takes on the target shape, perturb moves a small number of randomly chosen points, fit measures how closely a candidate matches the target shape, similar_enough checks that the summary statistics of the candidate still equal those of initial_ds to two decimal places, and temp is the simulated-annealing temperature that occasionally lets a worse-fitting candidate be accepted so the search does not get stuck in a local optimum.
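
The Python sketch below mirrors this pseudocode. It is illustrative only and not the authors' code: the fitness function, the cooling schedule, and the two-decimal-place check are assumptions chosen to match the description above, and target stands for a set of points tracing the desired shape.

import random
import statistics

def stats(ds):
    # Summary statistics that must be preserved, rounded to 2 decimal places.
    xs, ys = zip(*ds)
    return tuple(round(v, 2) for v in (
        statistics.mean(xs), statistics.stdev(xs),
        statistics.mean(ys), statistics.stdev(ys),
    ))

def fit(ds, target):
    # Negative total squared distance from each point to its nearest target point
    # (higher is better, i.e. closer to the desired shape).
    return -sum(min((x - tx) ** 2 + (y - ty) ** 2 for tx, ty in target)
                for x, y in ds)

def move_random_point(ds, scale=0.1):
    # Nudge one randomly chosen point by a small random amount.
    ds = list(ds)
    i = random.randrange(len(ds))
    x, y = ds[i]
    ds[i] = (x + random.gauss(0, scale), y + random.gauss(0, scale))
    return ds

def perturb(ds, target, temp):
    # Keep proposing moves until one improves the fit, or until the
    # temperature allows a worse move to be accepted (simulated annealing).
    while True:
        test = move_random_point(ds)
        if fit(test, target) > fit(ds, target) or temp > random.random():
            return test

def datasaurus(initial_ds, target, iterations=200_000):
    current = list(initial_ds)
    for i in range(iterations):
        temp = max(0.0, 0.4 * (1 - i / iterations))    # cooling schedule (illustrative)
        test = perturb(current, target, temp)
        if stats(test) == stats(initial_ds):           # "similar_enough": stats unchanged to 2 d.p.
            current = test
    return current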

See also

Related Research Articles

Simulated annealing: Probabilistic optimization technique and metaheuristic

Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic for approximate global optimization in a large search space. For problems with many local optima, SA can often find a good approximation to the global optimum, and it is frequently used when the search space is discrete. When finding an approximate global optimum in a fixed amount of time matters more than finding a precise local optimum, simulated annealing may be preferable to alternatives such as gradient descent or to exact algorithms such as branch and bound.

Overfitting: Flaw in mathematical modelling

In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitted model is a mathematical model that contains more parameters than can be justified by the data. In a regression setting, for example, the number of parameters might correspond to the degree of a polynomial fitted to the data. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.
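
As a small, self-contained illustration (constructed here, not taken from the cited sources), fitting a degree-9 polynomial to ten noisy but essentially linear points passes nearly through every training point yet typically predicts fresh points worse than a straight line does:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 2 * x + rng.normal(0, 0.2, size=x.size)    # genuinely linear data plus noise

line = np.polyfit(x, y, deg=1)      # 2 parameters: about right for this data
wiggly = np.polyfit(x, y, deg=9)    # 10 parameters: enough to chase the noise

x_new = np.linspace(0.05, 1.05, 11)             # fresh points not used for fitting
truth = 2 * x_new
print("line   max error:", np.abs(np.polyval(line, x_new) - truth).max())
print("deg-9  max error:", np.abs(np.polyval(wiggly, x_new) - truth).max())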

Time series: Sequence of data points over time

In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

Cross-validation (statistics): Statistical model validation technique

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. Cross-validation includes resampling and sample splitting methods that use different portions of the data to test and train a model on different iterations. It is often used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. It can also be used to assess the quality of a fitted model and the stability of its parameters.
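
A minimal sketch of k-fold cross-validation for a straight-line fit, written with plain NumPy; the helper name k_fold_mse and the use of mean squared error are choices made here, not part of any cited source:

import numpy as np

def k_fold_mse(x, y, k=5, rng=None):
    # Estimate the out-of-sample error of a straight-line fit by k-fold cross-validation.
    if rng is None:
        rng = np.random.default_rng()
    x, y = np.asarray(x, float), np.asarray(y, float)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train], y[train], deg=1)   # fit on the other k-1 folds
        pred = slope * x[test] + intercept                         # predict the held-out fold
        errors.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errors))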

Linear trend estimation is a statistical technique used to analyze data patterns. When a series of measurements of a process are treated as a sequence or time series, trend estimation can be used to make and justify statements about tendencies in the data by relating the measurements to the times at which they occurred. This model can then be used to describe the behavior of the observed data.

Regression analysis: Set of statistical processes for estimating the relationships among variables

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.
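
For a single explanatory variable, the least-squares line follows directly from the summary statistics: the slope is r times s_y / s_x, and the line passes through the point of means. Plugging in the values from the table above reproduces the regression line quoted there (a quick consistency check, not code from any cited source):

# Summary statistics from the table above.
mean_x, mean_y = 54.26, 47.83
sd_x, sd_y, r = 16.76, 26.93, -0.06

slope = r * sd_y / sd_x               # least-squares slope: r * s_y / s_x
intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
print(f"y = {intercept:.0f} {slope:+.1f}x")   # prints: y = 53 -0.1x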

Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981. They used RANSAC to solve the Location Determination Problem (LDP), in which the goal is to determine the point in space from which an image of a set of landmarks with known locations was taken.
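
A compact sketch of RANSAC for fitting a line in the presence of outliers; the parameter names and the inlier tolerance are illustrative choices, not part of the original algorithm's specification:

import numpy as np

def ransac_line(x, y, n_iter=1000, tol=1.0, rng=None):
    # Fit y = a*x + b while ignoring outliers: repeatedly draw two random points,
    # fit a line through them, and keep the model supported by the most inliers.
    if rng is None:
        rng = np.random.default_rng()
    x, y = np.asarray(x, float), np.asarray(y, float)
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                                  # vertical pair: cannot fit y = a*x + b
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < tol       # points within tol of the candidate line
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit with ordinary least squares on the consensus set of inliers.
    a, b = np.polyfit(x[best_inliers], y[best_inliers], deg=1)
    return a, b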

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

In statistics, resampling is the creation of new samples based on one observed sample. Resampling methods are:

  1. Permutation tests
  2. Bootstrapping
  3. Cross validation
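
As an example of the second method, a percentile bootstrap for the correlation coefficient resamples (x, y) pairs with replacement; this is a sketch under the usual assumption of independent, identically distributed pairs, and the function name is illustrative:

import numpy as np

def bootstrap_corr_ci(x, y, n_boot=10_000, alpha=0.05, rng=None):
    # Percentile bootstrap confidence interval for the correlation of paired data.
    if rng is None:
        rng = np.random.default_rng()
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    corrs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample (x, y) pairs with replacement
        corrs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.quantile(corrs, [alpha / 2, 1 - alpha / 2])
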
Stepwise regression: Method of statistical factor analysis

In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a forward, backward, or combined sequence of F-tests or t-tests.

Anscombe's quartet: Four data sets with the same descriptive statistics, yet very different distributions

Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".

Data transformation (statistics)

In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set—that is, each data point zi is replaced with the transformed value yi = f(zi), where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.
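
A common example is the log transform applied to right-skewed data. The sketch below (illustrative, not from the cited sources) shows the mean and median moving together after the transformation:

import numpy as np

rng = np.random.default_rng(1)
z = rng.lognormal(mean=3.0, sigma=0.8, size=1000)   # right-skewed data (e.g. incomes)
y = np.log(z)                                       # y_i = f(z_i) with f = log

print(np.mean(z), np.median(z))   # mean well above median: strong skew
print(np.mean(y), np.median(y))   # nearly equal: transformed data is roughly symmetric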

Frank Anscombe: English statistician

Francis John Anscombe was an English statistician.

In statistics, regression validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.

Plot (graphics): Graphical technique for data sets

A plot is a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables. The plot can be drawn by hand or by a computer. In the past, sometimes mechanical or electronic plotters were used. Graphs are a visual representation of the relationship between variables, which are very useful for humans who can then quickly derive an understanding which may not have come from lists of values. Given a scale or ruler, graphs can also be used to read off the value of an unknown variable plotted as a function of a known one, but this can also be done with data presented in tabular form. Graphs of functions are used in mathematics, sciences, engineering, technology, finance, and other areas.
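
In the spirit of "always plot your data", the following sketch draws all thirteen Datasaurus sets as small multiples with matplotlib; it assumes the same file name and column layout as the pandas example in the Data section above:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per point, columns "dataset", "x", "y".
df = pd.read_csv("DatasaurusDozen.tsv", sep="\t")

groups = list(df.groupby("dataset"))
fig, axes = plt.subplots(3, 5, figsize=(12, 7), sharex=True, sharey=True)
for ax, (name, g) in zip(axes.flatten(), groups):
    ax.scatter(g.x, g.y, s=5)
    ax.set_title(name)
for ax in axes.flatten()[len(groups):]:   # 13 data sets, 15 panels: hide the extras
    ax.set_visible(False)
plt.tight_layout()
plt.show()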

Influential observation: Observation that would cause a large change if deleted

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation. In particular, in regression analysis an influential observation is one whose deletion has a large effect on the parameter estimates.
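
A tiny illustration (constructed here, not from the cited sources): deleting one observation that lies far from the rest flips the sign of the fitted slope, so that point is highly influential:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 19.0])   # the last observation is far from the others
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 2.0])   # ... and breaks the otherwise exact trend

with_point = np.polyfit(x, y, deg=1)[0]                 # slope using every observation
without_point = np.polyfit(x[:-1], y[:-1], deg=1)[0]    # slope after deleting the last point
print(f"slope with the point: {with_point:.2f}, without it: {without_point:.2f}")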

Apache Spark: Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Symbolic regression: Type of regression analysis

Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity.

Alberto Cairo: Information designer and professor

Alberto Cairo is a Spanish information designer and professor. Cairo is the Knight Chair in Visual Journalism at the School of Communication of the University of Miami.

References

  1. Matejka, Justin; Fitzmaurice, George (2017-05-02). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" (PDF). Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. CHI '17. New York, NY, USA: Association for Computing Machinery: 1290–1294. doi:10.1145/3025453.3025912. ISBN 978-1-4503-4655-9. Archived from the original on 2017-05-02.
  2. Elert, Glenn (2021). "Linear Regression". The Physics Hypertextbook.
  3. Janert, Philipp K. (2010). Data Analysis with Open Source Tools. O'Reilly Media. pp. 65–66. ISBN 978-0-596-80235-6.
  4. Chatterjee, Samprit; Hadi, Ali S. (2006). Regression Analysis by Example. John Wiley and Sons. p. 91. ISBN 0-471-74696-7.
  5. Saville, David J.; Wood, Graham R. (1991). Statistical Methods: The geometric approach. Springer. p. 418. ISBN 0-387-97517-9.
  6. Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN 0-9613921-4-2.
  7. Cairo, Alberto. "Download the Datasaurus: Never trust summary statistics alone; always visualize your data". Retrieved 2024-02-01.
  8. Murtagh, Jack (2024-02-01). "What This Graph of a Dinosaur Can Teach Us about Doing Better Science". Scientific American. Retrieved 2024-03-08.
  9. Chatterjee, Sangit; Firat, Aykut (2007). "Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset". The American Statistician. 61 (3): 248–254. doi:10.1198/000313007X220057. JSTOR 27643902. S2CID 121163371.