Datasaurus dozen

The Datasaurus dozen comprises thirteen data sets whose simple descriptive statistics are nearly identical to two decimal places, yet whose distributions differ markedly and look very different when graphed. [1] It was inspired by the smaller Anscombe's quartet, which was created in 1973.

Data

The following table contains summary statistics for all thirteen data sets.

| Property | Value | Accuracy |
| --- | --- | --- |
| Number of elements | 142 | exact |
| Mean of x | 54.26 | to 2 decimal places |
| Sample standard deviation of x: s_x | 16.76 | to 2 decimal places |
| Mean of y | 47.83 | to 2 decimal places |
| Sample standard deviation of y: s_y | 26.93 | to 2 decimal places |
| Correlation between x and y | −0.06 | to 2 decimal places |
| Linear regression line | y = 53 − 0.1x | to 0 and 1 decimal places, respectively |
| Coefficient of determination of the linear regression: R² | 0.004 | to 3 decimal places |
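The quantities in the table can be computed for any pair of coordinate arrays. The following is a minimal sketch, assuming NumPy is available; `summary_statistics` is a hypothetical helper name, not part of any Datasaurus package. It returns the sample standard deviation (the square root of the sample variance) and derives the least-squares line from the correlation and the two standard deviations.

```python
import numpy as np

def summary_statistics(x, y):
    """Summary statistics of paired data (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sd_x = x.std(ddof=1)              # sample standard deviation (n - 1 denominator)
    sd_y = y.std(ddof=1)
    r = np.corrcoef(x, y)[0, 1]       # Pearson correlation between x and y
    slope = r * sd_y / sd_x           # least-squares slope of y on x
    intercept = y.mean() - slope * x.mean()
    return {
        "n": x.size,
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "sd_x": sd_x,
        "sd_y": sd_y,
        "correlation": r,
        "slope": slope,
        "intercept": intercept,
        "r_squared": r ** 2,          # for simple linear regression, R^2 = r^2
    }
```

Applying such a function to each of the thirteen data sets and rounding the results is one way to check the agreement reported in the table; it says nothing about the shape of the data, which is the point the data sets are meant to make.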
Figure: The thirteen data sets in the Datasaurus Dozen, visualized and summarized.

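The thirteen data sets are labeled away, bullseye, circle, dino, dots, h_lines, high_lines, slant_down, slant_up, star, v_lines, wide_lines, and x_shape.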

Like Anscombe's quartet, the Datasaurus dozen was designed to illustrate the importance of examining a data set graphically before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic data sets. [2] [3] [4] [5] [1] [6]

Creation

Figure: The dinosaur data set created by Alberto Cairo that inspired the creation of the Datasaurus Dozen.

The first data set, in the shape of a Tyrannosaurus, which inspired the rest of the "datasaurus" data sets, was constructed in 2016 by Alberto Cairo. [7] [8] Maarten Lambrechts proposed that this data set also be called "Anscombosaurus". [7]

This data set was then accompanied by twelve other data sets created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike Anscombe's quartet, for which it is not known how the data sets were generated, [9] these data sets were made with simulated annealing. Starting from the initial data set, the authors repeatedly applied small random changes to the points, biased towards the desired shape, and kept a change only if the summary statistics remained the same. Each shape took 200,000 iterations of perturbations to complete. [1]

The pseudocode for this algorithm is as follows:

    current_ds ← initial_ds
    for x iterations, do:
        test_ds ← perturb(current_ds, temp)
        if similar_enough(test_ds, initial_ds):
            current_ds ← test_ds

    function perturb(ds, temp):
        loop:
            test ← move_random_points(ds)
            if fit(test) > fit(ds) or temp > random():
                return test

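where

- initial_ds is the seed data set,
- current_ds is the latest version of the data set,
- temp is the temperature of the simulated annealing algorithm,
- fit() measures how closely a data set matches the desired shape,
- similar_enough() checks whether the summary statistics of two data sets are close enough,
- move_random_points() moves one or more points by a small random amount.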
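The pseudocode above can be turned into a small runnable sketch. This is an illustration under stated assumptions, not the authors' implementation: the target shape is a circle, `fit` is the negative mean distance of the points to that circle, statistics are compared after truncation to two decimal places, one point is moved per proposal, and the temperature cools linearly from 0.4 to 0.01. All function names, parameters, and default values here are hypothetical choices.

```python
import numpy as np

def summary(ds):
    """Return [mean x, mean y, sd x, sd y, correlation] of an (n, 2) array."""
    x, y = ds[:, 0], ds[:, 1]
    return np.array([x.mean(), y.mean(), x.std(ddof=1), y.std(ddof=1),
                     np.corrcoef(x, y)[0, 1]])

def similar_enough(test_ds, initial_ds, decimals=2):
    """True if every statistic matches after truncation to `decimals` places."""
    scale = 10 ** decimals
    return np.array_equal(np.trunc(summary(test_ds) * scale),
                          np.trunc(summary(initial_ds) * scale))

def fit(ds, center, radius):
    """Higher is better: negative mean distance of the points to the circle."""
    d = np.hypot(ds[:, 0] - center[0], ds[:, 1] - center[1])
    return -np.abs(d - radius).mean()

def perturb(ds, temp, rng, center, radius, step=0.1):
    """Move one random point; accept if the fit improves or by chance `temp`."""
    while True:
        test = ds.copy()
        i = rng.integers(len(ds))
        test[i] += rng.normal(scale=step, size=2)
        if fit(test, center, radius) > fit(ds, center, radius) or temp > rng.random():
            return test

def anneal(initial_ds, iterations, rng, center, radius,
           max_temp=0.4, min_temp=0.01):
    """Nudge the points towards the circle while preserving the statistics."""
    current = initial_ds.copy()
    for i in range(iterations):
        # Linear cooling schedule (an assumption made for this sketch).
        temp = max_temp + (min_temp - max_temp) * i / iterations
        test = perturb(current, temp, rng, center, radius)
        if similar_enough(test, initial_ds):
            current = test
    return current
```

A call such as `anneal(initial, 3000, np.random.default_rng(0), center=(54.0, 48.0), radius=30.0)`, where `initial` is an (n, 2) float array, returns a data set whose truncated statistics equal those of `initial`. Because a candidate is kept only when `similar_enough` holds, the statistics are preserved by construction; how closely the result resembles the target depends on the number of iterations and the step size.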

References

  1. Matejka, Justin; Fitzmaurice, George (2017-05-02). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" (PDF). Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. CHI '17. New York, NY, USA: Association for Computing Machinery. pp. 1290–1294. doi:10.1145/3025453.3025912. ISBN 978-1-4503-4655-9. Archived from the original on 2017-05-02.
  2. Elert, Glenn (2021). "Linear Regression - Practice". The Physics Hypertextbook.
  3. Janert, Philipp K. (2010). Data Analysis with Open Source Tools. O'Reilly Media. pp. 65–66. ISBN 978-0-596-80235-6.
  4. Chatterjee, Samprit; Hadi, Ali S. (2006). Regression Analysis by Example. John Wiley and Sons. p. 91. ISBN 0-471-74696-7.
  5. Saville, David J.; Wood, Graham R. (1991). Statistical Methods: The geometric approach. Springer. p. 418. ISBN 0-387-97517-9.
  6. Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN 0-9613921-4-2.
  7. Cairo, Alberto. "Download the Datasaurus: Never trust summary statistics alone; always visualize your data". Archived from the original on 2024-06-20. Retrieved 2024-02-01.
  8. Murtagh, Jack (2024-02-01). "What This Graph of a Dinosaur Can Teach Us about Doing Better Science". Scientific American. Retrieved 2024-03-08.
  9. Chatterjee, Sangit; Firat, Aykut (2007). "Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset". The American Statistician. 61 (3): 248–254. doi:10.1198/000313007X220057. JSTOR 27643902. S2CID 121163371.