Uncomfortable science

Uncomfortable science, as identified by statistician John Tukey,[1][2] comprises situations in which there is a need to draw an inference from a limited sample of data, where further samples influenced by the same cause system will not be available. More specifically, it involves the analysis of a finite natural phenomenon for which it is difficult to overcome the problem of using a common sample of data for both exploratory data analysis and confirmatory data analysis. This leads to the danger of systematic bias through testing hypotheses suggested by the data.

A typical example is Bode's law, which provides a simple numerical rule for the distances of the planets in the solar system from the Sun. Once the rule has been derived through trial-and-error matching of various rules against the observed data (exploratory data analysis), there are not enough planets remaining for a rigorous and independent test of the hypothesis (confirmatory data analysis). We have exhausted the natural phenomena. The agreement between the data and the numerical rule should be no surprise, as the rule was deliberately chosen to match the data. If we are concerned with what Bode's law tells us about the cause system of planetary distribution, we need independent confirmation, and that will not be available until better information about other planetary systems becomes available.
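The effect of choosing a rule to match the data can be illustrated with a small simulation. What follows is a minimal sketch and not part of Tukey's discussion: the function names, the sample of nine points, and the pool of twenty candidate rules are assumptions chosen for illustration. Everything is pure noise, so no rule has any real relationship to the observations. The exploratory step keeps whichever rule correlates best with the one available sample; the confirmatory step measures agreement on an independent sample, where an unrelated rule reduces to fresh noise.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def noise(rng, n):
    """n independent standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def simulate(trials=500, n_points=9, n_rules=20, seed=0):
    """Return (mean in-sample |r| of the selected rule, mean |r| on fresh data)."""
    rng = random.Random(seed)
    in_sample_total = 0.0
    fresh_total = 0.0
    for _ in range(trials):
        observed = noise(rng, n_points)
        rules = [noise(rng, n_points) for _ in range(n_rules)]
        # Exploratory step: keep the rule that best matches the single sample.
        in_sample_total += max(abs(pearson(rule, observed)) for rule in rules)
        # Confirmatory step: an unrelated rule evaluated on independent data
        # is just another pair of noise sequences.
        fresh_total += abs(pearson(noise(rng, n_points), noise(rng, n_points)))
    return in_sample_total / trials, fresh_total / trials


in_sample, fresh = simulate()
print(f"mean |r| on the sample used for selection: {in_sample:.2f}")
print(f"mean |r| on an independent sample:         {fresh:.2f}")
```

The selected rule looks far better on the sample that suggested it than on independent data, even though nothing real is being measured. In uncomfortable science the second, independent sample is exactly what is unavailable.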

Related Research Articles

Statistics: Study of the collection, analysis, interpretation, and presentation of data

Statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments. See glossary of probability and statistics.

The Titius–Bode law is a hypothesis that the bodies in some orbital systems, including the Sun's, orbit at semi-major axes that follow a function of planetary sequence. The formula suggests that, extending outward, each planet would be approximately twice as far from the Sun as the one before. The hypothesis correctly anticipated the orbits of Ceres and Uranus, but failed as a predictor of Neptune's orbit and was eventually superseded as a theory of Solar System formation. It is named after Johann Daniel Titius and Johann Elert Bode.
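The rule can be checked numerically. The sketch below assumes the commonly quoted form a = 0.4 + 0.3 × 2^k astronomical units, with the doubling term taken as zero for the innermost slot; that formula is not given in the text above, and the observed semi-major axes are rounded approximations supplied here for illustration. The names `titius_bode_au`, `BODIES`, and `relative_errors` are hypothetical.

```python
def titius_bode_au(k):
    """Predicted semi-major axis in AU under the assumed form 0.4 + 0.3 * 2**k.

    k=None marks the innermost slot, where the doubling term is taken as zero.
    """
    doubling = 0.0 if k is None else 0.3 * 2 ** k
    return 0.4 + doubling


# (body, sequence index k, approximate observed semi-major axis in AU)
BODIES = [
    ("Mercury", None, 0.39),
    ("Venus", 0, 0.72),
    ("Earth", 1, 1.00),
    ("Mars", 2, 1.52),
    ("Ceres", 3, 2.77),
    ("Jupiter", 4, 5.20),
    ("Saturn", 5, 9.58),
    ("Uranus", 6, 19.2),
    ("Neptune", 7, 30.1),
]


def relative_errors():
    """Relative error of the prediction for each body, keyed by name."""
    return {
        name: abs(titius_bode_au(k) - observed) / observed
        for name, k, observed in BODIES
    }


for name, error in relative_errors().items():
    print(f"{name:8s} relative error {error:.1%}")
```

Under these assumed values the rule stays within a few percent out to Uranus and misses Neptune by more than a quarter, which is consistent with the description above.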

Outlier: Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

The five-number summary is a set of descriptive statistics that provides information about a dataset. It consists of the five most important sample percentiles:

  1. the sample minimum
  2. the lower quartile or first quartile
  3. the median
  4. the upper quartile or third quartile
  5. the sample maximum
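The five values listed above can be computed with the Python standard library. This is a minimal sketch: quartile conventions differ between textbooks and software, and the choice of `method="inclusive"` in `statistics.quantiles` is an assumption made here, not something the text prescribes. The function name `five_number_summary` is hypothetical.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Needs at least two observations. Quartiles use the 'inclusive'
    interpolation rule, which is one convention among several.
    """
    values = sorted(data)
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (values[0], q1, q2, q3, values[-1])


print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
```

With a different quartile convention the two quartiles can shift slightly for small samples, while the minimum, median, and maximum stay the same.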

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions. Nonparametric statistics is based on either being distribution-free or having a specified distribution but with the distribution's parameters unspecified. Nonparametric statistics includes both descriptive statistics and statistical inference.

John Tukey: American mathematician

John Wilder Tukey was an American mathematician best known for the development of the Fast Fourier Transform (FFT) algorithm and the box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma all bear his name. He is also credited with coining the term 'bit'.

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.

Stem-and-leaf display: Plot where the ones digit is separated from the other digits, showing the distribution of the ones digit

A stem-and-leaf display or stem-and-leaf plot is a device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution. Such displays evolved from Arthur Bowley's work in the early 1900s, and are useful tools in exploratory data analysis. Stemplots became more commonly used in the 1980s after the publication of John Tukey's book on exploratory data analysis in 1977. Their popularity during those years is attributable to their use of monospaced (typewriter) typestyles, which allowed the computer technology of the time to produce the graphics easily. Modern computers' superior graphic capabilities have meant these techniques are less often used.
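A stem-and-leaf display is simple enough to produce as plain text. The sketch below assumes a non-empty collection of non-negative integers, with the ones digit as the leaf and the remaining digits as the stem; the function name `stem_and_leaf` and the `|` separator are choices made for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the display as a list of text lines, one per stem.

    Assumes a non-empty collection of non-negative integers. Stems with no
    observations are kept so the shape of the distribution stays visible.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    lowest, highest = min(leaves), max(leaves)
    width = len(str(highest))
    lines = []
    for stem in range(lowest, highest + 1):
        leaf_text = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>{width}} | {leaf_text}".rstrip())
    return lines


print("\n".join(stem_and_leaf([12, 15, 21, 27, 27, 34, 8])))
```

Because each leaf occupies the same width in a monospaced font, the length of each row works like the bar of a histogram while still showing the individual values.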

The Lunar and Planetary Institute (LPI) is a scientific research institute dedicated to study of the solar system, its formation, evolution, and current state. The Institute is part of the Universities Space Research Association (USRA) and is supported by the Science Mission Directorate of the National Aeronautics and Space Administration (NASA). Located at 3600 Bay Area Boulevard in Houston, Texas, the LPI maintains an extensive collection of lunar and planetary data, carries out education and public outreach programs, and offers meeting coordination and publishing services. The LPI sponsors and organizes several workshops and conferences throughout the year, including the Lunar and Planetary Science Conference (LPSC) held in March in the Houston area.

Data analysis: Activity for gaining insight from data

Data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

A research design is the set of methods and procedures used in collecting and analyzing measures of the variables specified in the research problem. The design of a study defines the study type and sub-type, research problem, hypotheses, independent and dependent variables, experimental design, and, if applicable, data collection methods and a statistical analysis plan. A research design is a framework that has been created to find answers to research questions.

Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization. This mapping establishes how data values will be represented visually, determining how and to what extent a property of a graphic mark, such as size or color, will change to reflect changes in the value of a datum.

Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning, that analyze current and historical facts to make predictions about future or otherwise unknown events.

In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample.
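For a sample, the MAD is the median of the absolute deviations from the sample median. A minimal sketch using the standard library follows; the function name is hypothetical, and the result is left unscaled, whereas some software multiplies it by a constant to make it comparable to a standard deviation.

```python
from statistics import median


def median_absolute_deviation(data):
    """Unscaled MAD: median of |x - median(data)| over a non-empty sequence."""
    center = median(data)
    return median(abs(x - center) for x in data)


sample = [1, 1, 2, 2, 4, 6, 9]
# The median is 2, the absolute deviations are 1, 1, 0, 0, 2, 4, 7,
# and the median of those deviations is 1.
print(median_absolute_deviation(sample))  # prints 1
```

Replacing the 9 in this sample with 900 leaves the result unchanged, which is the sense in which the measure is robust.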

In statistics, confirmatory factor analysis (CFA) is a special form of factor analysis, most commonly used in social research. It is used to test whether measures of a construct are consistent with a researcher's understanding of the nature of that construct. As such, the objective of confirmatory factor analysis is to test whether the data fit a hypothesized measurement model. This hypothesized model is based on theory and/or previous analytic research. CFA was first developed by Jöreskog and has built upon and replaced older methods of analyzing construct validity such as the MTMM Matrix as described in Campbell & Fiske (1959).

The median polish is a simple and robust exploratory data analysis procedure proposed by the statistician John Tukey. The purpose of median polish is to find an additively-fit model for data in a two-way layout table of the form row effect + column effect + overall median.
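The procedure can be sketched as alternating sweeps that move row medians and column medians out of the residuals and into the effects. The version below is one common formulation, written for illustration; the function name, the fixed number of sweeps, and the return layout are assumptions rather than a definitive statement of Tukey's procedure.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by repeated median sweeps.

    Returns (overall, row_eff, col_eff, residuals). At every step the identity
    table[i][j] == overall + row_eff[i] + col_eff[j] + residuals[i][j] holds.
    """
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Row sweep: move each row's median from the residuals to its effect.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        # Re-centre the column effects, shifting their median into overall.
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Column sweep: same idea, column by column.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        # Re-centre the row effects, shifting their median into overall.
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    return overall, row_eff, col_eff, resid


# A table that is exactly additive, so the residuals should vanish.
rows, cols = [0, 1, 5], [0, 2, 3, 7]
table = [[10 + r + c for c in cols] for r in rows]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
```

Because medians rather than means are swept out, a single wild cell ends up as a large residual instead of distorting the row and column effects.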

Theoretical planetology

Theoretical planetology, also known as theoretical planetary science, is a branch of planetary science that developed in the 20th century.

Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.

Optimal discriminant analysis (ODA) and the related classification tree analysis (CTA) are exact statistical methods that maximize predictive accuracy. For any specific sample and exploratory or confirmatory hypothesis, ODA identifies the statistical model that yields maximum predictive accuracy, assesses the exact Type I error rate, and evaluates potential cross-generalizability. ODA may be applied to one or more dimensions, with the one-dimensional case being referred to as UniODA and the multidimensional case being referred to as MultiODA. Classification tree analysis is a generalization of optimal discriminant analysis to non-orthogonal trees, and has more recently been called "hierarchical optimal discriminant analysis". Both methods may be used to find the combination of variables and cut points that best separates classes of objects or events. These variables and cut points may then be used to reduce dimensions and to build a statistical model that optimally describes the data.

Exploratory thought is an academic term used in the field of psychology to describe reasoning that neutrally considers multiple points of view and tries to anticipate all possible objections to, or flaws in, a particular position, with the goal of seeking truth. The opposite of exploratory thought is confirmatory thought, which is reasoning designed to construct justification supporting a specific point of view.

References

  1. Norel, R.; Rice, J. J.; Stolovitzky, G. (2011). "The self-assessment trap: Can we all be better than average?". Molecular Systems Biology. 7: 537. doi:10.1038/msb.2011.70. PMC 3261704. PMID 21988833.
  2. Hoaglin, D. C.; et al. (eds.). Exploring Data Tables, Trends, and Shapes. Wiley. ISBN 0-471-09776-4. Much of science also falls under John Tukey's label "uncomfortable science," because real repetition is not feasible or practical.