Statistical data type

Last updated

In statistics, data can have any of various types. Statistical data types include categorical (e.g. country), directional (angles or directions, e.g. wind measurements), count (a whole number of events), or real intervals (e.g. measures of temperature).

Contents

The data type is a fundamental concept in statistics and controls what sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific. For example, count data requires a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data require, but both fall under the same level of measurement (a ratio scale).

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in degree Celsius or degree Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.

Other categorizations have been proposed. For example, Mosteller and Tukey (1977) [1] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990) [2] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998), [3] van den Berg (1991). [4]

The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82). [5]

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logical possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.

Data Type
Possible valuesExample usage
Level of
measurement
Common

Distributions

Scale of
relative
differences
Permissible statisticsCommon model
0, 1 (arbitrary labels)binary outcome ("yes/no", "true/false", "success/failure", etc.) Bernoulli mode, chi-squared logistic, probit
"name1", "name2", "name3", ... "nameK" (arbitrary labels)categorical outcome with names or places like "Rome", "Amsterdam", "Madrid", "London", "Washington" (specific blood type, political party, word, etc.) categorical multinomial logit, multinomial probit
ordering categories or integer or real number (arbitrary scale)Ordering adverbs like "Small", "Medium", "Large", relative score, significant only for creating a ranking categorical
relative
comparison
ordinal regression (ordered logit, ordered probit)
0, 1, ..., Nnumber of successes (e.g. yes votes) out of N possible binomial, beta-binomial
additive
mean, median, mode, standard deviation, correlation binomial regression (logistic, probit)
nonnegative integers (0, 1, ...)number of items (telephone calls, people, molecules, births, deaths, etc.) in given interval/area/volume Poisson, negative binomial
multiplicative
All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation Poisson, negative binomial regression
real-valued
additive
real number temperature in degree Celsius or degree Fahrenheit, relative distance, location parameter, etc. (or approximately, anything not varying over a large scale) normal, etc. (usually symmetric about the mean)
additive
mean, median, mode, standard deviation, correlation standard linear regression
real-valued
multiplicative
positive real number temperature in kelvin, price, income, size, scale parameter, etc. (especially when varying over a large scale) log-normal, gamma, exponential, etc. (usually a skewed distribution)
multiplicative
All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation generalized linear model with logarithmic link

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

These concepts originate in various scientific fields and frequently overlap in usage. As a result, it is very often the case that multiple concepts could potentially be applied to the same problem.

Related Research Articles

<span class="mw-page-title-main">Probability distribution</span> Mathematical function for the probability a given outcome occurs in an experiment

In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events.

<span class="mw-page-title-main">Statistics</span> Study of the collection and analysis of data

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

<span class="mw-page-title-main">White noise</span> Type of signal in signal processing

In signal processing, white noise is a random signal having equal intensity at different frequencies, giving it a constant power spectral density. The term is used with this or similar meanings in many scientific and technical disciplines, including physics, acoustical engineering, telecommunications, and statistical forecasting. White noise refers to a statistical model for signals and signal sources, not to any specific signal. White noise draws its name from white light, although light that appears white generally does not have a flat power spectral density over the visible band.

Nonparametric statistics is a type of statistical analysis that makes minimal assumptions about the underlying distribution of the data being studied. Often these models are infinite-dimensional, rather than finite dimensional, as is parametric statistics. Nonparametric statistics can be used for descriptive statistics or statistical inference. Nonparametric tests are often used when the assumptions of parametric tests are evidently violated.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. This sequence can be used to approximate the joint distribution ; to approximate the marginal distribution of one of the variables, or some subset of the variables ; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and has since had a complex history, being adopted and extended in some disciplines and by some scholars, and criticized or rejected by others. Other classifications include those by Mosteller and Tukey, and by Chrisman.

In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly, each of the possible values of a categorical variable is referred to as a level. The probability distribution associated with a random categorical variable is called a categorical distribution.

<span class="mw-page-title-main">Mathematical statistics</span> Branch of statistics

Mathematical statistics is the application of probability theory, a branch of mathematics, to statistics, as opposed to techniques for collecting statistical data. Specific mathematical techniques which are used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure theory.

Binary data is data whose unit can take on only two possible states. These are often labelled as 0 and 1 in accordance with the binary numeral system and Boolean algebra.

When classification is performed by a computer, statistical methods are normally used to develop the algorithm.

This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics and Glossary of experimental design.

<span class="mw-page-title-main">Data transformation (statistics)</span> Application of a function to each point in a data set

In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set—that is, each data point zi is replaced with the transformed value yi = f(zi), where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.

In statistics, count data is a statistical data type describing countable quantities, data which can take only the counting numbers, non-negative integer values {0, 1, 2, 3, ...}, and where these integers arise from counting rather than ranking. The statistical treatment of count data is distinct from that of binary data, in which the observations can take only two values, usually represented by 0 and 1, and from ordinal data, which may also consist of integers but where the individual values fall on an arbitrary scale and only the relative ranking is important.

A correlation coefficient is a numerical measure of some type of linear correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.

<span class="mw-page-title-main">Bivariate analysis</span> Concept in statistical analysis

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them.

In statistics, linear regression is a model that estimates the linear relationship between a scalar response and one or more explanatory variables. A model with exactly one explanatory variable is a simple linear regression; a model with two or more explanatory variables is a multiple linear regression. This term is distinct from multivariate linear regression, which predicts multiple correlated dependent variables rather than a single dependent variable.

Univariate is a term commonly used in statistics to describe a type of data which consists of observations on only a single characteristic or attribute. A simple example of univariate data would be the salaries of workers in industry. Like all the other data, univariate data can be visualized using graphs, images or other analysis tools after the data is measured, collected, reported, and analyzed.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.

References

  1. Mosteller, F.; Tukey, J.W. (1977). Data analysis and regression. Addison-Wesley. ISBN   978-0-201-04854-4.
  2. Nelder, J.A. (1990). "The knowledge needed to computerise the analysis and interpretation of statistical information". Expert systems and artificial intelligence: the need for information about data. London: Library Association. OCLC   27042489.
  3. Chrisman, Nicholas R. (1998). "Rethinking Levels of Measurement for Cartography". Cartography and Geographic Information Science. 25 (4): 231–242. Bibcode:1998CGISy..25..231C. doi:10.1559/152304098782383043.
  4. van den Berg, G. (1991). Choosing an analysis method. Leiden: DSWO Press. ISBN   978-90-6695-062-7.
  5. Hand, D.J. (2004). Measurement theory and practice: The world through quantification. Wiley. p. 82. ISBN   978-0-470-68567-9.