**Grouped data** are data formed by aggregating individual observations of a variable into groups, so that a frequency distribution of these groups serves as a convenient means of summarizing or analyzing the data. There are two major types of grouping: data binning of a single-dimensional variable, replacing individual numbers by counts in bins; and grouping multi-dimensional variables by some of the dimensions (especially by independent variables), obtaining the distribution of ungrouped dimensions (especially the dependent variables).

The idea of grouped data can be illustrated by considering the following raw dataset, giving the response times (in seconds) of 20 students:

20 | 25 | 24 | 33 | 13 | 26 | 8 | 19 | 31 | 11 | 16 | 21 | 17 | 11 | 34 | 14 | 15 | 21 | 18 | 17

The above data can be grouped in order to construct a frequency distribution in any of several ways. One method is to use intervals as a basis.

The smallest value in the above data is 8 and the largest is 34. The interval from 8 to 34 is broken up into smaller subintervals (called *class intervals*). For each class interval, the number of data items falling in this interval is counted. This number is called the *frequency* of that class interval. The results are tabulated as a frequency table as follows:

Time taken (in seconds) | Frequency |
---|---|
5 ≤ t < 10 | 1 |
10 ≤ t < 15 | 4 |
15 ≤ t < 20 | 6 |
20 ≤ t < 25 | 4 |
25 ≤ t < 30 | 2 |
30 ≤ t < 35 | 3 |
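This counting step can be sketched in Python (an illustrative sketch; the data and bin edges are those of the example above):

```python
# Raw response times (in seconds) from the example above.
data = [20, 25, 24, 33, 13, 26, 8, 19, 31, 11,
        16, 21, 17, 11, 34, 14, 15, 21, 18, 17]

# Class boundaries: 5 <= t < 10, 10 <= t < 15, ..., 30 <= t < 35.
edges = [5, 10, 15, 20, 25, 30, 35]

# Count the number of data items falling in each class interval.
frequencies = {}
for lo, hi in zip(edges, edges[1:]):
    frequencies[(lo, hi)] = sum(lo <= t < hi for t in data)

for (lo, hi), f in frequencies.items():
    print(f"{lo} <= t < {hi}: {f}")
```

The counts reproduce the frequency column of the table.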

Another method of grouping the data is to use qualitative characteristics instead of numerical intervals. For example, suppose that in the above example there are three types of students: 1) below normal, if the response time is 5 to 14 seconds; 2) normal, if it is between 15 and 24 seconds; and 3) above normal, if it is 25 seconds or more. Then the grouped data look like:

Category | Frequency |
---|---|
Below normal | 5 |
Normal | 10 |
Above normal | 5 |
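The same grouping by qualitative label can be sketched as (cut-offs taken from the example above):

```python
from collections import Counter

data = [20, 25, 24, 33, 13, 26, 8, 19, 31, 11,
        16, 21, 17, 11, 34, 14, 15, 21, 18, 17]

def category(t):
    # Cut-offs from the example: 5-14 s, 15-24 s, 25 s or more.
    if t < 15:
        return "Below normal"
    elif t < 25:
        return "Normal"
    return "Above normal"

counts = Counter(category(t) for t in data)
print(counts)
```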

Yet another example of grouping the data is the use of commonly used numerical values, which are in fact "names" assigned to the categories. For example, consider the age distribution of the students in a class: the students may be 10, 11, or 12 years old. These are the age groups: 10, 11, and 12. Note that the students in age group 10 range from 10 years and 0 days to 10 years and 364 days old, and their average age is 10.5 years if we look at age on a continuous scale. The grouped data look like:

Age | Frequency |
---|---|
10 | 10 |
11 | 20 |
12 | 10 |

An estimate, $\bar{x}$, of the mean of the population from which the data are drawn can be calculated from the grouped data as:

$$\bar{x} = \frac{\sum f x}{\sum f}$$

In this formula, *x* refers to the midpoint of the class intervals, and *f* is the class frequency. Note that the result of this will be different from the sample mean of the ungrouped data. The mean for the grouped data in the above example can be calculated as follows:

Class Intervals | Frequency (*f*) | Midpoint (*x*) | *f x* |
---|---|---|---|
5 ≤ t < 10 | 1 | 7.5 | 7.5 |
10 ≤ t < 15 | 4 | 12.5 | 50 |
15 ≤ t < 20 | 6 | 17.5 | 105 |
20 ≤ t < 25 | 4 | 22.5 | 90 |
25 ≤ t < 30 | 2 | 27.5 | 55 |
30 ≤ t < 35 | 3 | 32.5 | 97.5 |
TOTAL | 20 | | 405 |

Thus, the mean of the grouped data is

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{405}{20} = 20.25$$
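The same calculation can be sketched in Python, using the midpoints and frequencies from the table:

```python
# Midpoints (x) and frequencies (f) of the six class intervals.
midpoints   = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5]
frequencies = [1, 4, 6, 4, 2, 3]

# Grouped-mean estimate: sum(f * x) / sum(f).
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
print(mean)  # 20.25
```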

The mean for the grouped data in the age example above can be calculated as follows:

Age Group | Frequency (*f*) | Midpoint (*x*) | *f x* |
---|---|---|---|
10 | 10 | 10.5 | 105 |
11 | 20 | 11.5 | 230 |
12 | 10 | 12.5 | 125 |
TOTAL | 40 | | 460 |

Thus, the mean of the grouped data is

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{460}{40} = 11.5$$

- Aggregate data
- Data binning
- Partition of a set
- Level of measurement
- Frequency distribution
- Discretization of continuous features
- Logistic regression § Minimum chi-squared estimator for grouped data


A **histogram** is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often of equal size.
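As an illustrative sketch (not any particular library's algorithm), equal-width binning can be written as:

```python
def histogram(values, bin_count):
    # Equal-width bins spanning [min, max]; the top edge is inclusive
    # so the maximum value is not dropped.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count
    counts = [0] * bin_count
    for v in values:
        i = min(int((v - lo) / width), bin_count - 1)
        counts[i] += 1
    return counts

counts = histogram([1, 2, 2, 3, 8, 9], 4)
print(counts)  # [3, 1, 0, 2]
```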

In statistics and probability theory, the **median** is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of a "typical" value. Median income, for example, may be a better way to suggest what a "typical" income is, because income distribution can be very skewed. The median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data are contaminated, the median is not an arbitrarily large or small result.
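A small illustration of this resistance, using hypothetical incomes:

```python
from statistics import mean, median

incomes = [30_000, 32_000, 35_000, 38_000, 40_000]
with_outlier = incomes + [1_000_000]

# One extreme value drags the mean far upward but barely moves the median.
print(mean(incomes), median(incomes))            # 35000, 35000
print(mean(with_outlier), median(with_outlier))  # mean jumps to ~195833; median only to 36500
```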

There are several kinds of **mean** in mathematics, especially in statistics.

A **normal distribution** is a probability distribution used to model phenomena that have a default behaviour and cumulative possible deviations from that behaviour. For instance, a proficient archer's arrows are expected to land around the bull's eye of the target; however, due to aggregating imperfections in the archer's technique, most arrows will miss the bull's eye by some distance. The average of this distance is known in archery as *accuracy*, while the amount of variation in the distances is known as *precision*. In the context of a normal distribution, accuracy and precision are referred to as the *mean* and the *standard deviation*, respectively. Thus, a narrow measure of an archer's proficiency can be expressed with two values: a mean and a standard deviation. In a normal distribution, these two values mean that there is a ~68% probability that an arrow will land within one standard deviation of the archer's average accuracy; a ~95% probability that it will land within two standard deviations; ~99.7% within three; and so on, slowly increasing towards 100%.
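These probabilities can be checked numerically: for a normal variable, P(|X − μ| < kσ) = erf(k/√2), which needs only the standard library:

```python
import math

def prob_within(k):
    # P(|X - mu| < k * sigma) for a normally distributed X.
    return math.erf(k / math.sqrt(2))

print(round(prob_within(1), 4))  # 0.6827
print(round(prob_within(2), 4))  # 0.9545
print(round(prob_within(3), 4))  # 0.9973
```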

In statistics, the **standard deviation** is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

The **weighted arithmetic mean** is similar to an ordinary arithmetic mean, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. The notion of weighted mean plays a role in descriptive statistics and also occurs in a more general form in several other areas of mathematics.
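As a sketch of the connection to the grouped data above: the grouped-data mean is exactly a weighted arithmetic mean of the class midpoints, with the class frequencies as weights:

```python
def weighted_mean(values, weights):
    # Each value contributes in proportion to its weight.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Class midpoints weighted by class frequencies (the example above).
wm = weighted_mean([7.5, 12.5, 17.5, 22.5, 27.5, 32.5], [1, 4, 6, 4, 2, 3])
print(wm)  # 20.25
```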

In probability and statistics, **Student's t-distribution** is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population's standard deviation is unknown. It was developed by English statistician William Sealy Gosset under the pseudonym "Student".

In statistics, the **Pearson correlation coefficient** ― also known as **Pearson's r** ― is a measure of linear correlation between two sets of data.

In statistics, the **standard score** is the number of standard deviations by which the value of a raw score is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.
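For example, standardizing a small hypothetical sample (using the sample mean and sample standard deviation):

```python
from statistics import mean, stdev

scores = [12, 15, 17, 20, 26]
mu, sd = mean(scores), stdev(scores)  # stdev: sample standard deviation

# z-scores: signed distance from the mean, in standard-deviation units.
z = [(x - mu) / sd for x in scores]
```

Scores above the mean get positive z-scores and those below it negative ones; the standardized values have mean 0 and (sample) standard deviation 1.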

In statistics, a **frequency distribution** is a list, table or graph that displays the frequency of various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval.

In frequentist statistics, a **confidence interval** (**CI**) is a range of estimates for an unknown parameter, defined as an interval with a lower bound and an upper bound. The interval is computed at a designated **confidence level**. The 95% confidence level is most common, but other levels are sometimes used. The confidence level represents the long-run frequency of confidence intervals that contain the true value of the parameter. In other words, 95% of confidence intervals computed at the 95% confidence level contain the parameter, and likewise for other confidence levels.
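A sketch of a normal-approximation 95% interval for the mean of the response-time data used earlier (for a sample of 20, a Student's t critical value, about 2.093, would strictly be more appropriate than 1.96):

```python
import math
from statistics import mean, stdev

sample = [20, 25, 24, 33, 13, 26, 8, 19, 31, 11,
          16, 21, 17, 11, 34, 14, 15, 21, 18, 17]

m = mean(sample)
sem = stdev(sample) / math.sqrt(len(sample))   # standard error of the mean
lower, upper = m - 1.96 * sem, m + 1.96 * sem  # normal-approximation 95% CI
print(lower, upper)
```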

In statistics and optimization, **errors** and **residuals** are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "true value". The **error** of an observation is the deviation of the observed value from the true value of a quantity of interest. The **residual** is the difference between the observed value and the *estimated* value of the quantity of interest. The distinction is most important in regression analysis, where the concepts are sometimes called the **regression errors** and **regression residuals** and where they lead to the concept of studentized residuals.

The **t-test** is any statistical hypothesis test in which the test statistic follows a Student's *t*-distribution under the null hypothesis.

The **standard error** (**SE**) of a statistic is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is the sample mean, it is called the **standard error of the mean** (**SEM**).
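A quick simulation sketch (hypothetical standard-normal population, σ = 1): the spread of many sample means should match σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is reproducible
n = 25

# Draw 2000 samples of size n and record each sample mean.
sample_means = [mean(random.gauss(0, 1) for _ in range(n))
                for _ in range(2000)]

# Should be close to sigma / sqrt(n) = 1 / 5 = 0.2.
print(stdev(sample_means))
```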

In probability theory and statistics, the **continuous uniform distribution** or **rectangular distribution** is a family of symmetric probability distributions. The distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds. The bounds are defined by the parameters *a* and *b*, which are the minimum and maximum values. The interval can either be closed or open. The distribution is therefore often abbreviated *U*(*a*, *b*), where U stands for uniform distribution. The difference between the bounds defines the interval length; all intervals of the same length on the distribution's support are equally probable. It is the maximum entropy probability distribution for a random variable *X* under no constraint other than that it is contained in the distribution's support.

In probability theory and directional statistics, the **von Mises distribution** is a continuous probability distribution on the circle. It is a close approximation to the wrapped normal distribution, which is the circular analogue of the normal distribution. A freely diffusing angle on a circle is a wrapped normally distributed random variable with an unwrapped variance that grows linearly in time. On the other hand, the von Mises distribution is the stationary distribution of a drift and diffusion process on the circle in a harmonic potential, i.e. with a preferred orientation. The von Mises distribution is the maximum entropy distribution for circular data when the real and imaginary parts of the first circular moment are specified. The von Mises distribution is a special case of the von Mises–Fisher distribution on the *N*-dimensional sphere.

This **glossary of statistics and probability** is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics.

In statistics, **simple linear regression** is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective *simple* refers to the fact that the outcome variable is related to a single predictor.

The **sample mean** and the **sample covariance** are statistics computed from a sample of data on one or more random variables.

In probability and statistics, the **Bates distribution**, named after Grace Bates, is the probability distribution of the mean of a number of statistically independent uniformly distributed random variables on the unit interval. This distribution is related to the uniform, the triangular, and the normal (Gaussian) distributions, and has applications in broadcast engineering for signal enhancement. The Bates distribution is sometimes confused with the Irwin–Hall distribution, which is the distribution of the **sum** of *n* independent random variables uniformly distributed from 0 to 1. Thus, the two distributions are simply rescaled versions of each other, as they differ only in scale.


This page is based on this Wikipedia article

Text is available under the CC BY-SA 4.0 license; additional terms may apply.

Images, videos and audio are available under their respective licenses.
