In statistics, a **frequency distribution** is a list, table or graph that displays the frequency of various outcomes in a sample.^{ [1] } Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval.

Here is an example of a univariate (single-variable) frequency table. It shows the frequency of each response to a survey question.

Rank | Degree of agreement | Number |
---|---|---|
1 | Strongly agree | 22 |
2 | Agree somewhat | 30 |
3 | Not sure | 20 |
4 | Disagree somewhat | 15 |
5 | Strongly disagree | 15 |

A different tabulation scheme aggregates values into bins such that each bin encompasses a range of values. For example, the heights of the students in a class could be organized into the following frequency table.

Height range | Number of students | Cumulative number |
---|---|---|
less than 5.0 feet | 25 | 25 |
5.0–5.5 feet | 35 | 60 |
5.5–6.0 feet | 20 | 80 |
6.0–6.5 feet | 20 | 100 |

A frequency distribution shows us a summarized grouping of data divided into mutually exclusive classes and the number of occurrences in a class. It is a way of showing unorganized data notably to show results of an election, income of people for a certain region, sales of a product within a certain period, student loan amounts of graduates, etc. Some of the graphs that can be used with frequency distributions are histograms, line charts, bar charts and pie charts. Frequency distributions are used for both qualitative and quantitative data.

- Decide the number of classes. Too many or too few classes might not reveal the basic shape of the data set, and such a frequency distribution is also difficult to interpret. The ideal number of classes may be estimated by Sturges' formula, *k* = 1 + 3.322 log_{10}(*n*), or by the square-root choice, *k* = √*n*, where *n* is the total number of observations in the data. (The latter will be much too large for large data sets such as population statistics.) However, these formulas are not a hard rule, and the number of classes they produce may not always suit the data being dealt with.
- Calculate the range of the data (Range = Max − Min) by finding the minimum and maximum data values. The range will be used to determine the class interval or class width.
- Decide the width of the classes, denoted by *h* and obtained by *h* = Range / Number of classes (assuming the class intervals are the same for all classes).

Generally the class interval or class width is the same for all classes. The classes all taken together must cover at least the distance from the lowest value (minimum) in the data to the highest (maximum) value. Equal class intervals are preferred in frequency distribution, while unequal class intervals (for example logarithmic intervals) may be necessary in certain situations to produce a good spread of observations between the classes and avoid a large number of empty, or almost empty classes.^{ [2] }

- Decide the individual class limits and select a suitable starting point for the first class; this choice is arbitrary, and the starting point may be less than or equal to the minimum value. Usually the first class is started before the minimum value, in such a way that the midpoint (the average of the lower and upper class limits of the first class) is properly placed.
- Take an observation and mark a vertical bar (|) for the class it belongs to. A running tally is kept until the last observation.
- Find the frequencies, relative frequencies, cumulative frequencies, etc. as required.
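The steps above can be sketched in code. This is a minimal illustration (the function name `frequency_table` is my own), using Sturges' rule for the number of classes and equal class widths:

```python
import math

def frequency_table(data, num_classes=None):
    """Tally observations into equal-width classes, as in the steps above."""
    n = len(data)
    if num_classes is None:
        # Sturges' rule: k = 1 + 3.322 * log10(n)
        num_classes = math.ceil(1 + 3.322 * math.log10(n))
    lo, hi = min(data), max(data)
    h = (hi - lo) / num_classes                      # class width
    counts = [0] * num_classes
    for x in data:
        # Place x in its class; the maximum value goes in the last class.
        i = min(int((x - lo) / h), num_classes - 1)
        counts[i] += 1
    # Each row: (lower limit, upper limit, frequency)
    return [(lo + i * h, lo + (i + 1) * h, c) for i, c in enumerate(counts)]

table = frequency_table([1, 2, 2, 3, 5, 7, 8, 9, 9, 10], num_classes=3)
```

With three classes over the range 1–10, the class width is 3, giving classes [1, 4), [4, 7) and [7, 10] with frequencies 4, 1 and 5.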

Bivariate joint frequency distributions are often presented as (two-way) contingency tables:

 | Dance | Sports | TV | Total |
---|---|---|---|---|
Men | 2 | 10 | 8 | 20 |
Women | 16 | 6 | 8 | 30 |
Total | 18 | 16 | 16 | 50 |

The total row and total column report the marginal frequencies or marginal distribution, while the body of the table reports the joint frequencies.^{ [3] }
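As a small illustration, the marginal frequencies can be recovered from the joint frequencies by summing across rows and columns; the table here uses the figures from the contingency table above:

```python
# Joint frequencies from the contingency table (rows: Men, Women).
joint = {
    "Men":   {"Dance": 2,  "Sports": 10, "TV": 8},
    "Women": {"Dance": 16, "Sports": 6,  "TV": 8},
}

# Row marginals: sum each row across the columns.
row_marginals = {r: sum(cols.values()) for r, cols in joint.items()}

# Column marginals: sum each column across the rows.
col_marginals = {}
for cols in joint.values():
    for c, f in cols.items():
        col_marginals[c] = col_marginals.get(c, 0) + f

grand_total = sum(row_marginals.values())
```

The row marginals (20 and 30), column marginals (18, 16, 16) and grand total (50) match the total row and total column of the table.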

Managing and operating on frequency tabulated data is much simpler than operation on raw data. There are simple algorithms to calculate median, mean, standard deviation etc. from these tables.
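One such simple algorithm: the mean and standard deviation can be computed directly from value–frequency pairs without expanding back to raw data. A sketch (the helper name is my own; this uses the population standard deviation):

```python
import math

def mean_std_from_frequencies(values, freqs):
    """Weighted mean and population standard deviation from a frequency table."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    var = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    return mean, math.sqrt(var)

# Value 1 seen twice, value 2 seen five times, value 3 seen three times.
m, s = mean_std_from_frequencies([1, 2, 3], [2, 5, 3])
```

For this table of ten observations the mean is 2.1 and the standard deviation is 0.7.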

Statistical hypothesis testing is founded on the assessment of differences and similarities between frequency distributions. This assessment involves measures of central tendency or averages, such as the mean and median, and measures of variability or statistical dispersion, such as the standard deviation or variance.

A frequency distribution is said to be skewed when its mean and median are significantly different, or more generally when it is asymmetric. The kurtosis of a frequency distribution is a measure of the proportion of extreme values (outliers), which appear at either end of the histogram. If the distribution is more outlier-prone than the normal distribution it is said to be leptokurtic; if less outlier-prone it is said to be platykurtic.

Letter frequency distributions are also used in frequency analysis to crack ciphers, and to compare the relative frequencies of letters in different languages, such as English, Greek, and Latin.

- ↑ Australian Bureau of Statistics, http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+frequency+distribution
- ↑ Manikandan, S (1 January 2011). "Frequency distribution". *Journal of Pharmacology & Pharmacotherapeutics*. **2** (1): 54–55. doi:10.4103/0976-500X.77120. ISSN 0976-500X. PMC 3117575. PMID 21701652.
- ↑ Stat Trek, Statistics and Probability Glossary, *s.v.* "Joint frequency".


A **descriptive statistic** is a summary statistic that quantitatively describes or summarizes features from a collection of information, while **descriptive statistics** is the process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics by its aim to summarize a sample, rather than to use the data to learn about the population that the sample is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory, and is frequently non-parametric. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in papers reporting on human subjects, typically a table is included giving the overall sample size, sample sizes in important subgroups, and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related co-morbidities, etc.

In statistics, an **estimator** is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished. For example, the sample mean is a commonly used estimator of the population mean.

A **histogram** is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often of equal size.

In descriptive statistics, the **interquartile range** (**IQR**), also called the **midspread**, **middle 50%**, or **H‑spread**, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = *Q*_{3} − *Q*_{1}. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a trimmed estimator, defined as the 25% trimmed range, and is a commonly used robust measure of scale.
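A minimal sketch of computing the IQR with the Python standard library (`statistics.quantiles` uses the "exclusive" interpolation method by default; other methods give slightly different quartiles):

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15]

# n=4 splits the data at the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1   # IQR = Q3 - Q1
```

For this sample the quartiles come out as 3.5, 8.0 and 12.5, giving an IQR of 9.0.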

In statistics and probability theory, the **median** is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of a "typical" value. Median income, for example, may be a better way to suggest what a "typical" income is, because income distribution can be very skewed. The median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data are contaminated, the median is not an arbitrarily large or small result.
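The robustness of the median can be illustrated with a small (invented) income sample: replacing one observation with an extreme value leaves the median unchanged while the mean is pulled far upward.

```python
import statistics

incomes = [30_000, 32_000, 35_000, 38_000, 40_000]
median_before = statistics.median(incomes)

# Contaminate a single observation with an extreme value (an outlier).
incomes[-1] = 10_000_000
median_after = statistics.median(incomes)
mean_after = statistics.mean(incomes)
```

The median stays at 35,000 before and after the contamination, while the mean jumps above two million.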

In statistics, a **quartile** is a type of quantile which divides the number of data points into four parts, or *quarters*, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic. The three main quartiles are as follows: the first quartile (*Q*_{1}), below which the lowest 25% of the data fall; the second quartile (*Q*_{2}), the median, which divides the ordered data set in half; and the third quartile (*Q*_{3}), below which 75% of the data fall.

In statistics and probability, **quantiles** are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created. Common quantiles have special names, such as quartiles, deciles, and percentiles. The groups created are termed halves, thirds, quarters, etc., though sometimes the terms for the quantile are used for the groups created, rather than for the cut points.

In statistics, the **standard deviation** is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

In statistics, an **outlier** is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

In descriptive statistics, a **box plot** or **boxplot** is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (*whiskers*) indicating variability outside the upper and lower quartiles, hence the terms **box-and-whisker plot** and **box-and-whisker diagram**. Outliers may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically, and take their name from the box in the middle of the plot.

In probability theory and statistics, the **coefficient of variation** (**CV**), also known as **relative standard deviation** (**RSD**), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage, and is defined as the ratio of the standard deviation to the mean. The CV or RSD is widely used in analytical chemistry to express the precision and repeatability of an assay. It is also commonly used in fields such as engineering or physics when doing quality assurance studies and ANOVA gauge R&R. In addition, CV is utilized by economists and investors in economic models.
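A minimal sketch of the definition (the function name is my own; this uses the sample standard deviation, so the result is the RSD of the sample):

```python
import statistics

def coefficient_of_variation(data):
    """CV = standard deviation / mean (sample standard deviation here)."""
    return statistics.stdev(data) / statistics.mean(data)

cv = coefficient_of_variation([10, 12, 8, 10, 10])
```

For this sample the mean is 10 and the sample standard deviation is √2 ≈ 1.414, so the CV is about 0.141, or 14.1% when expressed as a percentage.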

The **mode** is the value that appears most often in a set of data values. If *X* is a discrete random variable, the mode is the value *x* at which the probability mass function takes its maximum value; that is, it is the value that is most likely to be sampled.

The following is a glossary of terms used in the mathematical sciences of statistics and probability.

In statistics, the **mid-range** or **mid-extreme** of a set of statistical data values is the arithmetic mean of the maximum and minimum values in a data set, defined as: *M* = (max + min) / 2.
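The definition translates directly into code (the function name is my own):

```python
def mid_range(data):
    """Mid-range: the arithmetic mean of the maximum and minimum values."""
    return (max(data) + min(data)) / 2

mid_range([2, 4, 9])   # (9 + 2) / 2 = 5.5
```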

**Robust statistics** are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard-deviations; under this model, non-robust methods like a t-test work poorly.

In statistics, the **frequency** of an event is the number of times the observation occurred or was recorded in an experiment or study. These frequencies are often graphically represented in histograms.
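Counting frequencies of discrete outcomes is a one-liner with the standard library's `collections.Counter`, shown here on a made-up sample of categorical observations:

```python
from collections import Counter

observations = ["a", "b", "a", "c", "a", "b"]
freq = Counter(observations)   # frequency of each distinct outcome
```

`freq` maps each outcome to its count ("a" occurs 3 times, "b" twice, "c" once), and `freq.most_common(1)` returns the mode with its frequency.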

In statistics, the **sample maximum** and **sample minimum,** also called the **largest observation** and **smallest observation,** are the values of the greatest and least elements of a sample. They are basic summary statistics, used in descriptive statistics such as the five-number summary and Bowley's seven-figure summary and the associated box plot.

**Grouped data** are data formed by aggregating individual observations of a variable into groups, so that a frequency distribution of these groups serves as a convenient means of summarizing or analyzing the data. There are two major types of grouping: data binning of a single-dimensional variable, replacing individual numbers by counts in bins; and grouping multi-dimensional variables by some of the dimensions, obtaining the distribution of ungrouped dimensions.

**Cumulative frequency analysis** is the analysis of the frequency of occurrence of values of a phenomenon less than a reference value. The phenomenon may be time- or space-dependent. Cumulative frequency is also called *frequency of non-exceedance*.
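Cumulative frequencies are just running totals of the class frequencies. A minimal sketch (the function name is my own), applied to the class counts from the height table earlier in the article:

```python
def cumulative_frequencies(counts):
    """Running totals: cumulative frequency at each class upper bound."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out

# Class counts from the height frequency table: 25, 35, 20, 20 students.
cf = cumulative_frequencies([25, 35, 20, 20])
```

The result, 25, 60, 80, 100, reproduces the "Cumulative number" column of that table.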

**Univariate** is a term commonly used in statistics to describe a type of data which consists of observations on only a single characteristic or attribute. A simple example of univariate data would be the salaries of workers in industry. Like all the other data, univariate data can be visualized using graphs, images or other analysis tools after the data is measured, collected, reported, and analyzed.

This page is based on this Wikipedia article

Text is available under the CC BY-SA 4.0 license; additional terms may apply.

Images, videos and audio are available under their respective licenses.
