Probability mass function

Figure: The graph of a probability mass function. All the values of this function must be non-negative and sum up to 1.

In probability and statistics, a probability mass function (sometimes called probability function or frequency function[1]) is a function that gives the probability that a discrete random variable is exactly equal to some value.[2] Sometimes it is also known as the discrete probability density function. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.


A probability mass function differs from a probability density function (PDF) in that the latter is associated with continuous rather than discrete random variables. A PDF must be integrated over an interval to yield a probability.[3]

The value of the random variable having the largest probability mass is called the mode.

Formal definition

Probability mass function is the probability distribution of a discrete random variable, and provides the possible values and their associated probabilities. It is the function $p \colon \mathbb{R} \to [0, 1]$ defined by

$$p_X(x) = P(X = x)$$

for $-\infty < x < \infty$,[3] where $P$ is a probability measure. $p_X(x)$ can also be simplified as $p(x)$.[4]

The probabilities associated with all (hypothetical) values must be non-negative and sum up to 1,

$$\sum_x p_X(x) = 1$$

and

$$p_X(x) \geq 0.$$
Thinking of probability as mass helps to avoid mistakes, since the physical mass is conserved just as the total probability over all hypothetical outcomes $x$ is conserved.
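
For instance, for a fair six-sided die (a standard illustrative example, not taken from the cited sources), the probability mass function assigns equal mass to each face, and both conditions are easy to verify:

$$p_X(x) = \begin{cases} 1/6 & x \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{otherwise,} \end{cases} \qquad \sum_x p_X(x) = 6 \cdot \tfrac{1}{6} = 1, \qquad p_X(x) \geq 0.$$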

Measure theoretic formulation

A probability mass function of a discrete random variable $X$ can be seen as a special case of two more general measure-theoretic constructions: the distribution of $X$ and the probability density function of $X$ with respect to the counting measure. We make this more precise below.

Suppose that $(A, \mathcal{A}, P)$ is a probability space and that $(B, \mathcal{B})$ is a measurable space whose underlying σ-algebra is discrete, so that $\mathcal{B}$ in particular contains singleton sets of $B$. In this setting, a random variable $X \colon A \to B$ is discrete provided its image is countable. The pushforward measure $X_*(P)$, called the distribution of $X$ in this context, is a probability measure on $B$ whose restriction to singleton sets induces the probability mass function (as mentioned in the previous section) $f_X \colon B \to \mathbb{R}$, since $f_X(b) = P(X^{-1}(b)) = P(X = b)$ for each $b \in B$.

Now suppose that $(B, \mathcal{B}, \mu)$ is a measure space equipped with the counting measure $\mu$. The probability density function $f$ of $X$ with respect to the counting measure, if it exists, is the Radon–Nikodym derivative of the pushforward measure of $X$ (with respect to the counting measure), so $f = dX_*P / d\mu$, and $f$ is a function from $B$ to the non-negative reals. As a consequence, for any $b \in B$ we have

$$P(X = b) = P(X^{-1}(b)) = \int_{X^{-1}(b)} dP = \int_{\{b\}} f \, d\mu = f(b),$$

demonstrating that $f$ is in fact a probability mass function.
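
As a concrete illustration (not part of the cited sources), take $B = \{0, 1\}$ with the counting measure $\mu$ and let $X$ be a Bernoulli variable with parameter $p$, so that $f(1) = p$ and $f(0) = 1 - p$. Integrating against the counting measure simply sums the values of $f$:

$$\int_B f \, d\mu = f(0) + f(1) = (1 - p) + p = 1.$$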

When there is a natural order among the potential outcomes $x$, it may be convenient to assign numerical values to them (or $n$-tuples in case of a discrete multivariate random variable) and to consider also values not in the image of $X$. That is, $f_X$ may be defined for all real numbers, with $f_X(x) = 0$ for all $x \notin X(S)$, as shown in the figure.

The image of $X$ has a countable subset on which the probability mass function $f_X(x)$ sums to one. Consequently, the probability mass function is zero for all but a countable number of values of $x$.
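
A minimal sketch of this convention in Python, again using the illustrative fair-die distribution: the function accepts any real argument and returns zero off the support.

```python
def die_pmf(x: float) -> float:
    """PMF of a fair six-sided die, defined for every real x and zero off the support."""
    return 1 / 6 if x in {1, 2, 3, 4, 5, 6} else 0.0

print(die_pmf(3))    # 0.1666...
print(die_pmf(3.5))  # 0.0, since 3.5 is not in the image of X
print(die_pmf(-2))   # 0.0
```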

The discontinuity of probability mass functions is related to the fact that the cumulative distribution function of a discrete random variable is also discontinuous. If $X$ is a discrete random variable, then $P(X = x) = 1$ means that the event $X = x$ is certain (it is true in 100% of the occurrences); on the contrary, $P(X = x) = 0$ means that the event $X = x$ is always impossible. This statement is not true for a continuous random variable $X$, for which $P(X = x) = 0$ for any possible $x$. Discretization is the process of converting a continuous random variable into a discrete one.
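
A short sketch, using the same illustrative die distribution, of how the cumulative distribution function of a discrete variable behaves: it is a step function, and the jump at each support point equals the probability mass there.

```python
support = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in support}   # illustrative fair-die PMF

def cdf(t: float) -> float:
    """F_X(t) = P(X <= t), a right-continuous step function."""
    return sum(p for x, p in pmf.items() if x <= t)

for x in support:
    jump = cdf(x) - cdf(x - 1e-9)   # height of the step at x
    print(x, round(jump, 6))        # each jump equals pmf[x] = 1/6
```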

Examples

Finite

There are three major distributions associated: the Bernoulli distribution, the binomial distribution, and the geometric distribution.
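
Their probability mass functions take the following standard forms, with success probability $p$ and, for the binomial case, $n$ independent trials:

$$\text{Bernoulli:}\quad p_X(k) = p^k (1 - p)^{1 - k}, \qquad k \in \{0, 1\}$$

$$\text{Binomial:}\quad p_X(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n$$

$$\text{Geometric:}\quad p_X(k) = (1 - p)^{k - 1} p, \qquad k = 1, 2, 3, \ldots$$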

Infinite

The following exponentially declining distribution is an example of a distribution with an infinite number of possible outcomes, namely all the positive integers:

$$P(X = i) = \frac{1}{2^i} \qquad \text{for } i = 1, 2, 3, \ldots$$

Despite the infinite number of possible outcomes, the total probability mass is $1/2 + 1/4 + 1/8 + \cdots = 1$, satisfying the unit total probability requirement for a probability distribution.
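
A quick numerical check of this series (an illustrative sketch, not part of the cited material): the partial sums of $1/2^i$ approach 1.

```python
total = 0.0
for i in range(1, 51):     # first 50 terms of P(X = i) = 1 / 2**i
    total += 1 / 2**i
print(total)               # very close to 1, as required
```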

Multivariate case

Two or more discrete random variables have a joint probability mass function, which gives the probability of each possible combination of realizations for the random variables.
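
For example (an illustrative sketch, not from the cited sources), the joint probability mass function of two independent fair coin flips assigns probability 1/4 to each of the four pairs, and summing the joint PMF over one variable recovers the marginal PMF of the other:

```python
from itertools import product

# Joint PMF of two independent fair coins: p(x, y) = p(x) * p(y) = 1/4
joint_pmf = {(x, y): 0.25 for x, y in product([0, 1], repeat=2)}

# Marginal PMF of the first coin: sum the joint PMF over the second variable
marginal_x = {x: sum(p for (a, _), p in joint_pmf.items() if a == x) for x in (0, 1)}
print(marginal_x)   # {0: 0.5, 1: 0.5}
```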

Related Research Articles

Expected value: Average value of a random variable

In probability theory, the expected value is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would "expect" to get in reality.
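
For a discrete random variable the weighted average is computed directly from the probability mass function, $\operatorname{E}[X] = \sum_x x \, p_X(x)$. A small sketch with the illustrative fair-die PMF gives the expected value 3.5, which is not itself a possible outcome:

```python
pmf = {x: 1 / 6 for x in range(1, 7)}                # fair six-sided die
expected_value = sum(x * p for x, p in pmf.items())
print(expected_value)                                # 3.5, not one of the faces
```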

Entropy (information theory): Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable $X$, which takes values in the alphabet $\mathcal{X}$ and is distributed according to $p \colon \mathcal{X} \to [0, 1]$, the entropy is

$$\mathrm{H}(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
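
For a discrete variable the entropy is computed from the probability mass function alone. As an illustrative sketch (not from the cited sources), a fair coin has an entropy of one bit, while a biased coin is less "surprising":

```python
import math

def entropy(pmf, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x), computed from a PMF given as a dict."""
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0 bit
print(entropy({"heads": 0.9, "tails": 0.1}))   # about 0.47 bits
```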

Probability theory: Branch of mathematics concerning probability

Probability theory or probability calculus is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms. Typically these axioms formalise probability in terms of a probability space, which assigns a measure taking values between 0 and 1, termed the probability measure, to a set of outcomes called the sample space. Any specified subset of the sample space is called an event.

Probability distribution: Mathematical function for the probability a given outcome occurs in an experiment

In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events.

Random variable: Variable representing a random phenomenon

A random variable is a mathematical formalization of a quantity or object which depends on random events. The term 'random variable' can be misleading as its mathematical definition is not actually random nor a variable, but rather it is a function from possible outcomes in a sample space to a measurable space, often to the real numbers.

Probability space: Mathematical concept

In probability theory, a probability space or a probability triple is a mathematical construct that provides a formal model of a random process or "experiment". For example, one can define a probability space which models the throwing of a die.

Probability density function: Function whose integral over a region describes the probability of an event occurring in that region

In probability theory, a probability density function (PDF), density function, or density of an absolutely continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.

Bernoulli process: Random process of binary (boolean) random variables

In probability and statistics, a Bernoulli process is a finite or infinite sequence of binary random variables, so it is a discrete-time stochastic process that takes only two values, canonically 0 and 1. The component Bernoulli variables $X_i$ are identically distributed and independent. Prosaically, a Bernoulli process is a repeated coin flipping, possibly with an unfair coin. Every variable $X_i$ in the sequence is associated with a Bernoulli trial or experiment. They all have the same Bernoulli distribution. Much of what can be said about the Bernoulli process can also be generalized to more than two outcomes; this generalization is known as the Bernoulli scheme.

Bernoulli distribution: Probability distribution modeling a coin toss which need not be fair

In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are boolean-valued: a single bit whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q. It can be used to represent a coin toss where 1 and 0 would represent "heads" and "tails", respectively, and p would be the probability of the coin landing on heads. In particular, unfair coins would have $p \neq 1/2$.

Logistic regression: Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.

In mathematics, the moments of a function are certain quantitative measures related to the shape of the function's graph. If the function represents mass density, then the zeroth moment is the total mass, the first moment is the center of mass, and the second moment is the moment of inertia. If the function is a probability distribution, then the first moment is the expected value, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis. The mathematical concept is closely related to the concept of moment in physics.

In probability theory and statistics, the conditional probability distribution is a probability distribution that describes the probability of an outcome given the occurrence of a particular event. Given two jointly distributed random variables and , the conditional probability distribution of given is the probability distribution of when is known to be a particular value; in some cases the conditional probabilities may be expressed as functions containing the unspecified value of as a parameter. When both and are categorical variables, a conditional probability table is typically used to represent the conditional probability. The conditional distribution contrasts with the marginal distribution of a random variable, which is its distribution without reference to the value of the other variable.
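
When both variables are discrete, the conditional probability mass function is obtained from the joint and marginal PMFs (a standard identity, stated here for illustration and valid whenever $p_X(x) > 0$):

$$p_{Y \mid X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}.$$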

In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. The underlying random variables may be random real numbers, or they may be random vectors, in which case the mixture distribution is a multivariate distribution.

Joint probability distribution: Type of probability distribution

Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered for any given number of random variables. The joint distribution encodes the marginal distributions, i.e. the distributions of each of the individual random variables and the conditional probability distributions, which deal with how the outputs of one random variable are distributed when given information on the outputs of the other random variable(s).

This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics and Glossary of experimental design.

In mathematics, the Bernoulli scheme or Bernoulli shift is a generalization of the Bernoulli process to more than two possible outcomes. Bernoulli schemes appear naturally in symbolic dynamics, and are thus important in the study of dynamical systems. Many important dynamical systems exhibit a repellor that is the product of the Cantor set and a smooth manifold, and the dynamics on the Cantor set are isomorphic to that of the Bernoulli shift. This is essentially the Markov partition. The term shift is in reference to the shift operator, which may be used to study Bernoulli schemes. The Ornstein isomorphism theorem shows that Bernoulli shifts are isomorphic when their entropy is equal.

In probability theory, random element is a generalization of the concept of random variable to more complicated spaces than the simple real line. The concept was introduced by Maurice Fréchet (1948) who commented that the “development of probability theory and expansion of area of its applications have led to necessity to pass from schemes where (random) outcomes of experiments can be described by number or a finite set of numbers, to schemes where outcomes of experiments represent, for example, vectors, functions, processes, fields, series, transformations, and also sets or collections of sets.”

This article discusses how information theory is related to measure theory.

In probability theory and statistics, a categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution. The K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

In probability theory and statistics, the law of the unconscious statistician, or LOTUS, is a theorem which expresses the expected value of a function g(X) of a random variable X in terms of g and the probability distribution of X.
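
In the discrete case the theorem states that the expectation of $g(X)$ can be computed by weighting $g$ by the probability mass function of $X$, without first deriving the distribution of $g(X)$:

$$\operatorname{E}[g(X)] = \sum_x g(x) \, p_X(x).$$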

References

  1. "7.2 - Probability Mass Functions". STAT 414. Penn State, Eberly College of Science.
  2. Stewart, William J. (2011). Probability, Markov Chains, Queues, and Simulation: The Mathematical Basis of Performance Modeling. Princeton University Press. p. 105. ISBN 978-1-4008-3281-1.
  3. Dekking, Michel; et al. (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. London: Springer. ISBN 978-1-85233-896-1. OCLC 262680588.
  4. Rao, Singiresu S. (1996). Engineering Optimization: Theory and Practice (3rd ed.). New York: Wiley. ISBN 0-471-55034-5. OCLC 62080932.
