Discriminative model

Discriminative models, also referred to as conditional models, are a class of models frequently used for classification. They are typically used to solve binary classification problems, i.e. assign labels, such as pass/fail, win/lose, alive/dead or healthy/sick, to existing datapoints.

Types of discriminative models include logistic regression (LR), conditional random fields (CRFs) and decision trees, among many others. Generative model approaches, which use a joint probability distribution instead, include naive Bayes classifiers, Gaussian mixture models, variational autoencoders, generative adversarial networks and others.

Definition

Unlike generative modelling, which studies the joint probability P(x, y), discriminative modeling studies the conditional probability P(y | x), or maps the given unobserved variable (target) y to a class label depending on the observed variables (training samples) x. For example, in object recognition, x is likely to be a vector of raw pixels (or features extracted from the raw pixels of the image). Within a probabilistic framework, this is done by modeling the conditional probability distribution P(y | x), which can be used for predicting y from x. Note that there is still a distinction between the conditional model and the discriminative model, though the two are often simply categorised together as discriminative models.

Pure discriminative model vs. conditional model

A conditional model models the conditional probability distribution P(y | x), while the traditional discriminative model aims to optimize a direct mapping from the input to the most similar trained samples. [1]

Typical discriminative modelling approaches

The following approach is based on the assumption that we are given the training data set D = {(x_i, y_i)} for i = 1, …, n, where y_i is the corresponding output for the input x_i. [2]

Linear classifier

We intend to use the function f(x) to simulate the behavior of what we observed from the training data set by the linear classifier method. Using the joint feature vector φ(x, y), the decision function is defined as:

f(x; w) = arg max_y w^T φ(x, y)
According to Memisevic's interpretation, [2] the score w^T φ(x, y) measures the compatibility of the input x with the potential output y, and the arg max then determines the class with the highest score.
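
To make the decision function concrete, the following is a minimal Python sketch (not taken from the cited sources) of the arg-max linear classifier above, assuming a simple "block" joint feature map that copies the input features into the slot of the candidate class; the names joint_feature and predict are hypothetical.

```python
import numpy as np

def joint_feature(x, y, n_classes):
    """Joint feature vector phi(x, y): x copied into the block for class y."""
    d = x.shape[0]
    phi = np.zeros(n_classes * d)
    phi[y * d:(y + 1) * d] = x
    return phi

def predict(x, w, n_classes):
    """Return the class whose compatibility score w . phi(x, y) is highest."""
    scores = [w @ joint_feature(x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))

# Example with 2 classes and 3 features; w would normally be learned.
w = np.array([1.0, -0.5, 0.2, -1.0, 0.5, 0.3])
print(predict(np.array([0.4, 1.2, -0.7]), w, n_classes=2))
```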

Logistic regression (LR)

Since the 0-1 loss function is commonly used in decision theory, the conditional probability distribution P(y | x; θ), where θ is a parameter vector estimated from the training data, can be modelled as follows for the logistic regression model:

P(y | x; θ) = (1 / Z(x; θ)) exp(θ^T φ(x, y)), with

Z(x; θ) = Σ_{y′} exp(θ^T φ(x, y′))

The equation above represents logistic regression. A major distinction between models is their way of introducing the posterior probability, which here is inferred from the parametric model. We can then estimate the parameters by maximizing the log-likelihood:

θ̂ = arg max_θ Σ_i log P(y_i | x_i; θ)

This can equivalently be written as minimizing the log-loss below:

ℓ(θ) = −Σ_i log P(y_i | x_i; θ)

Since the log-loss is differentiable, a gradient-based method can be used to optimize the model, and a global optimum is guaranteed because the objective function is convex. The gradient of the log-likelihood is represented by:

∂/∂θ Σ_i log P(y_i | x_i; θ) = Σ_i ( φ(x_i, y_i) − E_{P(y | x_i; θ)}[φ(x_i, y)] )

where E_{P(y | x_i; θ)}[φ(x_i, y)] is the expectation of φ(x_i, y) under the model's conditional distribution P(y | x_i; θ).

The above method provides efficient computation when the number of candidate classes is relatively small, since the expectation requires summing over all possible outputs.
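
As an illustration, here is a minimal full-batch gradient-ascent sketch (synthetic data, illustrative only) of the training procedure above, reusing the hypothetical joint_feature map from the earlier sketch; it directly implements the gradient Σ_i (φ(x_i, y_i) − E[φ(x_i, y)]).

```python
import numpy as np

def fit_logistic(X, Y, n_classes, lr=0.1, n_iter=300):
    """Maximize sum_i log P(y_i | x_i; theta) by full-batch gradient ascent."""
    n, d = X.shape
    theta = np.zeros(n_classes * d)
    for _ in range(n_iter):
        grad = np.zeros_like(theta)
        for x, y in zip(X, Y):
            scores = np.array([theta @ joint_feature(x, k, n_classes)
                               for k in range(n_classes)])
            p = np.exp(scores - scores.max())   # softmax, numerically stable
            p /= p.sum()                        # p(k | x; theta)
            grad += joint_feature(x, y, n_classes)        # empirical phi
            for k in range(n_classes):                    # minus expected phi
                grad -= p[k] * joint_feature(x, k, n_classes)
        theta += lr * grad / n
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = (X[:, 0] > 0).astype(int)    # toy labels determined by the first feature
theta = fit_logistic(X, Y, n_classes=2)
```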

Contrast with generative model

Contrast in approaches

Let's say we are given the class labels y (classification) and the feature variables x as the training samples.

A generative model takes the joint probability P(x, y), where x is the input and y is the label, and predicts the most probable label ỹ for an unknown input x̃ using Bayes' theorem. [3]

Discriminative models, as opposed to generative models, do not allow one to generate samples from the joint distribution of observed and target variables. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance (in part because they have fewer variables to compute). [4] [5] [3] On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily support unsupervised learning. Application-specific details ultimately dictate the suitability of selecting a discriminative versus generative model.

Discriminative models and generative models also differ in how they introduce the posterior probability. [6] To attain the least expected loss, the misclassification rate should be minimized. In the discriminative model, the posterior probabilities P(y | x) are inferred from a parametric model, where the parameters come from the training data. Point estimates of the parameters are obtained by maximizing the likelihood or by computing a distribution over the parameters. On the other hand, since generative models focus on the joint probability, the class posterior probability is obtained via Bayes' theorem, which is

P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_{y′} P(x | y′) P(y′). [6]
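
For a concrete feel of the generative route, the following sketch (hypothetical one-dimensional Gaussian class-conditionals and assumed-known priors) computes the class posterior by Bayes' theorem exactly as in the formula above.

```python
from math import exp, pi, sqrt

def gauss_pdf(x, mu, sigma):
    """Normal density, used here as the class-conditional p(x|y)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

priors = {0: 0.6, 1: 0.4}                  # p(y), assumed known
params = {0: (0.0, 1.0), 1: (2.0, 1.0)}    # (mean, std) of p(x|y)

def posterior(x):
    joint = {y: gauss_pdf(x, *params[y]) * priors[y] for y in priors}
    evidence = sum(joint.values())         # p(x) = sum_y p(x|y) p(y)
    return {y: joint[y] / evidence for y in joint}

# At x = 1 the two likelihoods coincide, so the posterior equals the prior.
print(posterior(1.0))
```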

Advantages and disadvantages in application

In repeated experiments applying logistic regression and naive Bayes to binary classification tasks, discriminative learning resulted in lower asymptotic error, while the generative approach reached its (higher) asymptotic error faster. [3] However, in Ulusoy and Bishop's joint work, Comparison of Generative and Discriminative Techniques for Object Detection and Classification, they state that the above statement holds only when the model is appropriate for the data (i.e. the data distribution is correctly modeled by the generative model).
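
This effect can be reproduced in a few lines; the sketch below (synthetic data, scikit-learn estimators, illustrative only) tracks the test error of logistic regression and Gaussian naive Bayes as the training set grows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_test, y_test = X[4000:], y[4000:]        # hold out a fixed test set

for m in (50, 200, 1000, 4000):            # growing training sets
    lr = LogisticRegression(max_iter=1000).fit(X[:m], y[:m])
    nb = GaussianNB().fit(X[:m], y[:m])
    print(m, 1 - lr.score(X_test, y_test), 1 - nb.score(X_test, y_test))
```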

Advantages

Significant advantages of using discriminative modeling are:

  • Higher accuracy, which usually leads to better learning results
  • Allows simplification of the input and provides a direct approach to the posterior probability P(y | x)
  • Saves computational resources
  • Generates lower asymptotic errors

By comparison, generative modeling:

  • Takes all data into consideration, which can result in slower processing (a disadvantage)
  • Requires fewer training samples
  • Offers a flexible framework that can easily cooperate with other needs of the application

Disadvantages

  • Training method usually requires multiple numerical optimization techniques [1]
  • By definition, the discriminative model needs to combine multiple subtasks to solve a complex real-world problem [2]

Optimizations in applications

Since both ways of modeling have their own advantages and disadvantages, combining the two approaches can make a good model in practice. For example, in Marras' article A Joint Discriminative Generative Model for Deformable Model Construction and Classification, [7] he and his coauthors apply a combination of the two models to face classification and obtain higher accuracy than the traditional approach.

Similarly, Kelm [8] also proposed combining the two models for pixel classification in his article Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning.

When extracting discriminative features prior to clustering, principal component analysis (PCA), though commonly used, is not a necessarily discriminative approach. In contrast, linear discriminant analysis (LDA) is a discriminative one. [9] LDA provides an efficient way of mitigating the disadvantage listed above: a discriminative model needs a combination of multiple subtasks before classification, and LDA addresses this problem by reducing the dimensionality of the input.
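
As a brief illustration, the sketch below (scikit-learn, Iris data, illustrative only) uses LDA to project four input features onto two discriminative axes before any downstream classification or clustering step.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 axes
X_reduced = lda.fit_transform(X, y)               # 4 features -> 2 axes
print(X_reduced.shape)                            # (150, 2)
```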

Types

Examples of discriminative models include:

  • Logistic regression
  • Support-vector machines
  • Conditional random fields
  • Decision trees

See also

  • Generative model

Related Research Articles

Supervised learning

In machine learning, supervised learning (SL) is a paradigm where a model is trained using input objects and desired output values, which are often human-made labels. The training process builds a function that maps new data to expected output values. An optimal scenario will allow for the algorithm to accurately determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured via a generalization error.

Normal distribution

In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

f(x) = (1 / (σ √(2π))) exp( −(x − μ)² / (2σ²) )

The parameter μ is the mean or expectation of the distribution, while the parameter σ² is the variance. The standard deviation of the distribution is σ (sigma). A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.

Naive Bayes classifier

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.

Logit

In statistics, the logit function is the quantile function associated with the standard logistic distribution. It has many uses in data analysis and machine learning, especially in data transformations.

Logistic regression

In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.

In probability theory, the Borel–Kolmogorov paradox is a paradox relating to conditional probability with respect to an event of probability zero. It is named after Émile Borel and Andrey Kolmogorov.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. The resulting sequence of samples can be used to approximate the joint distribution; to approximate the marginal distribution of one of the variables, or some subset of the variables; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are used for clustering, under the name model-based clustering, and also for density estimation.

In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

  1. A generative model is a statistical model of the joint probability distribution on a given observable variable X and target variable Y. A generative model can be used to "generate" random instances (outcomes) of an observation x.
  2. A discriminative model is a model of the conditional probability of the target Y, given an observation x. It can be used to "discriminate" the value of the target variable Y, given an observation x.
  3. Classifiers computed without using a probability model are also referred to loosely as "discriminative".

Markov random field

In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model.

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.

In information theory, the cross-entropy between two probability distributions p and q, over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, where each trial has probability of success p. In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.

Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.

In statistics, ordinal regression, also called ordinal classification, is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference, as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning.

Loss functions for classification

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems. Given X as the space of all possible inputs and Y = {−1, 1} as the set of labels, a typical goal of classification algorithms is to find a function f : X → Y which best predicts a label y for a given input x. However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same x to generate different y. As a result, the goal of the learning problem is to minimize expected loss, defined as

I[f] = ∫ V(f(x), y) p(x, y) dx dy

where V(f(x), y) is a given loss function and p(x, y) is the joint probability density of the inputs and labels.

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in the case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function β. Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

Variational autoencoder

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.

References

  1. Ballesteros, Miguel. "Discriminative Models" (PDF). Retrieved October 28, 2018.
  2. Memisevic, Roland (December 21, 2006). "An introduction to structured discriminative learning". Retrieved October 29, 2018.
  3. Ng, Andrew Y.; Jordan, Michael I. (2001). "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes". Advances in Neural Information Processing Systems.
  4. Singla, Parag; Domingos, Pedro (2005). "Discriminative Training of Markov Logic Networks". Proceedings of the 20th National Conference on Artificial Intelligence – Volume 2. AAAI'05. Pittsburgh, Pennsylvania: AAAI Press: 868–873. ISBN 978-1577352365.
  5. Lafferty, J.; McCallum, A.; Pereira, F. (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". In ICML.
  6. Ulusoy, Ilkay (May 2016). "Comparison of Generative and Discriminative Techniques for Object Detection and Classification" (PDF). Microsoft. Retrieved October 30, 2018.
  7. Marras, Ioannis (2017). "A Joint Discriminative Generative Model for Deformable Model Construction and Classification" (PDF). Retrieved 5 November 2018.
  8. Kelm, B. Michael. "Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning" (PDF). Archived from the original (PDF) on 17 July 2019. Retrieved 5 November 2018.
  9. Wang, Zhangyang (2015). "A Joint Optimization Framework of Sparse Coding and Discriminative Clustering" (PDF). Retrieved 5 November 2018.