Knowledge distillation

Last updated April 21, 2024

In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have a higher knowledge capacity than small models, this capacity might not be fully utilized. A large model can be just as computationally expensive to evaluate even if it uses little of its knowledge capacity. Knowledge distillation transfers knowledge from a large model to a smaller model without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).^[1]

Concept of distillation

Transferring knowledge from a large model to a small one requires teaching the small model without loss of validity. If both models are trained on the same data, the small model may have insufficient capacity to learn a concise knowledge representation given the same computational resources and the same data as the large model. However, some information about a concise knowledge representation is encoded in the pseudolikelihoods that the large model assigns to its outputs: when a model correctly predicts a class, it assigns a large value to the output variable corresponding to that class, and smaller values to the other output variables. The distribution of values among the outputs for a record provides information on how the large model represents knowledge. Therefore, the goal of economical deployment of a valid model can be achieved by training only the large model on the data, exploiting its better ability to learn concise knowledge representations, and then distilling that knowledge into the smaller model, which would not be able to learn it on its own, by training it to reproduce the soft output of the large model.^[1]

An early example of distilling an artificial neural network into another network dates back to 1992, when Jürgen Schmidhuber compressed or collapsed a hierarchy of recurrent neural networks (RNNs) into a single RNN, by distilling a higher-level chunker network into a lower-level automatizer network.^[6]^[7] This facilitated downstream deep learning.

A related methodology for compressing the knowledge of multiple models into a single neural network was called model compression in 2006. Compression was achieved by training a smaller model on large amounts of pseudo-data labelled by a higher-performing ensemble, optimising to match the logit of the compressed model to the logit of the ensemble.^[8] Knowledge distillation is a generalisation of this approach, introduced by Geoffrey Hinton et al. in 2015,^[1] in a preprint that formulated the concept and showed some results achieved in the task of image classification.

Knowledge distillation is also related to the concept of behavioral cloning discussed by Faraz Torabi et al.^[9]

Formulation

Given a large model as a function of the vector variable $\mathbf {x}$, trained for a specific classification task, typically the final layer of the network is a softmax in the form

$$y_{i}(\mathbf {x} |t)={\frac {e^{\frac {z_{i}(\mathbf {x} )}{t}}}{\sum _{j}e^{\frac {z_{j}(\mathbf {x} )}{t}}}}$$

where $t$ is a parameter called temperature, which for a standard softmax is normally set to 1. The softmax operator converts the logit values $z_{i}(\mathbf {x} )$ to pseudo-probabilities, and higher values of temperature generate a softer distribution of pseudo-probabilities among the output classes. Knowledge distillation consists of training a smaller network, called the distilled model, on a dataset called the transfer set (which may differ from the dataset used to train the large model), using as loss function the cross-entropy between the output of the distilled model $\mathbf {y} (\mathbf {x} |t)$ and the output ${\hat {\mathbf {y} }}(\mathbf {x} |t)$ produced by the large model on the same record (or the average of the individual outputs, if the large model is an ensemble), with a high value of the softmax temperature $t$ for both models:^[1]

$$E(\mathbf {x} |t)=-\sum _{i}{\hat {y}}_{i}(\mathbf {x} |t)\log y_{i}(\mathbf {x} |t).$$

In this context, a high temperature increases the entropy of the output, and therefore provides more information to learn for the distilled model compared to hard targets, at the same time reducing the variance of the gradient between different records and therefore allowing higher learning rates.^[1]
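As a minimal illustrative sketch (not taken from the cited paper), the temperature-scaled softmax and the soft-target cross-entropy described above can be written in plain Python. The logit values and the helper names `softmax`, `entropy`, and `soft_cross_entropy` are assumptions chosen for illustration only.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax: y_i = exp(z_i / t) / sum_j exp(z_j / t)."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtracting the max does not change the result, only avoids overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def soft_cross_entropy(teacher_logits, student_logits, t):
    """E = -sum_i y_hat_i(t) * log y_i(t), both outputs taken at the same temperature."""
    y_hat = softmax(teacher_logits, t)
    y = softmax(student_logits, t)
    return -sum(yh * math.log(yi) for yh, yi in zip(y_hat, y))

# Hypothetical logits for a single record with three classes.
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]

sharp = softmax(teacher_logits, t=1.0)  # close to a hard (one-hot) target
soft = softmax(teacher_logits, t=4.0)   # softer target used for distillation
loss = soft_cross_entropy(teacher_logits, student_logits, t=4.0)

print("t=1:", [round(p, 4) for p in sharp], "entropy:", round(entropy(sharp), 4))
print("t=4:", [round(p, 4) for p in soft], "entropy:", round(entropy(soft), 4))
print("distillation loss at t=4:", round(loss, 4))
```

With these illustrative logits, raising the temperature moves probability mass toward the non-maximal classes, so the large model's output at $t=4$ has higher entropy than at $t=1$.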

If ground truth is available for the transfer set, the process can be strengthened by adding to the loss the cross-entropy between the output of the distilled model (computed with $t=1$) and the known label ${\bar {y}}$:

$$E(\mathbf {x} |t)=-t^{2}\sum _{i}{\hat {y}}_{i}(\mathbf {x} |t)\log y_{i}(\mathbf {x} |t)-\sum _{i}{\bar {y}}_{i}\log y_{i}(\mathbf {x} |1)$$

where the component of the loss with respect to the large model is weighted by a factor of $t^{2}$ since, as the temperature increases, the gradient of the loss with respect to the model weights scales by a factor of ${\frac {1}{t^{2}}}$.^[1]
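The combined loss can be sketched as follows. This is a hedged illustration, assuming a one-hot ground-truth label given as a class index and taking the hard-label term on the distilled model's output at $t=1$, as the text describes. The function name `distillation_loss` and the logit values are hypothetical.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, t):
    """t^2-weighted soft-target cross-entropy plus hard-label cross-entropy.

    `label` is the index of the true class (assumption: one-hot ground truth).
    """
    y_hat_t = softmax(teacher_logits, t)  # large model, temperature t
    y_t = softmax(student_logits, t)      # distilled model, temperature t
    y_1 = softmax(student_logits, 1.0)    # distilled model, temperature 1
    soft_term = -sum(yh * math.log(y) for yh, y in zip(y_hat_t, y_t))
    hard_term = -math.log(y_1[label])     # one-hot label keeps a single term of the sum
    return t ** 2 * soft_term + hard_term

# Hypothetical logits for one record whose true class is index 0.
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]
label = 0

total = distillation_loss(student_logits, teacher_logits, label, t=4.0)
hard_only = -math.log(softmax(student_logits, 1.0)[label])
print("combined loss:", round(total, 4), "hard-label term alone:", round(hard_only, 4))
```

In practice the two terms are often given separate weights as well. The sketch uses only the $t^{2}$ factor that appears in the formula above.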

Relationship with model compression

Under the assumption that the logits have zero mean, it is possible to show that model compression is a special case of knowledge distillation. The gradient of the knowledge distillation loss $E$ with respect to the logit of the distilled model $z_{i}$ is given by

$$
{\begin{aligned}
{\frac {\partial }{\partial z_{i}}}E&=-{\frac {\partial }{\partial z_{i}}}\sum _{j}{\hat {y}}_{j}\log y_{j}\\
&=-{\frac {\partial }{\partial z_{i}}}{\hat {y}}_{i}\log y_{i}+\left(-{\frac {\partial }{\partial z_{i}}}\sum _{k\neq i}{\hat {y}}_{k}\log y_{k}\right)\\
&=-{\hat {y}}_{i}{\frac {1}{y_{i}}}{\frac {\partial }{\partial z_{i}}}y_{i}+\sum _{k\neq i}\left(-{\hat {y}}_{k}\cdot {\frac {1}{y_{k}}}\cdot e^{\frac {z_{k}}{t}}\cdot \left(-{\frac {1}{\left(\sum _{j}e^{\frac {z_{j}}{t}}\right)^{2}}}\right)\cdot e^{\frac {z_{i}}{t}}\cdot {\frac {1}{t}}\right)\\
&=-{\hat {y}}_{i}{\frac {1}{y_{i}}}{\frac {\partial }{\partial z_{i}}}{\frac {e^{\frac {z_{i}}{t}}}{\sum _{j}e^{\frac {z_{j}}{t}}}}+\sum _{k\neq i}\left({\hat {y}}_{k}\cdot {\frac {1}{y_{k}}}\cdot y_{k}\cdot y_{i}\cdot {\frac {1}{t}}\right)\\
&=-{\hat {y}}_{i}{\frac {1}{y_{i}}}\left({\frac {{\frac {1}{t}}e^{\frac {z_{i}}{t}}\sum _{j}e^{\frac {z_{j}}{t}}-{\frac {1}{t}}\left(e^{\frac {z_{i}}{t}}\right)^{2}}{\left(\sum _{j}e^{\frac {z_{j}}{t}}\right)^{2}}}\right)+{\frac {y_{i}\sum _{k\neq i}{\hat {y}}_{k}}{t}}\\
&=-{\hat {y}}_{i}{\frac {1}{y_{i}}}\left({\frac {y_{i}}{t}}-{\frac {y_{i}^{2}}{t}}\right)+{\frac {y_{i}(1-{\hat {y}}_{i})}{t}}\\
&={\frac {1}{t}}\left(y_{i}-{\hat {y}}_{i}\right)\\
&={\frac {1}{t}}\left({\frac {e^{\frac {z_{i}}{t}}}{\sum _{j}e^{\frac {z_{j}}{t}}}}-{\frac {e^{\frac {{\hat {z}}_{i}}{t}}}{\sum _{j}e^{\frac {{\hat {z}}_{j}}{t}}}}\right)
\end{aligned}}
$$

where ${\hat {z}}_{i}$ are the logits of the large model. For large values of $t$, using the first-order expansion $e^{x}\approx 1+x$, this can be approximated as

$${\frac {1}{t}}\left({\frac {1+{\frac {z_{i}}{t}}}{N+\sum _{j}{\frac {z_{j}}{t}}}}-{\frac {1+{\frac {{\hat {z}}_{i}}{t}}}{N+\sum _{j}{\frac {{\hat {z}}_{j}}{t}}}}\right)$$

and under the zero-mean hypothesis $\sum _{j}z_{j}=\sum _{j}{\hat {z}}_{j}=0$ it becomes ${\frac {z_{i}-{\hat {z}}_{i}}{Nt^{2}}}$, where $N$ is the number of classes. Up to the constant factor ${\frac {1}{Nt^{2}}}$, this is the derivative of ${\frac {1}{2}}\left(z_{i}-{\hat {z}}_{i}\right)^{2}$ with respect to $z_{i}$, i.e. the loss is equivalent to matching the logits of the two models, as done in model compression.^[1]
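The derivation can be checked numerically. The sketch below, with hypothetical zero-mean logits, compares the closed-form gradient $(y_{i}-{\hat {y}}_{i})/t$ against a central finite difference of the loss, and then compares it at a high temperature against the logit-matching expression $(z_{i}-{\hat {z}}_{i})/(Nt^{2})$. All function and variable names are assumptions made for this illustration.

```python
import math

def softmax(logits, t):
    """Temperature-scaled softmax."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss(z, z_hat, t):
    """E = -sum_j y_hat_j * log y_j at temperature t."""
    y, y_hat = softmax(z, t), softmax(z_hat, t)
    return -sum(yh * math.log(yi) for yh, yi in zip(y_hat, y))

def analytic_grad(z, z_hat, t):
    """Closed-form gradient dE/dz_i = (y_i - y_hat_i) / t."""
    y, y_hat = softmax(z, t), softmax(z_hat, t)
    return [(yi - yh) / t for yi, yh in zip(y, y_hat)]

def numeric_grad(z, z_hat, t, eps=1e-6):
    """Central finite-difference estimate of dE/dz_i."""
    grads = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        diff = soft_loss(z_plus, z_hat, t) - soft_loss(z_minus, z_hat, t)
        grads.append(diff / (2 * eps))
    return grads

# Hypothetical logits; both vectors sum to zero (the zero-mean hypothesis).
z = [1.0, -0.5, -0.5]     # distilled model
z_hat = [2.0, 0.0, -2.0]  # large model
n = len(z)

# 1) The closed form agrees with finite differences at a moderate temperature.
grad_exact_t2 = analytic_grad(z, z_hat, 2.0)
grad_numeric_t2 = numeric_grad(z, z_hat, 2.0)

# 2) At high temperature the gradient approaches (z_i - z_hat_i) / (N t^2).
t_high = 100.0
grad_exact_high = analytic_grad(z, z_hat, t_high)
grad_logit_match = [(zi - zh) / (n * t_high ** 2) for zi, zh in zip(z, z_hat)]

for g, a in zip(grad_exact_high, grad_logit_match):
    print(f"exact: {g:.3e}   logit-matching approximation: {a:.3e}")
```

The agreement in the second comparison improves as $t$ grows relative to the magnitude of the logits, which is the regime in which the approximation is stated.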

References

[1] Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015). "Distilling the knowledge in a neural network". arXiv:1503.02531 [stat.ML].
[2] Chen, Guobin; Choi, Wongun; Yu, Xiang; Han, Tony; Chandraker, Manmohan (2017). "Learning efficient object detection models with knowledge distillation". Advances in Neural Information Processing Systems: 742–751.
[3] Asami, Taichi; Masumura, Ryo; Yamaguchi, Yoshikazu; Masataki, Hirokazu; Aono, Yushi (2017). "Domain adaptation of DNN acoustic models using knowledge distillation". IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5185–5189.
[4] Cui, Jia; Kingsbury, Brian; Ramabhadran, Bhuvana; Saon, George; Sercu, Tom; Audhkhasi, Kartik; Sethy, Abhinav; Nussbaum-Thom, Markus; Rosenberg, Andrew (2017). "Knowledge distillation across ensembles of multilingual models for low-resource languages". IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4825–4829.
[5] Yang, Yiding; Qiu, Jiayan; Song, Mingli; Tao, Dacheng; Wang, Xinchao (2020). "Distilling Knowledge from Graph Convolutional Networks". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 7072–7081. arXiv:2003.10477. Bibcode:2020arXiv200310477Y.
[6] Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression". Neural Computation. 4 (2): 234–242. doi:10.1162/neco.1992.4.2.234. S2CID 18271205.
[7] Schmidhuber, Jürgen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
[8] Buciluǎ, Cristian; Caruana, Rich; Niculescu-Mizil, Alexandru (2006). "Model compression". Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[9] Torabi, Faraz; Warnell, Garrett; Stone, Peter (2018). "Behavioral Cloning from Observation". arXiv:1805.01954 [cs.AI].

External links

Distilling the knowledge in a neural network – Google AI

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
