Knowledge distillation

In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have more knowledge capacity than small models, this capacity might not be fully utilized. A large model can be just as computationally expensive to evaluate even if it utilizes little of its knowledge capacity. Knowledge distillation aims to transfer that knowledge to a smaller model without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device). [1]

Model distillation is not to be confused with model compression, which describes methods to decrease the size of a large model itself, without training a new model. Model compression generally preserves the architecture and the nominal parameter count of the model, while decreasing the bits-per-parameter.

Knowledge distillation has been successfully used in several applications of machine learning such as object detection, [2] acoustic models, [3] and natural language processing. [4] Recently, it has also been introduced to graph neural networks applicable to non-grid data. [5]

Methods

Transferring knowledge from a large model to a small one requires teaching the small model without loss of validity. If both models are trained on the same data, the smaller model may have insufficient capacity to learn a concise knowledge representation compared to the large model. However, some information about a concise knowledge representation is encoded in the pseudolikelihoods that the large model assigns to its outputs: when a model correctly predicts a class, it assigns a large value to the output variable corresponding to that class, and smaller values to the other output variables. The distribution of values among the outputs for a record provides information on how the large model represents knowledge. Therefore, the goal of economical deployment of a valid model can be achieved by training only the large model on the data, exploiting its better ability to learn concise knowledge representations, and then distilling that knowledge into the smaller model by training it to learn the soft output of the large model. [1]

Mathematical formulation

Given a large model as a function of the vector variable $\mathbf{x}$, trained for a specific classification task, typically the final layer of the network is a softmax in the form

$$\hat{y}_i(\mathbf{x}|t) = \frac{e^{\hat{z}_i(\mathbf{x})/t}}{\sum_j e^{\hat{z}_j(\mathbf{x})/t}}$$

where $\hat{z}_i(\mathbf{x})$ are the logits of the large model and $t$ is the temperature, a parameter which is set to 1 for a standard softmax. The softmax operator converts the logit values to pseudo-probabilities: higher temperature values generate softer distributions of pseudo-probabilities among the output classes. Knowledge distillation consists of training a smaller network, called the distilled model, on a data set called the transfer set (which may differ from the data set used to train the large model) using cross-entropy as the loss function between the output of the distilled model $y(\mathbf{x}|t)$ and the output of the large model $\hat{y}(\mathbf{x}|t)$ on the same record (or the average of the individual outputs, if the large model is an ensemble), using a high value of softmax temperature $t$ for both models: [1]

$$E(\mathbf{x}|t) = -t^2 \sum_i \hat{y}_i(\mathbf{x}|t) \log y_i(\mathbf{x}|t)$$

Here $y_i(\mathbf{x}|t)$ is computed by the same softmax from the logits $z_i(\mathbf{x})$ of the distilled model.

In this context, a high temperature increases the entropy of the output, therefore providing more information to learn for the distilled model compared to hard targets, and at the same time reducing the variance of the gradient between different records, thus allowing a higher learning rate. [1]
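The effect of the temperature can be illustrated with a minimal sketch in plain Python. The function names (`softmax`, `entropy`, `distillation_loss`) and the logit values are hypothetical and chosen only for illustration; the loss is the soft-target cross-entropy scaled by the square of the temperature, as in the formulation of Hinton et al. [1]

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax: exp(z_i / t) / sum_j exp(z_j / t)."""
    m = max(logits)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp((z - m) / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def distillation_loss(student_logits, teacher_logits, t):
    """Cross-entropy between teacher and student soft outputs, scaled by t**2."""
    y_hat = softmax(teacher_logits, t)   # soft targets from the large model
    y = softmax(student_logits, t)       # soft predictions of the distilled model
    return -t * t * sum(yh * math.log(ys) for yh, ys in zip(y_hat, y))

# Hypothetical logits for one record with three classes.
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]

for t in (1.0, 4.0):
    soft = softmax(teacher_logits, t)
    print(f"t={t}: teacher output={[round(p, 3) for p in soft]}, "
          f"entropy={entropy(soft):.3f}, "
          f"loss={distillation_loss(student_logits, teacher_logits, t):.3f}")
```

At the higher temperature the teacher output is visibly less peaked and its entropy is larger, so the relative values assigned to the non-predicted classes carry more weight in the loss.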

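If ground truth is available for the transfer set, the process can be strengthened by adding to the loss the cross-entropy between the output of the distilled model (computed with $t = 1$) and the known label $\bar{y}$:

$$E(\mathbf{x}|t) = -t^2 \sum_i \hat{y}_i(\mathbf{x}|t) \log y_i(\mathbf{x}|t) - \sum_i \bar{y}_i \log y_i(\mathbf{x}|1)$$

where $\hat{y}_i$ and $y_i$ denote the softmax outputs of the large model and of the distilled model, and the component of the loss with respect to the large model is weighted by a factor of $t^2$ since, as the temperature increases, the gradient of the loss with respect to the model weights scales by a factor of $\frac{1}{t^2}$. [1]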
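As a rough sketch of the combined objective, the following plain-Python function adds a hard-label cross-entropy term (student softmax at temperature 1) to the soft-target term (both softmaxes at temperature `t`, scaled by `t**2`). The name `combined_loss`, the integer encoding of the label, and the example logits are assumptions made for this illustration; the two terms are added with equal weight here, whereas in practice a relative weighting between them is often tuned.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp((z - m) / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(student_logits, teacher_logits, label, t):
    """Soft-target term (scaled by t**2) plus hard-label cross-entropy at t = 1.

    `label` is the index of the true class (a one-hot ground-truth vector).
    """
    y_hat = softmax(teacher_logits, t)      # large model, temperature t
    y_soft = softmax(student_logits, t)     # distilled model, temperature t
    y_hard = softmax(student_logits, 1.0)   # distilled model, temperature 1
    soft_term = -t * t * sum(yh * math.log(ys) for yh, ys in zip(y_hat, y_soft))
    hard_term = -math.log(y_hard[label])    # one-hot label keeps a single summand
    return soft_term + hard_term

# Hypothetical logits for one record with three classes; class 0 is the true label.
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]
print(combined_loss(student_logits, teacher_logits, label=0, t=4.0))
```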

Relationship with model compression

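Under the assumption that the logits have zero mean, it is possible to show that model compression is a special case of knowledge distillation. The gradient of the knowledge distillation loss $E$ with respect to the logit $z_i$ of the distilled model is given by

$$\frac{\partial E}{\partial z_i} = -t^2 \frac{\partial}{\partial z_i} \sum_j \hat{y}_j \log y_j = t\,(y_i - \hat{y}_i) = t \left( \frac{e^{z_i/t}}{\sum_j e^{z_j/t}} - \frac{e^{\hat{z}_i/t}}{\sum_j e^{\hat{z}_j/t}} \right)$$

where $\hat{z}_i$ are the logits of the large model. For large values of $t$ this can be approximated as

$$t \left( \frac{1 + z_i/t}{N + \sum_j z_j/t} - \frac{1 + \hat{z}_i/t}{N + \sum_j \hat{z}_j/t} \right)$$

where $N$ is the number of classes, and under the zero-mean hypothesis $\sum_j z_j = \sum_j \hat{z}_j = 0$ it becomes $\frac{z_i - \hat{z}_i}{N}$, which is the derivative of $\frac{1}{2N}(z_i - \hat{z}_i)^2$ with respect to $z_i$, i.e. the loss is equivalent to matching the logits of the two models, as done in model compression. [1]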
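The high-temperature limit can be checked numerically. The sketch below, in plain Python with hypothetical zero-mean logits, compares the exact gradient of the soft-target loss, $t\,(y_i - \hat{y}_i)$, with the logit-matching gradient $(z_i - \hat{z}_i)/N$; the function names are illustrative only.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp((z - m) / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_gradient(student_logits, teacher_logits, t):
    """Exact gradient of the t**2-scaled soft-target loss w.r.t. the student logits."""
    y = softmax(student_logits, t)
    y_hat = softmax(teacher_logits, t)
    return [t * (a - b) for a, b in zip(y, y_hat)]

def logit_matching_gradient(student_logits, teacher_logits):
    """Gradient of (1 / (2N)) * (z_i - z_hat_i)**2, i.e. (z_i - z_hat_i) / N."""
    n = len(student_logits)
    return [(z - zh) / n for z, zh in zip(student_logits, teacher_logits)]

def max_abs_error(student_logits, teacher_logits, t):
    """Largest componentwise gap between the exact and the approximate gradient."""
    exact = kd_gradient(student_logits, teacher_logits, t)
    approx = logit_matching_gradient(student_logits, teacher_logits)
    return max(abs(a - b) for a, b in zip(exact, approx))

# Hypothetical logits; both vectors sum to zero, as the derivation assumes.
z = [1.0, -0.5, -0.5]
z_hat = [2.0, 0.0, -2.0]

for t in (1.0, 10.0, 100.0):
    print(f"t={t}: max |exact - approx| = {max_abs_error(z, z_hat, t):.5f}")
```

The gap shrinks as the temperature grows, which is the sense in which logit matching appears as a limiting case of the distillation gradient.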

Optimal Brain Damage

The Optimal Brain Damage (OBD) algorithm is as follows: [6]

Do until a desired level of sparsity or performance is reached:

  1. Train the network (by methods such as backpropagation) until a reasonable solution is obtained.
  2. Compute the saliencies for each parameter.
  3. Delete some lowest-saliency parameters.

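Deleting a parameter means fixing the parameter to zero. The "saliency" of a parameter $\theta$ is defined as $\frac{1}{2} (\partial_\theta^2 L)\, \theta^2$, where $L$ is the loss function. The second derivative $\partial_\theta^2 L$ can be computed by second-order backpropagation.

The idea for optimal brain damage is to approximate the loss function in a neighborhood of the optimal parameter $\theta^*$ by Taylor expansion:

$$L(\theta) \approx L(\theta^*) + \frac{1}{2} \sum_i \left(\partial_{\theta_i}^2 L(\theta^*)\right) (\theta_i - \theta_i^*)^2$$

where $\nabla L(\theta^*) \approx 0$, since $\theta^*$ is optimal, and the cross-derivatives $\partial_{\theta_i} \partial_{\theta_j} L$ are neglected to save compute. Thus, the saliency of a parameter approximates the increase in loss if that parameter is deleted.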
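The saliency computation and the deletion step can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the procedure of the original paper: the loss is a hypothetical separable quadratic (so the diagonal approximation is exact), the second derivatives are estimated by central finite differences rather than by second-order backpropagation, and the retraining between pruning rounds is omitted. All names and constants are invented for the example.

```python
# Hypothetical toy problem: per-parameter curvature and location of the optimum.
CURVATURE = [4.0, 0.1, 1.0, 2.0]
OPTIMUM = [0.5, 2.0, -0.1, 1.0]

def loss(theta):
    """Separable quadratic loss whose minimum is at OPTIMUM."""
    return sum(0.5 * h * (p - c) ** 2 for h, p, c in zip(CURVATURE, theta, OPTIMUM))

def second_derivative(f, theta, i, eps=1e-3):
    """Central finite-difference estimate of the second derivative w.r.t. theta[i]."""
    up, down = list(theta), list(theta)
    up[i] += eps
    down[i] -= eps
    return (f(up) - 2.0 * f(theta) + f(down)) / eps ** 2

def saliencies(f, theta):
    """Saliency of each parameter: 0.5 * (second derivative) * theta_i**2."""
    return [0.5 * second_derivative(f, theta, i) * theta[i] ** 2
            for i in range(len(theta))]

def prune_lowest(theta, sal, k):
    """Return a copy of theta with the k lowest-saliency parameters set to zero."""
    order = sorted(range(len(theta)), key=lambda i: sal[i])
    pruned = list(theta)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned

theta = list(OPTIMUM)                  # pretend training has converged
sal = saliencies(loss, theta)
pruned = prune_lowest(theta, sal, k=2)
print("saliencies:", [round(s, 4) for s in sal])
print("pruned parameters:", pruned)
print("loss increase:", round(loss(pruned) - loss(theta), 4))
```

In this toy setting the second parameter has the largest magnitude but very low curvature, so it is among those deleted, whereas pruning by magnitude alone would have kept it.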

History

A related methodology is model compression or pruning, where a trained network is reduced in size. This was first done in 1965 by Alexey Ivakhnenko and Valentin Lapa in Ukraine. [7] [8] [9] Their deep networks were trained layer by layer through regression analysis. Superfluous hidden units were pruned using a separate validation set. [10] Other neural network compression methods include Biased Weight Decay [11] and Optimal Brain Damage. [6]

An early example of neural network distillation was published by Jürgen Schmidhuber in 1991, in the field of recurrent neural networks (RNNs). The problem was sequence prediction for long sequences, i.e., deep learning. It was solved by two RNNs. One of them (the automatizer) predicted the sequence, and another (the chunker) predicted the errors of the automatizer. Simultaneously, the automatizer predicted the internal states of the chunker. Once the automatizer managed to predict the chunker's internal states well, it would start fixing the errors, and soon the chunker became obsolete, leaving just one RNN in the end. [12] [13]

The idea of using the output of one neural network to train another neural network was also studied as the teacher-student network configuration. [14] In 1992, several papers studied the statistical mechanics of teacher-student configurations in which both networks are committee machines [15] [16] or both are parity machines. [17]

Compressing the knowledge of multiple models into a single neural network was called model compression in 2006: compression was achieved by training a smaller model on large amounts of pseudo-data labelled by a higher-performing ensemble, optimizing to match the logit of the compressed model to the logit of the ensemble. [18] The knowledge distillation preprint of Geoffrey Hinton et al. (2015) [1] formulated the concept and showed some results achieved in the task of image classification.

Knowledge distillation is also related to the concept of behavioral cloning discussed by Faraz Torabi et al. [19]

References

  1. Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015). "Distilling the knowledge in a neural network". arXiv: 1503.02531 [stat.ML].
  2. Chen, Guobin; Choi, Wongun; Yu, Xiang; Han, Tony; Chandraker, Manmohan (2017). "Learning efficient object detection models with knowledge distillation". Advances in Neural Information Processing Systems: 742–751.
  3. Asami, Taichi; Masumura, Ryo; Yamaguchi, Yoshikazu; Masataki, Hirokazu; Aono, Yushi (2017). Domain adaptation of DNN acoustic models using knowledge distillation. IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5185–5189.
  4. Cui, Jia; Kingsbury, Brian; Ramabhadran, Bhuvana; Saon, George; Sercu, Tom; Audhkhasi, Kartik; Sethy, Abhinav; Nussbaum-Thom, Markus; Rosenberg, Andrew (2017). Knowledge distillation across ensembles of multilingual models for low-resource languages. IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4825–4829.
  5. Yang, Yiding; Jiayan, Qiu; Mingli, Song; Dacheng, Tao; Xinchao, Wang (2020). "Distilling Knowledge from Graph Convolutional Networks" (PDF). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 7072–7081. arXiv: 2003.10477. Bibcode:2020arXiv200310477Y.
  6. LeCun, Yann; Denker, John; Solla, Sara (1989). "Optimal Brain Damage". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
  7. Ivakhnenko, A. G.; Lapa, V. G. (1967). Cybernetics and Forecasting Techniques. American Elsevier Publishing Co. ISBN 978-0-444-00020-0.
  8. Ivakhnenko, A.G. (March 1970). "Heuristic self-organization in problems of engineering cybernetics". Automatica. 6 (2): 207–219. doi:10.1016/0005-1098(70)90092-0.
  9. Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems" (PDF). IEEE Transactions on Systems, Man, and Cybernetics. SMC-1 (4): 364–378. doi:10.1109/TSMC.1971.4308320. Archived (PDF) from the original on 2017-08-29. Retrieved 2019-11-05.
  10. Schmidhuber, Jürgen (2022). "Annotated History of Modern AI and Deep Learning". arXiv: 2212.11279 [cs.NE].
  11. Hanson, Stephen; Pratt, Lorien (1988). "Comparing Biases for Minimal Network Construction with Back-Propagation". Advances in Neural Information Processing Systems. 1. Morgan-Kaufmann.
  12. Schmidhuber, Jürgen (April 1991). "Neural Sequence Chunkers" (PDF). TR FKI-148, TU Munich.
  13. Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression" (PDF). Neural Computation. 4 (2): 234–242. doi:10.1162/neco.1992.4.2.234. S2CID 18271205.
  14. Watkin, Timothy L. H.; Rau, Albrecht; Biehl, Michael (1993-04-01). "The statistical mechanics of learning a rule". Reviews of Modern Physics. 65 (2): 499–556. Bibcode:1993RvMP...65..499W. doi:10.1103/RevModPhys.65.499.
  15. Schwarze, H; Hertz, J (1992-10-15). "Generalization in a Large Committee Machine". Europhysics Letters (EPL). 20 (4): 375–380. Bibcode:1992EL.....20..375S. doi:10.1209/0295-5075/20/4/015. ISSN 0295-5075.
  16. Mato, G; Parga, N (1992-10-07). "Generalization properties of multilayered neural networks". Journal of Physics A: Mathematical and General. 25 (19): 5047–5054. Bibcode:1992JPhA...25.5047M. doi:10.1088/0305-4470/25/19/017. ISSN 0305-4470.
  17. Hansel, D; Mato, G; Meunier, C (1992-11-01). "Memorization Without Generalization in a Multilayered Neural Network". Europhysics Letters (EPL). 20 (5): 471–476. Bibcode:1992EL.....20..471H. doi:10.1209/0295-5075/20/5/015. ISSN 0295-5075.
  18. Buciluǎ, Cristian; Caruana, Rich; Niculescu-Mizil, Alexandru (2006). "Model compression". Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.
  19. Torabi, Faraz; Warnell, Garrett; Stone, Peter (2018). "Behavioral Cloning from Observation". arXiv: 1805.01954 [cs.AI].