Weak supervision

Last updated

Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data (exclusively used in more expensive and time-consuming supervised learning paradigm), followed by a large amount of unlabeled data (used exclusively in unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, it can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam.

Contents

Problem

Tendency for a task to employ supervised vs. unsupervised methods. Task names straddling circle boundaries is intentional. It shows that the classical division of imaginative tasks (left) employing unsupervised methods is blurred in today's learning schemes. Task-guidance.png
Tendency for a task to employ supervised vs. unsupervised methods. Task names straddling circle boundaries is intentional. It shows that the classical division of imaginative tasks (left) employing unsupervised methods is blurred in today's learning schemes.

The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.

Technique

An example of the influence of unlabeled data in semi-supervised learning. The top panel shows a decision boundary we might adopt after seeing only one positive (white circle) and one negative (black circle) example. The bottom panel shows a decision boundary we might adopt if, in addition to the two labeled examples, we were given a collection of unlabeled data (gray circles). Example of unlabeled data in semisupervised learning.png
An example of the influence of unlabeled data in semi-supervised learning. The top panel shows a decision boundary we might adopt after seeing only one positive (white circle) and one negative (black circle) example. The bottom panel shows a decision boundary we might adopt if, in addition to the two labeled examples, we were given a collection of unlabeled data (gray circles).

More formally, semi-supervised learning assumes a set of independently identically distributed examples with corresponding labels and unlabeled examples are processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.

Semi-supervised learning may refer to either transductive learning or inductive learning. [1] The goal of transductive learning is to infer the correct labels for the given unlabeled data only. The goal of inductive learning is to infer the correct mapping from to .

It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.

Assumptions

In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions: [2]

Continuity / smoothness assumption

Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so few points are close to each other but in different classes. [3]

Cluster assumption

The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.

Manifold assumption

The data lie approximately on a manifold of much lower dimension than the input space. In this case learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold.

The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds, [4] and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively.

History

The heuristic approach of self-training (also known as self-learning or self-labeling) is historically the oldest approach to semi-supervised learning, [2] with examples of applications starting in the 1960s. [5]

The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s. [6] Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995. [7]

Methods

Generative models

Generative approaches to statistical learning first seek to estimate ,[ disputed discuss ] the distribution of data points belonging to each class. The probability that a given point has label is then proportional to by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about ) or as an extension of unsupervised learning (clustering plus some labels).

Generative models assume that the distributions take some particular form parameterized by the vector . If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone. [8] However, if the assumptions are correct, then the unlabeled data necessarily improves performance. [7]

The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models.

The parameterized joint distribution can be written as by using the chain rule. Each parameter vector is associated with a decision function . The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by :

[9]

Low-density separation

Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss for labeled data, a loss function is introduced over the unlabeled data by letting . TSVM then selects from a reproducing kernel Hilbert space by minimizing the regularized empirical risk:

An exact solution is intractable due to the non-convex term , so research focuses on useful approximations. [9]

Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).

Laplacian regularization

Laplacian regularization has been historically approached through graph-Laplacian. Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its nearest neighbors or to examples within some distance . The weight of an edge between and is then set to .

Within the framework of manifold regularization, [10] [11] the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes

[9]

where is a reproducing kernel Hilbert space and is the manifold on which the data lie. The regularization parameters and control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian where and is the vector , we have

.

The graph-based approach to Laplacian regularization is to put in relation with finite difference method.[ clarification needed ][ citation needed ]

The Laplacian can also be used to extend the supervised learning algorithms: regularized least squares and support vector machines (SVM) to semi-supervised versions Laplacian regularized least squares and Laplacian SVM.

Heuristic approaches

Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples. In this vein, some methods learn a low-dimensional representation using the supervised data and then apply either low-density separation or graph-based methods to the learned representation. [12] [13] Iteratively refining the representation and then performing semi-supervised learning on said representation may further improve performance.

Self-training is a wrapper method for semi-supervised learning. [14] First a supervised learning algorithm is trained based on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm. Generally only the labels the classifier is most confident in are added at each step. [15] In natural language processing, a common self-training algorithm is the Yarowsky algorithm for problems like word sense disambiguation, accent restoration, and spelling correction. [16]

Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another. [17]

In human cognition

Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data. [18] More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback).

Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces. [19] Infants and children take into account not only unlabeled examples, but the sampling process from which labeled examples arise. [20] [21]

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Machine learning paradigm

In machine learning, supervised learning (SL) is a paradigm where a model is trained using input objects and desired output values, which are often human-made labels. The training process builds a function that maps new data to expected output values. An optimal scenario will allow for the algorithm to accurately determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured via a generalization error.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories, SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

In mathematics, the Laplace operator or Laplacian is a differential operator given by the divergence of the gradient of a scalar function on Euclidean space. It is usually denoted by the symbols , (where is the nabla operator), or . In a Cartesian coordinate system, the Laplacian is given by the sum of second partial derivatives of the function with respect to each independent variable. In other coordinate systems, such as cylindrical and spherical coordinates, the Laplacian also has a useful form. Informally, the Laplacian Δf (p) of a function f at a point p measures by how much the average value of f over small spheres or balls centered at p deviates from f (p).

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. Inherently, Multi-task learning is a multi-objective optimization problem having trade-offs between different tasks. Early versions of MTL were called "hints".

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how much a model probability distribution Q is different from a true probability distribution P. Mathematically, it is defined as

<span class="mw-page-title-main">Regularization (mathematics)</span> Technique to make a model more generalizable and transferable

In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer of a problem to a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.

<span class="mw-page-title-main">Complex torus</span>

In mathematics, a complex torus is a particular kind of complex manifold M whose underlying smooth manifold is a torus in the usual sense. Here N must be the even number 2n, where n is the complex dimension of M.

<span class="mw-page-title-main">Autoencoder</span> Neural network that learns efficient data encoding in an unsupervised manner

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.

Pattern recognition is a very active field of research intimately bound to machine learning. Also known as classification or statistical classification, pattern recognition aims at building a classifier that can determine the class of an input pattern. This procedure, known as training, corresponds to learning an unknown decision function based only on a set of input-output pairs that form the training data. Nonetheless, in real world applications such as character recognition, a certain amount of information on the problem is usually known beforehand. The incorporation of this prior knowledge into the training is the key element that will allow an increase of performance in many applications.

Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data. This principle was first set out in a 1974 book by Vladimir Vapnik and Alexey Chervonenkis and uses the VC dimension.

The image segmentation problem is concerned with partitioning an image into multiple regions according to some homogeneity criterion. This article is primarily concerned with graph theoretic approaches to image segmentation applying graph partitioning via minimum cut or maximum cut. Segmentation-based object categorization can be viewed as a specific case of spectral clustering applied to image segmentation.

Within mathematical analysis, Regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other regularization-based machine-learning algorithms. SVM algorithms categorize binary data, with the goal of fitting the training set data in a way that minimizes the average of the hinge-loss function and L2 norm of the learned weights. This strategy avoids overfitting via Tikhonov regularization and in the L2 norm sense and also corresponds to minimizing the bias and variance of our estimator of the weights. Estimators with lower Mean squared error predict better or generalize better when given unseen data.

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

<span class="mw-page-title-main">Manifold regularization</span>

In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

Flag algebras are an important computational tool in the field of graph theory which have a wide range of applications in homomorphism density and related topics. Roughly, they formalize the notion of adding and multiplying homomorphism densities and set up a framework to solve graph homomorphism inequalities with computers by reducing them to semidefinite programming problems. Originally introduced by Alexander Razborov in a 2007 paper, the method has since come to solve numerous difficult, previously unresolved graph theoretic questions. These include the question regarding the region of feasible edge density, triangle density pairs and the maximum number of pentagons in triangle free graphs.

<span class="mw-page-title-main">Chambolle-Pock algorithm</span> Primal-Dual algorithm optimization for convex problems

In mathematics, the Chambolle-Pock algorithm is an algorithm used to solve convex optimization problems. It was introduced by Antonin Chambolle and Thomas Pock in 2011 and has since become a widely used method in various fields, including image processing, computer vision, and signal processing.

References

  1. Semi-Supervised Learning Literature Survey, Page 5, 2007, CiteSeerX   10.1.1.99.9681
  2. 1 2 Chapelle, Schölkopf & Zien 2006.
  3. Chawla, N., Bowyer, K., Hall, L.O., & Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. ArXiv, abs/1106.1813.
  4. Stevens, Kenneth N. (1998). Acoustic phonetics. Cambridge, Mass.: MIT Press. ISBN   0-585-08720-2. OCLC   42856189.
  5. Scudder, H. (July 1965). "Probability of error of some adaptive pattern-recognition machines". IEEE Transactions on Information Theory. 11 (3): 363–371. doi:10.1109/TIT.1965.1053799. ISSN   1557-9654.
  6. Vapnik, V.; Chervonenkis, A. (1974). Theory of Pattern Recognition (in Russian). Moscow: Nauka. cited in Chapelle, Schölkopf & Zien 2006 , p. 3
  7. 1 2 Ratsaby, J.; Venkatesh, S. "Learning from a mixture of labeled and unlabeled examples with parametric side information" (PDF). in Proceedings of the eighth annual conference on Computational learning theory - COLT '95. New York, New York, US: ACM Press. 1995. pp. 412–417. doi:10.1145/225298.225348. ISBN   0-89791-723-5. S2CID   17561403.. Cited in Chapelle, Schölkopf & Zien 2006 , p. 4
  8. Fabio, Cozman; Ira, Cohen (2006-09-22), "Risks of Semi-Supervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers", Semi-Supervised Learning, The MIT Press, pp. 56–72, doi:10.7551/mitpress/9780262033589.003.0004, ISBN   978-0-262-03358-9 In: Chapelle, Schölkopf & Zien 2006
  9. 1 2 3 Zhu, Xiaojin. Semi-Supervised Learning University of Wisconsin-Madison.
  10. M. Belkin; P. Niyogi (2004). "Semi-supervised Learning on Riemannian Manifolds". Machine Learning. 56 (Special Issue on Clustering): 209–239. doi: 10.1023/b:mach.0000033120.25363.1e .
  11. M. Belkin, P. Niyogi, V. Sindhwani. On Manifold Regularization. AISTATS 2005.
  12. Iscen, Ahmet; Tolias, Giorgos; Avrithis, Yannis; Chum, Ondrej (2019). "Label Propagation for Deep Semi-Supervised Learning". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5065–5074. arXiv: 1904.04717 . doi:10.1109/CVPR.2019.00521. ISBN   978-1-7281-3293-8. S2CID   104291869 . Retrieved 26 March 2021.
  13. Burkhart, Michael C.; Shan, Kyle (2020). "Deep Low-Density Separation for Semi-supervised Classification". International Conference on Computational Science (ICCS). Lecture Notes in Computer Science. Vol. 12139. pp. 297–311. arXiv: 2205.11995 . doi: 10.1007/978-3-030-50420-5_22 . ISBN   978-3-030-50419-9.
  14. Triguero, Isaac; García, Salvador; Herrera, Francisco (2013-11-26). "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study". Knowledge and Information Systems. 42 (2): 245–284. doi:10.1007/s10115-013-0706-y. ISSN   0219-1377. S2CID   1955810.
  15. Fazakis, Nikos; Karlos, Stamatis; Kotsiantis, Sotiris; Sgarbas, Kyriakos (2015-12-29). "Self-Trained LMT for Semisupervised Learning". Computational Intelligence and Neuroscience. 2016: 3057481. doi: 10.1155/2016/3057481 . PMC   4709606 . PMID   26839531.
  16. Yarowsky, David (1995). "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods". Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA: Association for Computational Linguistics: 189–196. doi: 10.3115/981658.981684 . Retrieved 1 November 2022.
  17. Didaci, Luca; Fumera, Giorgio; Roli, Fabio (2012-11-07). Gimel’farb, Georgy; Hancock, Edwin; Imiya, Atsushi; Kuijper, Arjan; Kudo, Mineichi; Omachi, Shinichiro; Windeatt, Terry; Yamada, Keiji (eds.). Analysis of Co-training Algorithm with Very Small Training Sets. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 719–726. doi:10.1007/978-3-642-34166-3_79. ISBN   9783642341656. S2CID   46063225.
  18. Zhu, Xiaojin (2009). Introduction to semi-supervised learning. Goldberg, A. B. (Andrew B.). [San Rafael, Calif.]: Morgan & Claypool Publishers. ISBN   978-1-59829-548-1. OCLC   428541480.
  19. Younger B. A.; Fearing D. D. (1999). "Parsing Items into Separate Categories: Developmental Change in Infant Categorization". Child Development. 70 (2): 291–303. doi:10.1111/1467-8624.00022.
  20. Xu, F. & Tenenbaum, J. B. (2007). "Sensitivity to sampling in Bayesian word learning". Developmental Science. 10 (3): 288–297. CiteSeerX   10.1.1.141.7505 . doi:10.1111/j.1467-7687.2007.00590.x. PMID   17444970.
  21. Gweon, H., Tenenbaum J.B., and Schulz L.E (2010). "Infants consider both the sample and the sampling process in inductive generalization". Proc Natl Acad Sci U S A. 107 (20): 9066–71. Bibcode:2010PNAS..107.9066G. doi: 10.1073/pnas.1003095107 . PMC   2889113 . PMID   20435914.{{cite journal}}: CS1 maint: multiple names: authors list (link)

Sources