Feature learning

Diagram of the feature learning paradigm in ML for application to downstream tasks, which can be applied to either raw data such as images or text, or to an initial set of features of the data. Feature learning is intended to result in faster training or better performance in task-specific settings than if the data was input directly (compare transfer learning).

In machine learning (ML), feature learning or representation learning [2] is a set of techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.


Feature learning is motivated by the fact that ML tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data, such as images, video, and sensor data, have not yielded to attempts to algorithmically define specific features. An alternative is to discover such features or representations through examination of the data, without relying on explicit algorithms.

Feature learning can be supervised, unsupervised, or self-supervised:

Supervised

Supervised feature learning is learning features from labeled data. The data label allows the system to compute an error term, the degree to which the system fails to produce the label, which can then be used as feedback to correct the learning process (reduce/minimize the error). Approaches include:

Supervised dictionary learning

Dictionary learning develops a set (dictionary) of representative elements from the input data such that each data point can be represented as a weighted sum of the representative elements. The dictionary elements and the weights may be found by minimizing the average representation error (over the input data), together with L1 regularization on the weights to enable sparsity (i.e., the representation of each data point has only a few nonzero weights).

Supervised dictionary learning exploits both the structure underlying the input data and the labels to optimize the dictionary elements. For example, one supervised dictionary learning technique [12] applies dictionary learning to classification problems by jointly optimizing the dictionary elements, the weights for representing data points, and the parameters of the classifier. In particular, a minimization problem is formulated whose objective function consists of the classification error, the representation error, an L1 regularization on the representing weights for each data point (to enable sparse representation of data), and an L2 regularization on the parameters of the classifier.
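
Schematically, such a joint objective can be written as follows; the notation here is illustrative rather than the exact formulation of [12], with x_i a data point, y_i its label, D the dictionary, w_i the sparse weights, and f_θ a classifier with parameters θ:

```latex
\min_{D,\;\{w_i\},\;\theta}\;\; \sum_{i=1}^{n} \Big[\, \ell\big(y_i, f_{\theta}(w_i)\big) \;+\; \|x_i - D w_i\|_2^2 \;+\; \lambda_1 \|w_i\|_1 \,\Big] \;+\; \lambda_2 \|\theta\|_2^2
```

The first term is the classification error, the second the representation error, the L1 term enforces sparsity of the weights, and the L2 term regularizes the classifier parameters.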

Neural networks

Neural networks are a family of learning algorithms that use a "network" consisting of multiple layers of inter-connected nodes. The design is inspired by the animal nervous system, in which the nodes are viewed as neurons and the edges as synapses. Each edge has an associated weight, and the network defines computational rules for passing input data from the network's input layer to the output layer. A network function associated with a neural network characterizes the relationship between the input and output layers and is parameterized by the weights. With appropriately defined network functions, various learning tasks can be performed by minimizing a cost function over the network function (weights).

Multilayer neural networks can be used to perform feature learning, since they learn a representation of their input at the hidden layer(s), which is subsequently used for classification or regression at the output layer. Siamese networks are one popular architecture of this type.
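
As a minimal illustration of this idea, the following PyTorch sketch (toy data and arbitrary layer sizes, chosen here only for illustration) trains a small multilayer network on a supervised task and then reuses the hidden-layer activations as learned features:

```python
import torch
import torch.nn as nn

# Toy supervised data: 100 samples, 20 raw input dimensions, 3 classes.
X = torch.randn(100, 20)
y = torch.randint(0, 3, (100,))

hidden = nn.Sequential(nn.Linear(20, 8), nn.ReLU())  # learns the representation
head = nn.Linear(8, 3)                               # task-specific output layer
model = nn.Sequential(hidden, head)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# After training, the hidden activations serve as learned features
# that can be fed to another classifier or regressor.
with torch.no_grad():
    features = hidden(X)   # shape (100, 8)
```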

Unsupervised

Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature learning is often to discover low-dimensional features that capture some structure underlying the high-dimensional input data. When the feature learning is performed in an unsupervised way, it enables a form of semisupervised learning where features learned from an unlabeled dataset are then employed to improve performance in a supervised setting with labeled data. [13] [14] Several approaches are introduced in the following.

K-means clustering

K-means clustering is an approach for vector quantization. In particular, given a set of n vectors, k-means clustering groups them into k clusters (i.e., subsets) in such a way that each vector belongs to the cluster with the closest mean. The problem is computationally NP-hard, although suboptimal greedy algorithms have been developed.

K-means clustering can be used to group an unlabeled set of inputs into k clusters, and then use the centroids of these clusters to produce features. These features can be produced in several ways. The simplest is to add k binary features to each sample, where each feature j has value one iff the jth centroid learned by k-means is the closest to the sample under consideration. [6] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks [15] ). Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms. [16]
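
The following scikit-learn/NumPy sketch (toy data; k and the RBF width are arbitrary placeholders) illustrates both featurizations described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # unlabeled inputs (placeholder data)

k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Featurization 1: k binary features, one-hot on the nearest centroid.
nearest = kmeans.predict(X)             # index of the closest centroid per sample
onehot = np.eye(k)[nearest]             # shape (500, k)

# Featurization 2: distances to all k centroids, optionally passed
# through a radial basis function.
dists = kmeans.transform(X)             # shape (500, k)
gamma = 1.0                             # RBF width, arbitrary here
rbf_features = np.exp(-gamma * dists**2)
```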

In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task. [6] K-means also improves performance in the domain of NLP, specifically for named-entity recognition; [17] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings). [14]

Principal component analysis

Principal component analysis (PCA) is often used for dimension reduction. Given an unlabeled set of n input data vectors, PCA generates the p right singular vectors of the data matrix corresponding to its p largest singular values, where p is much smaller than the dimension of the input data. The kth row of the data matrix is the kth input data vector shifted by the sample mean of the input (i.e., the sample mean is subtracted from the data vector). Equivalently, these singular vectors are the eigenvectors corresponding to the p largest eigenvalues of the sample covariance matrix of the input vectors. These p singular vectors are the feature vectors learned from the input data, and they represent directions along which the data has the largest variations.
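
A minimal NumPy sketch of this computation on toy data, taking the direct SVD route rather than an iterative algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # n = 200 input vectors of dimension 50
p = 5                                   # number of components to keep

Xc = X - X.mean(axis=0)                 # subtract the sample mean from each vector
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:p]                     # top-p right singular vectors (learned features)
projected = Xc @ components.T           # low-dimensional representation, shape (200, p)
```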

PCA is a linear feature learning approach since the p singular vectors are linear functions of the data matrix. The singular vectors can be generated via a simple algorithm with p iterations: in the ith iteration, the projection of the data matrix on the (i-1)th singular vector is subtracted, and the ith singular vector is found as the right singular vector corresponding to the largest singular value of the residual data matrix.

PCA has several limitations. First, it assumes that the directions with large variance are of most interest, which may not be the case. PCA only relies on orthogonal transformations of the original data, and it exploits only the first- and second-order moments of the data, which may not well characterize the data distribution. Furthermore, PCA can effectively reduce dimension only when the input data vectors are correlated (which results in a few dominant eigenvalues).

Local linear embedding

Local linear embedding (LLE) is a nonlinear learning approach for generating low-dimensional, neighbor-preserving representations from (unlabeled) high-dimensional input. The approach was proposed by Roweis and Saul (2000). [18] [19] The general idea of LLE is to reconstruct the original high-dimensional data using lower-dimensional points while preserving some geometric properties of the neighborhoods in the original data set.

LLE consists of two major steps. The first step is "neighbor preserving": each input data point Xi is reconstructed as a weighted sum of its K nearest neighbor data points, and the optimal weights are found by minimizing the average squared reconstruction error (i.e., the difference between an input point and its reconstruction) under the constraint that the weights associated with each point sum to one. The second step is "dimension reduction": vectors are sought in a lower-dimensional space that minimize the representation error using the weights optimized in the first step. Note that in the first step the weights are optimized with the data fixed, which can be solved as a least squares problem; in the second step the lower-dimensional points are optimized with the weights fixed, which can be solved via sparse eigenvalue decomposition.
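
In symbols, with N(i) denoting the K nearest neighbors of x_i and y_i the lower-dimensional points, the two steps solve

```latex
\text{Step 1:}\quad \min_{W}\; \sum_{i} \Big\| x_i - \sum_{j \in N(i)} W_{ij}\, x_j \Big\|^2 \quad \text{s.t.}\quad \sum_{j} W_{ij} = 1 \;\;\forall i,
\qquad
\text{Step 2:}\quad \min_{Y}\; \sum_{i} \Big\| y_i - \sum_{j \in N(i)} W_{ij}\, y_j \Big\|^2 ,
```

where the weights W found in the first step are held fixed in the second.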

The reconstruction weights obtained in the first step capture the "intrinsic geometric properties" of a neighborhood in the input data. [19] It is assumed that the original data lie on a smooth lower-dimensional manifold, and the "intrinsic geometric properties" captured by the weights are expected to remain valid on that manifold. This is why the same weights are used in the second step of LLE. Compared with PCA, LLE is more powerful in exploiting the underlying structure of the data.

Independent component analysis

Independent component analysis (ICA) is a technique for forming a data representation using a weighted sum of independent non-Gaussian components. [20] The assumption of non-Gaussianity is imposed since the weights cannot be uniquely determined when all the components follow a Gaussian distribution.
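
In its simplest (linear, noise-free) form, each observed vector x is modeled as

```latex
x = A s, \qquad s = (s_1, \ldots, s_k)^{\top},
```

where the components s_i are assumed to be statistically independent and non-Gaussian; ICA estimates both the mixing matrix A and the components s from samples of x.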

Unsupervised dictionary learning

Unsupervised dictionary learning does not utilize data labels and exploits the structure underlying the data for optimizing dictionary elements. An example of unsupervised dictionary learning is sparse coding, which aims to learn basis functions (dictionary elements) for data representation from unlabeled input data. Sparse coding can be applied to learn overcomplete dictionaries, where the number of dictionary elements is larger than the dimension of the input data. [21] Aharon et al. proposed algorithm K-SVD for learning a dictionary of elements that enables sparse representation. [22]
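
A minimal scikit-learn sketch of unsupervised, sparse-coding-style dictionary learning; all sizes and the regularization strength are placeholders chosen for illustration:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))            # unlabeled data of dimension 16

# An overcomplete dictionary: more atoms (24) than input dimensions (16).
dl = DictionaryLearning(n_components=24, alpha=1.0,
                        transform_algorithm='lasso_lars', random_state=0)
codes = dl.fit_transform(X)               # sparse weights, shape (300, 24)
dictionary = dl.components_               # learned atoms, shape (24, 16)
```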

Multilayer/deep architectures

The hierarchical architecture of the biological neural system inspires deep learning architectures for feature learning by stacking multiple layers of learning nodes. [23] These architectures are often designed based on the assumption of distributed representation: observed data is generated by the interactions of many different factors on multiple levels. In a deep learning architecture, the output of each intermediate layer can be viewed as a representation of the original input data. Each level uses the representation produced by the previous, lower level as input, and produces new representations as output, which are then fed to higher levels. The input at the bottom layer is raw data, and the output of the final, highest layer is the final low-dimensional feature or representation.

Restricted Boltzmann machine

Restricted Boltzmann machines (RBMs) are often used as a building block for multilayer learning architectures. [6] [24] An RBM can be represented by an undirected bipartite graph consisting of a group of binary hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes. It is a special case of the more general Boltzmann machine with the constraint that there are no connections between nodes within the same group (i.e., no visible-visible or hidden-hidden edges). Each edge in an RBM is associated with a weight. The weights, together with the connections, define an energy function, based on which a joint distribution over the visible and hidden nodes can be devised. Because of the bipartite topology of the RBM, the hidden variables are mutually independent conditioned on the visible variables, and vice versa. Such conditional independence facilitates computations.
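
Concretely, for binary visible units v and hidden units h with weight matrix W and bias vectors a and b, the energy and the resulting factorized conditionals take the standard form

```latex
E(v, h) = -a^{\top} v - b^{\top} h - v^{\top} W h, \qquad P(v, h) = \frac{e^{-E(v,h)}}{Z},
\qquad
P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i W_{ij} v_i\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big),
```

where σ is the logistic sigmoid and Z the partition function; the factorized conditionals are what make sampling in each direction cheap.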

An RBM can be viewed as a single layer architecture for unsupervised feature learning. In particular, the visible variables correspond to input data, and the hidden variables correspond to feature detectors. The weights can be trained by maximizing the probability of visible variables using Hinton's contrastive divergence (CD) algorithm. [24]
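
A minimal NumPy sketch of a single CD-1 update (one Gibbs step; sizes, batch, and learning rate are toy placeholders), illustrating the positive and negative phases:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 12, 6, 0.1
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
a = np.zeros(n_visible)                  # visible biases
b = np.zeros(n_hidden)                   # hidden biases

v0 = rng.integers(0, 2, size=(32, n_visible)).astype(float)  # a batch of binary inputs

# Positive phase: hidden probabilities and samples given the data.
ph0 = sigmoid(v0 @ W + b)
h0 = (rng.random(ph0.shape) < ph0).astype(float)

# Negative phase: one Gibbs step (reconstruct visibles, then hiddens).
pv1 = sigmoid(h0 @ W.T + a)
v1 = (rng.random(pv1.shape) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + b)

# CD-1 parameter update (approximate gradient of the log-likelihood).
W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
a += lr * (v0 - v1).mean(axis=0)
b += lr * (ph0 - ph1).mean(axis=0)
```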

In general, training RBMs by solving the maximization problem tends to result in non-sparse representations. The sparse RBM [25] was proposed to enable sparse representations. The idea is to add a regularization term to the data-likelihood objective function, which penalizes the deviation of the expected hidden variable activations from a small constant. RBMs have also been used to obtain disentangled representations of data, in which interesting features map to separate hidden units. [26]

Autoencoder

An autoencoder, consisting of an encoder and a decoder, is a paradigm for deep learning architectures. An example is provided by Hinton and Salakhutdinov, [24] where the encoder uses raw data (e.g., an image) as input and produces a feature or representation as output, and the decoder uses the extracted feature from the encoder as input and reconstructs the original raw input data as output. The encoder and decoder are constructed by stacking multiple layers of RBMs. The parameters involved in the architecture were originally trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, its activations are used as the visible variables for training the next RBM. Current approaches typically apply end-to-end training with stochastic gradient descent methods. Training can be repeated until some stopping criterion is satisfied.
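
A minimal end-to-end autoencoder sketch in PyTorch (toy data and arbitrary layer sizes); after training, the encoder output serves as the learned low-dimensional feature:

```python
import torch
import torch.nn as nn

X = torch.randn(256, 784)                      # e.g. flattened images (toy data)

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
for _ in range(100):                           # end-to-end training with SGD
    opt.zero_grad()
    loss = loss_fn(model(X), X)                # reconstruct the input
    loss.backward()
    opt.step()

with torch.no_grad():
    features = encoder(X)                      # 32-dimensional learned representation
```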

Self-supervised

Self-supervised representation learning is learning features by training on the structure of unlabeled data rather than relying on explicit labels for an information signal. This approach has enabled the combined use of deep neural network architectures and larger unlabeled datasets to produce deep feature representations. [9] Training tasks typically fall into the classes of contrastive, generative, or both. [27] Contrastive representation learning trains representations so that associated data pairs, called positive samples, are aligned, while pairs with no relation, called negative samples, are contrasted. A large number of negative samples is typically necessary to prevent catastrophic collapse, in which all inputs are mapped to the same representation. [9] Generative representation learning tasks the model with producing the correct data, either to match a restricted input or to reconstruct the full input from a lower-dimensional representation. [27]
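
A minimal sketch of a contrastive (InfoNCE-style) objective in PyTorch, in which row i of two representation batches forms a positive pair and all other pairings in the batch serve as negatives; the temperature and sizes are arbitrary:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: row i of z1 and row i of z2 form a positive pair;
    every other row in the batch acts as a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature          # pairwise similarities
    targets = torch.arange(len(z1))           # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 64 positive pairs of 128-dimensional representations.
z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
loss = contrastive_loss(z1, z2)
```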

A common setup for self-supervised representation learning of a certain data type (e.g., text, image, audio, video) is to pretrain the model using large datasets of general-context, unlabeled data. [11] Depending on the context, the result is either a set of representations for common data segments (e.g., words) into which new data can be broken, or a neural network able to convert each new data point (e.g., an image) into a set of lower-dimensional features. [9] In either case, the output representations can then be used as an initialization in many different problem settings where labeled data may be limited. Specialization of the model to specific tasks is typically done with supervised learning, either by fine-tuning the model/representations with the labels as the signal, or by freezing the representations and training an additional model that takes them as input. [11]

Many self-supervised training schemes have been developed for use in representation learning of various modalities, often first showing successful application in text or image before being transferred to other data types. [9]

Text

Word2vec is a word embedding technique which learns to represent words through self-supervision over each word and its neighboring words in a sliding window across a large corpus of text. [28] The model has two possible training schemes to produce word vector representations, one generative and one contrastive. [27] The first is word prediction given each of the neighboring words as an input. [28] The second is training on the representation similarity for neighboring words and representation dissimilarity for random pairs of words. [10] A limitation of word2vec is that only the pairwise co-occurrence structure of the data is used, and not the ordering or entire set of context words. More recent transformer-based representation learning approaches attempt to solve this with word prediction tasks. [9] GPTs pretrain on next word prediction using prior input words as context, [29] whereas BERT masks random tokens in order to provide bidirectional context. [30]
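
As a usage illustration (not the original implementation), the contrastive skip-gram-with-negative-sampling variant of word2vec can be trained with the gensim library; parameter names follow gensim 4.x and the corpus here is a toy placeholder:

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
sentences = [["king", "queen", "royal"],
             ["dog", "cat", "pet"],
             ["dog", "barks", "cat", "meows"]]

# sg=1 selects skip-gram; negative=5 enables negative sampling (contrastive).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50)

vec = model.wv["dog"]                    # learned 50-dimensional embedding
similar = model.wv.most_similar("dog")   # nearest words in the embedding space
```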

Other self-supervised techniques extend word embeddings by finding representations for larger text structures such as sentences or paragraphs in the input data. [9] Doc2vec extends the generative training approach in word2vec by adding an additional input to the word prediction task based on the paragraph it is within, and is therefore intended to represent paragraph level context. [31]

Image

The domain of image representation learning has employed many different self-supervised training techniques, including transformation, [32] inpainting, [33] patch discrimination [34] and clustering. [35]

Examples of generative approaches are Context Encoders, which trains an AlexNet CNN architecture to generate a removed image region given the masked image as input, [33] and iGPT, which applies the GPT-2 language model architecture to images by training on pixel prediction after reducing the image resolution. [36]

Many other self-supervised methods use Siamese networks, which generate different views of the image through various augmentations that are then aligned to have similar representations. The challenge is avoiding collapsing solutions in which the model encodes all images to the same representation. [37] SimCLR is a contrastive approach which uses negative examples in order to generate image representations with a ResNet CNN. [34] Bootstrap Your Own Latent (BYOL) removes the need for negative samples by encoding one of the views with a slow-moving average of the model parameters as they are being modified during training. [38]

Graph

The goal of many graph representation learning techniques is to produce an embedded representation of each node based on the overall network topology. [39] node2vec extends the word2vec training technique to nodes in a graph by using co-occurrence in random walks through the graph as the measure of association. [40] Another approach is to maximize mutual information, a measure of similarity, between the representations of associated structures within the graph. [9] An example is Deep Graph Infomax, which uses contrastive self-supervision based on mutual information between the representation of a “patch” around each node, and a summary representation of the entire graph. Negative samples are obtained by pairing the graph representation with either representations from another graph in a multigraph training setting, or corrupted patch representations in single graph training. [41]
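
A simplified sketch of the node2vec idea, using unbiased random walks for brevity (node2vec itself biases the walks with return and in-out parameters) and a word2vec-style model over the walks; the graph and hyperparameters are toy placeholders:

```python
import random
from gensim.models import Word2Vec

# Toy graph as an adjacency list.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(graph, start, length=10):
    """Generate one unbiased random walk starting from the given node."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Generate walks from every node and embed nodes like words in sentences.
walks = [random_walk(graph, node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, epochs=30)
node_embedding = model.wv["c"]           # learned embedding for node "c"
```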

Video

With analogous results in masked prediction [42] and clustering, [43] video representation learning approaches are often similar to image techniques but must utilize the temporal sequence of video frames as an additional learned structure. Examples include VCP, which masks video clips and trains to choose the correct one given a set of clip options, and Xu et al., who train a 3D-CNN to identify the original order given a shuffled set of video clips. [44]

Audio

Self-supervised representation techniques have also been applied to many audio data formats, particularly for speech processing. [9] Wav2vec 2.0 discretizes the audio waveform into timesteps via temporal convolutions, and then trains a transformer on masked prediction of random timesteps using a contrastive loss. [45] This is similar to the BERT language model, except as in many SSL approaches to video, the model chooses among a set of options rather than over the entire word vocabulary. [30] [45]

Multimodal

Self-supervised learning has also been used to develop joint representations of multiple data types. [9] Approaches usually rely on some natural or human-derived association between the modalities as an implicit label, for instance video clips of animals or objects with characteristic sounds, [46] or captions written to describe images. [47] CLIP produces a joint image-text representation space by training to align image and text encodings from a large dataset of image-caption pairs using a contrastive loss. [47] MERLOT Reserve trains a transformer-based encoder to jointly represent audio, subtitles and video frames from a large dataset of videos through 3 joint pretraining tasks: contrastive masked prediction of either audio or text segments given the video frames and surrounding audio and text context, along with contrastive alignment of video frames with their corresponding captions. [46]

Multimodal representation models are typically unable to assume direct correspondence of representations in the different modalities, since the precise alignment can often be noisy or ambiguous. For example, the text "dog" could be paired with many different pictures of dogs, and correspondingly a picture of a dog could be captioned with varying degrees of specificity. This limitation means that downstream tasks may require an additional generative mapping network between modalities to achieve optimal performance, such as in DALLE-2 for text to image generation. [48]

Dynamic representation learning

Dynamic representation learning methods [49] [50] generate latent embeddings for dynamic systems such as dynamic networks. Because many distance functions are invariant under certain linear transformations, different sets of embedding vectors can represent the same or similar information. Consequently, for a dynamic system, a temporal difference in its embeddings may stem either from misalignment of embeddings due to arbitrary transformations or from actual changes in the system. [51] Generally speaking, temporal embeddings learned via dynamic representation learning methods should therefore be inspected for spurious changes and aligned before subsequent dynamic analyses.
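
One common way to remove such spurious differences is to align one embedding snapshot onto a reference snapshot with an orthogonal (Procrustes-style) transformation before comparing them. A minimal NumPy sketch, assuming the two snapshots index the same entities in the same order:

```python
import numpy as np

def align(Y_t, Y_ref):
    """Rotate/reflect Y_t onto Y_ref using the orthogonal Procrustes solution."""
    U, _, Vt = np.linalg.svd(Y_t.T @ Y_ref)
    R = U @ Vt                               # optimal orthogonal transformation
    return Y_t @ R

rng = np.random.default_rng(0)
Y_ref = rng.normal(size=(100, 16))           # embeddings at a reference time
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Y_t = Y_ref @ Q                              # same structure, arbitrarily rotated
aligned = align(Y_t, Y_ref)                  # now directly comparable to Y_ref
```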

Related Research Articles

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Advances in the field of deep learning have allowed neural networks to surpass many previous approaches in performance.

Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.

<span class="mw-page-title-main">Nonlinear dimensionality reduction</span> Projection of data onto lower-dimensional manifolds

Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear decomposition methods, onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.

<span class="mw-page-title-main">Autoencoder</span> Neural network that learns efficient data encoding in an unsupervised manner

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.

Hierarchical temporal memory (HTM) is a biologically constrained machine intelligence technology developed by Numenta. Originally described in the 2004 book On Intelligence by Jeff Hawkins with Sandra Blakeslee, HTM is primarily used today for anomaly detection in streaming data. The technology is based on neuroscience and the physiology and interaction of pyramidal neurons in the neocortex of the mammalian brain.

There are many types of artificial neural networks (ANN).

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is a subset of machine learning that focuses on utilizing neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

<span class="mw-page-title-main">Restricted Boltzmann machine</span> Class of artificial neural network

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced, in some cases, by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only the 25 weights of a 5 × 5 kernel are required to process 5×5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

In natural language processing, a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

A recursive neural network is a kind of deep neural network created by applying the same set of weights recursively over a structured input, to produce a structured prediction over variable-size input structures, or a scalar prediction on it, by traversing a given structure in topological order. These networks were first introduced to learn distributed representations of structure, but have been successful in multiple applications, for instance in learning sequence and tree structures in natural language processing.

Weak supervision is a paradigm in machine learning whose relevance and notability increased with the advent of large language models, due to the large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data; the remaining data is unlabeled or imprecisely labeled. Intuitively, it can be likened to an exam, with the labeled data serving as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Deep learning architecture for modelling sequential data

A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.

A latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. Position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances among the objects.

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.

<span class="mw-page-title-main">Knowledge graph embedding</span> Dimensionality reduction of graph-based semantic data objects [machine learning task]

In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.

A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.

<span class="mw-page-title-main">Vision transformer</span> Machine learning model for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

References

  1. Goodfellow, Ian (2016). Deep learning. Yoshua Bengio, Aaron Courville. Cambridge, Massachusetts. pp. 524–534. ISBN   0-262-03561-8. OCLC   955778308.
  2. Y. Bengio; A. Courville; P. Vincent (2013). "Representation Learning: A Review and New Perspectives". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (8): 1798–1828. arXiv: 1206.5538 . doi:10.1109/tpami.2013.50. PMID   23787338. S2CID   393948.
  3. Stuart J. Russell, Peter Norvig (2010) Artificial Intelligence: A Modern Approach, Third Edition, Prentice Hall ISBN   978-0-13-604259-4.
  4. Hinton, Geoffrey; Sejnowski, Terrence (1999). Unsupervised Learning: Foundations of Neural Computation. MIT Press. ISBN   978-0-262-58168-4.
  5. Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola (2004). Maximum-Margin Matrix Factorization. NIPS.
  6. Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An analysis of single-layer networks in unsupervised feature learning (PDF). Int'l Conf. on AI and Statistics (AISTATS). Archived from the original (PDF) on 2017-08-13. Retrieved 2014-11-24.
  7. Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric (2004). Visual categorization with bags of keypoints (PDF). ECCV Workshop on Statistical Learning in Computer Vision.
  8. Daniel Jurafsky; James H. Martin (2009). Speech and Language Processing. Pearson Education International. pp. 145–146.
  9. Ericsson, Linus; Gouk, Henry; Loy, Chen Change; Hospedales, Timothy M. (May 2022). "Self-Supervised Representation Learning: Introduction, advances, and challenges". IEEE Signal Processing Magazine. 39 (3): 42–62. arXiv: 2110.09327. Bibcode:2022ISPM...39c..42E. doi:10.1109/MSP.2021.3134634. ISSN 1558-0792. S2CID 239017006.
  10. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S; Dean, Jeff (2013). "Distributed Representations of Words and Phrases and their Compositionality". Advances in Neural Information Processing Systems. 26. Curran Associates, Inc. arXiv: 1310.4546.
  11. Goodfellow, Ian (2016). Deep learning. Yoshua Bengio, Aaron Courville. Cambridge, Massachusetts. pp. 499–516. ISBN 0-262-03561-8. OCLC 955778308.
  12. Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo; Zisserman, Andrew (2009). "Supervised Dictionary Learning". Advances in Neural Information Processing Systems.
  13. Percy Liang (2005). Semi-Supervised Learning for Natural Language (PDF) (M. Eng.). MIT. pp. 44–52.
  14. Joseph Turian; Lev Ratinov; Yoshua Bengio (2010). Word representations: a simple and general method for semi-supervised learning (PDF). Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Archived from the original (PDF) on 2014-02-26. Retrieved 2014-02-22.
  15. Schwenker, Friedhelm; Kestler, Hans A.; Palm, Günther (2001). "Three learning phases for radial-basis-function networks". Neural Networks. 14 (4–5): 439–458. CiteSeerX   10.1.1.109.312 . doi:10.1016/s0893-6080(01)00027-2. PMID   11411631.
  16. Coates, Adam; Ng, Andrew Y. (2012). "Learning feature representations with k-means". In G. Montavon, G. B. Orr and K.-R. Müller (ed.). Neural Networks: Tricks of the Trade. Springer.
  17. Dekang Lin; Xiaoyun Wu (2009). Phrase clustering for discriminative learning (PDF). Proc. J. Conf. of the ACL and 4th Int'l J. Conf. on Natural Language Processing of the AFNLP. pp. 1030–1038. Archived from the original (PDF) on 2016-03-03. Retrieved 2013-07-14.
  18. Roweis, Sam T; Saul, Lawrence K (2000). "Nonlinear Dimensionality Reduction by Locally Linear Embedding". Science. New Series. 290 (5500): 2323–2326. Bibcode:2000Sci...290.2323R. doi:10.1126/science.290.5500.2323. JSTOR   3081722. PMID   11125150. S2CID   5987139.
  19. Saul, Lawrence K; Roweis, Sam T (2000). "An Introduction to Locally Linear Embedding".
  20. Hyvärinen, Aapo; Oja, Erkki (2000). "Independent Component Analysis: Algorithms and Applications". Neural Networks. 13 (4): 411–430. doi:10.1016/s0893-6080(00)00026-5. PMID   10946390. S2CID   11959218.
  21. Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew Y (2007). "Efficient sparse coding algorithms". Advances in Neural Information Processing Systems.
  22. Aharon, Michal; Elad, Michael; Bruckstein, Alfred (2006). "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation". IEEE Trans. Signal Process. 54 (11): 4311–4322. Bibcode:2006ITSP...54.4311A. doi:10.1109/TSP.2006.881199. S2CID   7477309.
  23. Bengio, Yoshua (2009). "Learning Deep Architectures for AI". Foundations and Trends in Machine Learning. 2 (1): 1–127. doi:10.1561/2200000006. S2CID   207178999.
  24. Hinton, G. E.; Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with Neural Networks" (PDF). Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID 16873662. S2CID 1658773. Archived from the original (PDF) on 2015-12-23. Retrieved 2015-08-29.
  25. Lee, Honglak; Ekanadham, Chaitanya; Andrew, Ng (2008). "Sparse deep belief net model for visual area V2". Advances in Neural Information Processing Systems.
  26. Fernandez-de-Cossio-Diaz, Jorge; Cocco, Simona; Monasson, Rémi (2023-04-05). "Disentangling Representations in Restricted Boltzmann Machines without Adversaries". Physical Review X. 13 (2): 021003. arXiv: 2206.11600 . Bibcode:2023PhRvX..13b1003F. doi:10.1103/PhysRevX.13.021003.
  27. Liu, Xiao; Zhang, Fanjin; Hou, Zhenyu; Mian, Li; Wang, Zhaoyu; Zhang, Jing; Tang, Jie (2021). "Self-supervised Learning: Generative or Contrastive". IEEE Transactions on Knowledge and Data Engineering. 35 (1): 857–876. arXiv: 2006.08218. doi:10.1109/TKDE.2021.3090866. ISSN 1558-2191. S2CID 219687051.
  28. Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013-09-06). "Efficient Estimation of Word Representations in Vector Space". arXiv: 1301.3781 [cs.CL].
  29. "Improving Language Understanding by Generative Pre-Training" (PDF). Retrieved October 10, 2022.
  30. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (June 2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics: 4171–4186. doi:10.18653/v1/N19-1423. S2CID 52967399.
  31. Le, Quoc; Mikolov, Tomas (2014-06-18). "Distributed Representations of Sentences and Documents". International Conference on Machine Learning. PMLR: 1188–1196. arXiv: 1405.4053 .
  32. Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  33. Pathak, Deepak; Krahenbuhl, Philipp; Donahue, Jeff; Darrell, Trevor; Efros, Alexei A. (2016). "Context Encoders: Feature Learning by Inpainting": 2536–2544. arXiv: 1604.07379.
  34. Chen, Ting; Kornblith, Simon; Norouzi, Mohammad; Hinton, Geoffrey (2020-11-21). "A Simple Framework for Contrastive Learning of Visual Representations". International Conference on Machine Learning. PMLR: 1597–1607.
  35. Mathilde, Caron; Ishan, Misra; Julien, Mairal; Priya, Goyal; Piotr, Bojanowski; Armand, Joulin (2020). "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments". Advances in Neural Information Processing Systems. 33. arXiv: 2006.09882 .
  36. Chen, Mark; Radford, Alec; Child, Rewon; Wu, Jeffrey; Jun, Heewoo; Luan, David; Sutskever, Ilya (2020-11-21). "Generative Pretraining From Pixels". International Conference on Machine Learning. PMLR: 1691–1703.
  37. Chen, Xinlei; He, Kaiming (2021). "Exploring Simple Siamese Representation Learning": 15750–15758. arXiv: 2011.10566.
  38. Jean-Bastien, Grill; Florian, Strub; Florent, Altché; Corentin, Tallec; Pierre, Richemond; Elena, Buchatskaya; Carl, Doersch; Bernardo, Avila Pires; Zhaohan, Guo; Mohammad, Gheshlaghi Azar; Bilal, Piot; koray, kavukcuoglu; Remi, Munos; Michal, Valko (2020). "Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning". Advances in Neural Information Processing Systems. 33.
  39. Cai, HongYun; Zheng, Vincent W.; Chang, Kevin Chen-Chuan (September 2018). "A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications". IEEE Transactions on Knowledge and Data Engineering. 30 (9): 1616–1637. arXiv: 1709.07604 . doi:10.1109/TKDE.2018.2807452. ISSN   1558-2191. S2CID   13999578.
  40. Grover, Aditya; Leskovec, Jure (2016-08-13). "Node2vec". Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. Vol. 2016. New York, NY, USA: Association for Computing Machinery. pp. 855–864. doi:10.1145/2939672.2939754. ISBN   978-1-4503-4232-2. PMC   5108654 . PMID   27853626.
  41. Veličković, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep Graph Infomax. In International Conference on Learning Representations (ICLR 2019), 2019.
  42. Luo, Dezhao; Liu, Chang; Zhou, Yu; Yang, Dongbao; Ma, Can; Ye, Qixiang; Wang, Weiping (2020-04-03). "Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning". Proceedings of the AAAI Conference on Artificial Intelligence. 34 (7): 11701–11708. arXiv: 2001.00294 . doi: 10.1609/aaai.v34i07.6840 . ISSN   2374-3468. S2CID   209531629.
  43. Humam, Alwassel; Dhruv, Mahajan; Bruno, Korbar; Lorenzo, Torresani; Bernard, Ghanem; Du, Tran (2020). "Self-Supervised Learning by Cross-Modal Audio-Video Clustering". Advances in Neural Information Processing Systems. 33. arXiv: 1911.12667 .
  44. Xu, Dejing; Xiao, Jun; Zhao, Zhou; Shao, Jian; Xie, Di; Zhuang, Yueting (June 2019). "Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10326–10335. doi:10.1109/CVPR.2019.01058. ISBN   978-1-7281-3293-8. S2CID   195504152.
  45. Alexei, Baevski; Yuhao, Zhou; Abdelrahman, Mohamed; Michael, Auli (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations". Advances in Neural Information Processing Systems. 33. arXiv: 2006.11477.
  46. Zellers, Rowan; Lu, Jiasen; Lu, Ximing; Yu, Youngjae; Zhao, Yanpeng; Salehi, Mohammadreza; Kusupati, Aditya; Hessel, Jack; Farhadi, Ali; Choi, Yejin (2022). "MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound": 16375–16387. arXiv: 2201.02639.
  47. Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021-07-01). "Learning Transferable Visual Models From Natural Language Supervision". International Conference on Machine Learning. PMLR: 8748–8763. arXiv: 2103.00020.
  48. Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv: 2204.06125 [cs.CV].
  49. Zhang, Daokun; Yin, Jie; Zhu, Xingquan; Zhang, Chengqi (March 2020). "Network Representation Learning: A Survey". IEEE Transactions on Big Data. 6 (1): 3–28. arXiv: 1801.05852 . doi:10.1109/TBDATA.2018.2850013. ISSN   2332-7790. S2CID   1479507.
  50. Atzberger, Paul; Lopez, Ryan (2021). "Variational Autoencoders for Learning Nonlinear Dynamics of Physical Systems". AAAI-MLPS Proceedings. arXiv: 2012.03448 .
  51. Gürsoy, Furkan; Haddad, Mounir; Bothorel, Cécile (2023-10-07). "Alignment and stability of embeddings: Measurement and inference improvement". Neurocomputing. 553: 126517. arXiv: 2101.07251 . doi:10.1016/j.neucom.2023.126517. ISSN   0925-2312. S2CID   231632462.