Grokking (machine learning)

Last updated

In machine learning, grokking, or delayed generalization, is a transition to generalization that occurs many training iterations after the interpolation threshold, after many iterations of seemingly little progress, as opposed to the usual process where generalization occurs slowly and progressively once the interpolation threshold has been reached. [1] [2] [3]

The term derives from the word grok coined by Robert Heinlein in his novel Stranger in a Strange Land .

Grokking can be understood as a phase transition during the training process. [4] While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research. [5] [6] [7] [8]

Related Research Articles

<span class="mw-page-title-main">Neural network (machine learning)</span> Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.

<span class="mw-page-title-main">Activation function</span> Artificial neural network node function

The activation function of a node in an artificial neural network is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear. Modern activation functions include the smooth version of the ReLU, the GELU, which was used in the 2018 BERT model, the logistic (sigmoid) function used in the 2012 speech recognition model developed by Hinton et al, the ReLU used in the 2012 AlexNet computer vision model and in the 2015 ResNet model.

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is the subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

<span class="mw-page-title-main">Symbolic regression</span> Type of regression analysis

Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity.

Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANN), a widely used model in the field of machine learning. NAS has been used to design networks that are on par or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy and performance estimation strategy used:

Multi-task optimization is a paradigm in the optimization literature that focuses on solving multiple self-contained tasks simultaneously. The paradigm has been inspired by the well-established concepts of transfer learning and multi-task learning in predictive analytics.

Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.

<span class="mw-page-title-main">Federated learning</span> Decentralized machine learning

Federated learning is a sub-field of machine learning focusing on settings in which multiple entities collaboratively train a model while ensuring that their data remains decentralized. This stands in contrast to machine learning settings in which data is centrally stored. One of the primary defining characteristics of federated learning is data heterogeneity. Due to the decentralized nature of the clients' data, there is no guarantee that data samples held by each client are independently and identically distributed.

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using theoretical tools from kernel methods.

An energy-based model (EBM) (also called a Canonical Ensemble Learning(CEL) or Learning via Canonical Ensemble (LCE)) is an application of canonical ensemble formulation of statistical physics for learning from data problems. The approach prominently appears in generative models (GMs).

<span class="mw-page-title-main">Large width limits of neural networks</span> Feature of artificial neural networks

Artificial neural networks are a class of models used in machine learning, and inspired by biological neural networks. They are the core component of modern deep learning algorithms. Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. Theoretical analysis of artificial neural networks sometimes considers the limiting case that layer width becomes large or infinite. This limit enables simple analytic statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. This wide layer limit is also of practical interest, since finite width neural networks often perform strictly better as layer width is increased.

The machine learning-based attention method simulates how human attention works by assigning varying levels of importance to different components of a sequence. In natural language processing, this usually means assigning different levels of importance to different words in a sentence. It assigns importance to each word by calculating "soft" weights for the word's numerical representation, known as its embedding, within a specific section of the sentence called the context window to determine its importance. The calculation of these weights can occur simultaneously in models called transformers, or one by one in models known as recurrent neural networks. Unlike "hard" weights, which are predetermined and fixed during training, "soft" weights can adapt and change with each use of the model.

<span class="mw-page-title-main">Physics-informed neural networks</span> Technique to solve partial differential equations

Physics-informed neural networks (PINNs), also referred to as Theory-Trained Neural Networks (TTNs), are a type of universal function approximators that can embed the knowledge of any physical laws that govern a given data-set in the learning process, and can be described by partial differential equations (PDEs).They overcome the low data availability of some biological and engineering systems that makes most state-of-the-art machine learning techniques lack robustness, rendering them ineffective in these scenarios. The prior knowledge of general physical laws acts in the training of neural networks (NNs) as a regularization agent that limits the space of admissible solutions, increasing the correctness of the function approximation. This way, embedding this prior information into a neural network results in enhancing the information content of the available data, facilitating the learning algorithm to capture the right solution and to generalize well even with a low amount of training examples.

A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.

Curriculum learning is a technique in machine learning in which a model is trained on examples of increasing difficulty, where the definition of "difficulty" may be provided externally or discovered automatically as part of the training process. This is intended to attain good performance more quickly, or to converge to a better local optimum if the global optimum is not found.

References

  1. Pearce, Adam; Ghandeharioun, Asma; Hussein, Nada; Thain, Nithum; Wattenberg, Martin; Dixon, Lucas. "Do Machine Learning Models Memorize or Generalize?". pair.withgoogle.com. Retrieved 2024-06-04.
  2. Power, Alethea; Burda, Yuri; Edwards, Harri; Babuschkin, Igor; Misra, Vedant (2022-01-06). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". arXiv: 2201.02177 [cs.LG].
  3. Minegishi, Gouki; Iwasawa, Yusuke; Matsuo, Yutaka (2024-05-09). "Bridging Lottery ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?". arXiv: 2310.19470 [cs.LG].
  4. Liu, Ziming; Kitouni, Ouail; Nolte, Niklas; Michaud, Eric J.; Tegmark, Max; Williams, Mike (2022). "Towards Understanding Grokking: An Effective Theory of Representation Learning". In Koyejo, Sanmi; Mohamed, S.; Agarwal, A.; Belgrave, Danielle; Cho, K.; Oh, A. (eds.). Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022. arXiv: 2205.10343 .
  5. Fan, Simin; Pascanu, Razvan; Jaggi, Martin (2024-05-29). "Deep Grokking: Would Deep Neural Networks Generalize Better?". arXiv: 2405.19454 [cs.LG].
  6. Miller, Jack; O'Neill, Charles; Bui, Thang (2024-03-31). "Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity". arXiv: 2310.17247 [cs.LG].
  7. Liu, Ziming; Michaud, Eric J.; Tegmark, Max (2023). "Omnigrok: Grokking Beyond Algorithmic Data". The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net. arXiv: 2210.01117 .
  8. Samothrakis, Spyridon; Matran-Fernandez, Ana; Abdullahi, Umar I.; Fairbank, Michael; Fasli, Maria (2022). "Grokking-like effects in counterfactual inference". International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, July 18-23, 2022. IEEE. pp. 1–8. doi:10.1109/IJCNN55064.2022.9891910. ISBN   978-1-7281-8671-9.

See also