Grokking (machine learning)

Last updated March 30, 2025

In machine learning, grokking, or delayed generalization, is a transition to generalization that occurs many training iterations after the interpolation threshold (the point at which the model fits its training data), following a long period of seemingly little progress. This contrasts with the usual process, in which generalization improves slowly and progressively once the interpolation threshold has been reached.^[2]^[3]^[4]
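The delay described above can be made concrete with a small sketch. The helper names, the 0.99 accuracy threshold, and the accuracy curves below are illustrative assumptions, not taken from the cited papers; the sketch simply measures how many logged steps pass between fitting the training set and generalizing to the test set.

```python
def first_step_reaching(curve, threshold):
    """Return the first index at which the curve reaches the threshold, or None."""
    for step, value in enumerate(curve):
        if value >= threshold:
            return step
    return None


def generalization_delay(train_accuracy, test_accuracy, threshold=0.99):
    """Steps between fitting the training data and generalizing to test data.

    Returns None if either curve never reaches the threshold.
    The threshold is a hypothetical choice, not a standard definition.
    """
    fit_step = first_step_reaching(train_accuracy, threshold)
    generalize_step = first_step_reaching(test_accuracy, threshold)
    if fit_step is None or generalize_step is None:
        return None
    return generalize_step - fit_step


# Made-up curves: training accuracy saturates early, test accuracy jumps late.
train_curve = [0.50, 1.00, 1.00, 1.00, 1.00, 1.00]
test_curve = [0.10, 0.10, 0.10, 0.10, 0.20, 1.00]
delay = generalization_delay(train_curve, test_curve)
```

A large delay relative to the step at which the training data is fitted is the signature usually described as grokking; a delay near zero corresponds to the usual, progressive case.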

Transition

Grokking was introduced in January 2022 by OpenAI researchers investigating how neural networks perform calculations. The term derives from the word grok, coined by Robert Heinlein in his novel Stranger in a Strange Land.^[1]

Grokking can be understood as a phase transition during the training process.^[5] Although it has been thought of as largely a phenomenon of relatively shallow models, grokking has also been observed in deep neural networks and non-neural models, and it is the subject of active research.^[6]^[7]^[8]^[9]

One potential explanation is that weight decay (a component of the loss function that penalizes large values of the neural network parameters, and a form of regularization) slightly favors the general solution, which involves lower weight values but is also harder to find. According to Neel Nanda, the process of learning the general solution may be gradual, even though the transition to the general solution occurs more suddenly later.^[1]
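The weight-decay argument can be illustrated with a minimal sketch. It assumes the common L2 form of the penalty, and the two weight vectors are made-up numbers standing in for a "memorizing" and a "general" solution that fit the training data equally well; none of this is taken from the cited sources.

```python
def l2_penalty(weights, weight_decay):
    """Weight-decay term, assuming the L2 form: 0.5 * weight_decay * sum(w^2)."""
    return 0.5 * weight_decay * sum(w * w for w in weights)


def total_loss(data_loss, weights, weight_decay):
    """Data-fit loss plus the weight-decay penalty."""
    return data_loss + l2_penalty(weights, weight_decay)


def decay_only_step(weights, weight_decay, learning_rate):
    """One gradient step when the data-fit gradient is zero.

    The gradient of the L2 penalty is weight_decay * w, so each weight
    shrinks toward zero by a constant factor.
    """
    return [w - learning_rate * weight_decay * w for w in weights]


# Hypothetical solutions: both fit the training data (data loss 0.0),
# but the general one uses smaller weights.
memorizing_weights = [3.0, -4.0, 5.0]
general_weights = [0.5, -0.5, 1.0]
weight_decay = 0.01

loss_memorizing = total_loss(0.0, memorizing_weights, weight_decay)
loss_general = total_loss(0.0, general_weights, weight_decay)
```

Because both solutions have the same data loss, the penalty alone separates them, and the lower-norm solution has the lower total loss. The preference is slight when `weight_decay` is small, which is consistent with the slow drift the explanation describes.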

Recent theories^[10]^[11] hypothesize that grokking occurs when a neural network transitions from a "lazy training"^[12] regime, in which the weights do not deviate far from initialization, to a "rich" regime, in which the weights abruptly begin to move in task-relevant directions. Follow-up empirical and theoretical work^[13] has accumulated evidence in support of this perspective. It also offers a unifying view of earlier work, since the transition from lazy to rich training dynamics is known to arise from properties of adaptive optimizers,^[14] weight decay,^[15] initial parameter weight norm,^[8] and more.
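One simple way to picture the lazy and rich regimes is to track how far the weights have moved from their initial values. The sketch below is an illustrative diagnostic under that assumption; the function names are hypothetical and this is not the measurement used in the cited papers.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(x * x for x in vector))


def relative_distance_from_init(weights, initial_weights):
    """Distance of the current weights from initialization, relative to the
    norm of the initial weights.

    Values near 0 suggest lazy-style training (weights barely move);
    larger values suggest the weights have moved substantially.
    """
    if len(weights) != len(initial_weights):
        raise ValueError("weight vectors must have the same length")
    initial_norm = l2_norm(initial_weights)
    if initial_norm == 0.0:
        raise ValueError("initial weights must not all be zero")
    difference = [w - w0 for w, w0 in zip(weights, initial_weights)]
    return l2_norm(difference) / initial_norm


# Made-up flattened weight vectors for illustration.
initial = [3.0, 4.0]
barely_moved = [3.0, 4.0]
moved_far = [6.0, 8.0]

lazy_like = relative_distance_from_init(barely_moved, initial)
rich_like = relative_distance_from_init(moved_far, initial)
```

Logging this quantity over training would show a flat stretch followed by a rise if the weights do leave the neighborhood of initialization abruptly. The measure says nothing about whether the movement is in task-relevant directions.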

References

[1] Ananthaswamy, Anil (2024-04-12). "How Do Machines 'Grok' Data?". Quanta Magazine. Retrieved 2025-01-21.
[2] Pearce, Adam; Ghandeharioun, Asma; Hussein, Nada; Thain, Nithum; Wattenberg, Martin; Dixon, Lucas. "Do Machine Learning Models Memorize or Generalize?". pair.withgoogle.com. Retrieved 2024-06-04.
[3] Power, Alethea; Burda, Yuri; Edwards, Harri; Babuschkin, Igor; Misra, Vedant (2022-01-06). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". arXiv:2201.02177 [cs.LG].
[4] Minegishi, Gouki; Iwasawa, Yusuke; Matsuo, Yutaka (2024-05-09). "Bridging Lottery ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?". arXiv:2310.19470 [cs.LG].
[5] Liu, Ziming; Kitouni, Ouail; Nolte, Niklas; Michaud, Eric J.; Tegmark, Max; Williams, Mike (2022). "Towards Understanding Grokking: An Effective Theory of Representation Learning". In Koyejo, Sanmi; Mohamed, S.; Agarwal, A.; Belgrave, Danielle; Cho, K.; Oh, A. (eds.). Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022. arXiv:2205.10343.
[6] Fan, Simin; Pascanu, Razvan; Jaggi, Martin (2024-05-29). "Deep Grokking: Would Deep Neural Networks Generalize Better?". arXiv:2405.19454 [cs.LG].
[7] Miller, Jack; O'Neill, Charles; Bui, Thang (2024-03-31). "Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity". arXiv:2310.17247 [cs.LG].
[8] Liu, Ziming; Michaud, Eric J.; Tegmark, Max (2023). "Omnigrok: Grokking Beyond Algorithmic Data". The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net. arXiv:2210.01117.
[9] Samothrakis, Spyridon; Matran-Fernandez, Ana; Abdullahi, Umar I.; Fairbank, Michael; Fasli, Maria (2022). "Grokking-like effects in counterfactual inference". International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, July 18-23, 2022. IEEE. pp. 1–8. doi:10.1109/IJCNN55064.2022.9891910. ISBN 978-1-7281-8671-9.
[10] Kumar, Tanishq; Bordelon, Blake; Gershman, Samuel J.; Pehlevan, Cengiz (2024-04-11). "Grokking as the Transition from Lazy to Rich Training Dynamics". arXiv:2310.06110. doi:10.48550/arXiv.2310.06110. Retrieved 2025-02-17.
[11] Lyu, Kaifeng; Jin, Jikai; Li, Zhiyuan; Du, Simon S.; Lee, Jason D.; Hu, Wei (2024-04-02). "Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking". arXiv:2311.18817. doi:10.48550/arXiv.2311.18817. Retrieved 2025-02-17.
[12] Chizat, Lenaic; Oyallon, Edouard; Bach, Francis (2020-01-07). "On Lazy Training in Differentiable Programming". arXiv:1812.07956. doi:10.48550/arXiv.1812.07956. Retrieved 2025-02-17.
[13] Mohamadi, Mohamad Amin; Li, Zhiyuan; Wu, Lei; Sutherland, Danica J. (2024-07-17). "Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition". arXiv:2407.12332. doi:10.48550/arXiv.2407.12332. Retrieved 2025-02-17.
[14] Thilak, Vimal; Littwin, Etai; Zhai, Shuangfei; Saremi, Omid; Paiss, Roni; Susskind, Joshua (2022-06-13). "The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon". arXiv:2206.04817. doi:10.48550/arXiv.2206.04817. Retrieved 2025-02-17.
[15] Varma, Vikrant; Shah, Rohin; Kenton, Zachary; Kramár, János; Kumar, Ramana (2023-09-05). "Explaining grokking through circuit efficiency". arXiv:2309.02390. doi:10.48550/arXiv.2309.02390. Retrieved 2025-02-17.


