Curriculum learning

Curriculum learning is a technique in machine learning in which a model is trained on examples of increasing difficulty, where the definition of "difficulty" may be provided externally or discovered automatically as part of the training process. This is intended to attain good performance more quickly, or to converge to a better local optimum if the global optimum is not found. [1] [2]

Approach

Most generally, curriculum learning is the technique of successively increasing the difficulty of the training examples presented to a model over multiple training iterations. Under some circumstances this can produce better results than exposing the model to the full training set immediately, most typically when the model can learn general principles from easier examples first and then gradually incorporate more complex and nuanced information, such as edge cases, as harder examples are introduced. This has been shown to work in many domains, most likely as a form of regularization. [3]
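A minimal sketch of this idea, assuming a per-example difficulty score and a generic training routine are available (the `difficulty` and `train_one_epoch` names below are hypothetical placeholders, not part of any particular library):

```python
# Minimal curriculum-learning sketch: sort the training set by a supplied
# difficulty score and gradually widen the pool of examples the model sees.
# difficulty(example) and train_one_epoch(model, examples) are hypothetical
# stand-ins for a real scoring heuristic and training loop.

def curriculum_train(model, dataset, difficulty, train_one_epoch, num_epochs=10):
    ordered = sorted(dataset, key=difficulty)  # easiest examples first
    for epoch in range(num_epochs):
        # Grow the pool linearly from the easiest 10% to the full dataset.
        fraction = min(1.0, 0.1 + 0.9 * epoch / max(1, num_epochs - 1))
        pool = ordered[: max(1, int(fraction * len(ordered)))]
        train_one_epoch(model, pool)
    return model
```

The schedule here is fixed and linear purely for illustration; in practice the pacing may itself be tuned or adapted during training.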

There are several major variations in how the technique is applied:

Since curriculum learning only concerns the selection and ordering of training data, it can be combined with many other techniques in machine learning. The success of the method assumes that a model trained for an easier version of the problem can generalize to harder versions, so it can be seen as a form of transfer learning. Some authors also consider curriculum learning to include other forms of progressively increasing complexity, such as increasing the number of model parameters. [11] It is frequently combined with reinforcement learning, such as learning a simplified version of a game first. [12]
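In a reinforcement-learning setting, for instance, a curriculum can be expressed as a sequence of progressively harder task configurations through which the same agent is trained; the sketch below assumes hypothetical `make_env` and `train_agent` helpers:

```python
# Hypothetical reinforcement-learning curriculum: train on a simplified version
# of the task first, then on progressively harder configurations, carrying the
# learned policy forward between stages (a form of transfer).

def curriculum_rl(agent, make_env, train_agent,
                  levels=("easy", "medium", "full"), steps_per_level=100_000):
    for level in levels:
        env = make_env(level)  # e.g. smaller board, fewer opponents, shorter horizon
        train_agent(agent, env, steps=steps_per_level)
    return agent
```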

Some domains have shown success with anti-curriculum learning: training on the most difficult examples first. One example is the ACCAN method for speech recognition, which trains on the examples with the lowest signal-to-noise ratio first. [13]
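Continuing the earlier sketch, an anti-curriculum only reverses the ordering so that the hardest examples are presented first:

```python
# Anti-curriculum variant of the earlier sketch: hardest examples first,
# e.g. ordering speech samples from noisiest (lowest SNR) to cleanest.
ordered = sorted(dataset, key=difficulty, reverse=True)
```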

History

The term "curriculum learning" was introduced by Yoshua Bengio et al in 2009, [14] with reference to the psychological technique of shaping in animals and structured education for humans: beginning with the simplest concepts and then building on them. The authors also note that the application of this technique in machine learning has its roots in the early study of neural networks such as Jeffrey Elman's 1993 paper Learning and development in neural networks: the importance of starting small. [15] Bengio et al showed good results for problems in image classification, such as identifying geometric shapes with progressively more complex forms, and language modeling, such as training with a gradually expanding vocabulary. They conclude that, for curriculum strategies, "their beneficial effect is most pronounced on the test set", suggesting good generalization.

The technique has since been applied to many other domains:

References

  1. Guo, Sheng; Huang, Weilin; Zhang, Haozhi; Zhuang, Chenfan; Dong, Dengke; Scott, Matthew R.; Huang, Dinglong (2018). "CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images". arXiv:1808.01097 [cs.CV].
  2. "Competence-based curriculum learning for neural machine translation". Retrieved March 29, 2024.
  3. Bengio, Yoshua; Louradour, Jérôme; Collobert, Ronan; Weston, Jason (2009). "Curriculum Learning". Proceedings of the 26th Annual International Conference on Machine Learning. pp. 41–48. doi:10.1145/1553374.1553380. ISBN 978-1-60558-516-1. Retrieved March 24, 2024.
  4. "Curriculum learning of multiple tasks". Retrieved March 29, 2024.
  5. Ionescu, Radu Tudor; Alexe, Bogdan; Leordeanu, Marius; Popescu, Marius; Papadopoulos, Dim P.; Ferrari, Vittorio (2016). "How Hard Can It Be? Estimating the Difficulty of Visual Search in an Image". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2157–2166. doi:10.1109/CVPR.2016.237. ISBN 978-1-4673-8851-1. Retrieved March 29, 2024.
  6. "Baby Steps: How "Less is More" in unsupervised dependency parsing" (PDF). Retrieved March 29, 2024.
  7. "Self-paced learning for latent variable models". 6 December 2010. pp. 1189–1197. Retrieved March 29, 2024.
  8. Tang, Ye; Yang, Yu-Bin; Gao, Yang (2012). "Self-paced dictionary learning for image classification". Proceedings of the 20th ACM International Conference on Multimedia. pp. 833–836. doi:10.1145/2393347.2396324. ISBN 978-1-4503-1089-5. Retrieved March 29, 2024.
  9. "Curriculum learning with diversity for supervised computer vision tasks". Retrieved March 29, 2024.
  10. "Self-paced Curriculum Learning". Retrieved March 29, 2024.
  11. Soviany, Petru; Ionescu, Radu Tudor; Rota, Paolo; Sebe, Nicu (2021). "Curriculum Learning: A Survey". arXiv:2101.10382 [cs.LG].
  12. Narvekar, Sanmit; Peng, Bei; Leonetti, Matteo; Sinapov, Jivko; Taylor, Matthew E.; Stone, Peter (January 2020). "Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey". The Journal of Machine Learning Research. 21 (1): 181:7382–181:7431. arXiv:2003.04960. Retrieved March 29, 2024.
  13. "A Curriculum Learning Method for Improved Noise Robustness in Automatic Speech Recognition". Retrieved March 29, 2024.
  14. Bengio, Yoshua; Louradour, Jérôme; Collobert, Ronan; Weston, Jason (2009). "Curriculum Learning". Proceedings of the 26th Annual International Conference on Machine Learning. pp. 41–48. doi:10.1145/1553374.1553380. ISBN 978-1-60558-516-1. Retrieved March 24, 2024.
  15. Elman, J. L. (1993). "Learning and development in neural networks: the importance of starting small". Cognition. 48 (1): 71–99. doi:10.1016/0010-0277(93)90058-4. PMID 8403835. Retrieved March 29, 2024.
  16. "Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning". Retrieved March 29, 2024.
  17. Gong, Yantao; Liu, Cao; Yuan, Jiazhen; Yang, Fan; Cai, Xunliang; Wan, Guanglu; Chen, Jiansong; Niu, Ruiyao; Wang, Houfeng (2021). "Density-based dynamic curriculum learning for intent detection". Proceedings of the 30th ACM International Conference on Information & Knowledge Management. pp. 3034–3037. arXiv:2108.10674. doi:10.1145/3459637.3482082. ISBN 978-1-4503-8446-9. Retrieved March 29, 2024.
  18. "Visualizing and understanding curriculum learning for long short-term memory networks". Retrieved March 29, 2024.
  19. "An empirical exploration of curriculum learning for neural machine translation". Retrieved March 29, 2024.
  20. "Reinforcement learning based curriculum optimization for neural machine translation". Retrieved March 29, 2024.
  21. "A curriculum learning method for improved noise robustness in automatic speech recognition". Retrieved March 29, 2024.
  22. Huang, Yuge; Wang, Yuhan; Tai, Ying; Liu, Xiaoming; Shen, Pengcheng; Li, Shaoxin; Li, Jilin; Huang, Feiyue (2020). "CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition". 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5900–5909. arXiv:2004.00288. doi:10.1109/CVPR42600.2020.00594. ISBN 978-1-7281-7168-5. Retrieved March 29, 2024.
  23. "Curriculum self-paced learning for cross-domain object detection". Retrieved March 29, 2024.
  24. "Automatic curriculum graph generation for reinforcement learning agents". 4 February 2017. pp. 2590–2596. Retrieved March 29, 2024.
  25. Gong, Chen; Yang, Jian; Tao, Dacheng (2019). "Multi-modal curriculum learning over graphs". ACM Transactions on Intelligent Systems and Technology. 10 (4): 1–25. doi:10.1145/3322122. Retrieved March 29, 2024.
  26. Qu, Meng; Tang, Jian; Han, Jiawei (2018). Curriculum learning for heterogeneous star network embedding via deep reinforcement learning. pp. 468–476. doi:10.1145/3159652.3159711. hdl:2142/101634. ISBN 978-1-4503-5581-0. Retrieved March 29, 2024.
  27. Self-paced learning for matrix factorization. 25 January 2015. pp. 3196–3202. ISBN 978-0-262-51129-2. Retrieved March 29, 2024.

Further reading