Transfer learning

Illustration of transfer learning

Transfer learning (TL) is a technique in machine learning (ML) in which knowledge learned from one task is reused to improve performance on a related task. [1] For example, in image classification, knowledge gained while learning to recognize cars could be applied when learning to recognize trucks. The topic is related to the psychological literature on transfer of learning, although practical ties between the two fields are limited. Reusing or transferring information from previously learned tasks to new tasks has the potential to significantly improve learning efficiency. [2]
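As a minimal sketch of the idea, assume two closely related one-parameter regression tasks (a hypothetical "source" task y = 2x and "target" task y = 2.2x, chosen only for illustration). The weight learned on the source task is reused as the starting point for the target task and compared with training from zero under the same small step budget. This is only one simple form of transfer (parameter reuse), not a general recipe.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, steps, lr=0.05):
    """Run plain gradient descent on the MSE, starting from weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
source = [(x, 2.0 * x) for x in xs]   # hypothetical source task
target = [(x, 2.2 * x) for x in xs]   # hypothetical, related target task

# Learn the source task thoroughly.
w_source = train(0.0, source, steps=200)

# Train on the target task with the same small budget, with and without transfer.
w_scratch = train(0.0, target, steps=3)
w_transfer = train(w_source, target, steps=3)

print("target error before any target training:",
      mse(0.0, target), "(from zero) vs", mse(w_source, target), "(transferred)")
print("target error after 3 steps:",
      mse(w_scratch, target), "(from zero) vs", mse(w_transfer, target), "(transferred)")
```

Because the two tasks are similar, the transferred weight starts close to the target solution and stays ahead for the same number of steps. If the tasks were unrelated, the reused weight could just as well slow learning down.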

Because transfer learning involves training with multiple objective functions, it is related to cost-sensitive machine learning and multi-objective optimization. [3]

History

In 1976, Bozinovski and Fulgosi published a paper addressing transfer learning in neural network training. [4] [5] The paper gives a mathematical and geometrical model of the topic. In 1981, a report considered the application of transfer learning to a dataset of images representing letters of computer terminals, experimentally demonstrating positive and negative transfer learning. [6]

In 1992, Pratt formulated the discriminability-based transfer (DBT) algorithm. [7]

In 1997, Pratt and Thrun guest-edited a special issue of Machine Learning devoted to transfer learning, [8] and by 1998, the field had advanced to include multi-task learning, [9] along with more formal theoretical foundations. [10] Learning to Learn, [11] edited by Thrun and Pratt, is a 1998 review of the subject.

Transfer learning has been applied in cognitive science. Pratt guest-edited an issue of Connection Science on reuse of neural networks through transfer in 1996. [12]

Andrew Ng said in his NIPS 2016 tutorial [13] [14] [15] that TL would become the next driver of commercial success in machine learning, after supervised learning.

In the 2020 paper "Rethinking Pre-Training and Self-Training", [16] Zoph et al. reported that pre-training can hurt accuracy and advocated self-training instead.

Applications

Algorithms are available for transfer learning in Markov logic networks [17] and Bayesian networks. [18] Transfer learning has been applied to cancer subtype discovery, [19] building utilization, [20] [21] general game playing, [22] text classification, [23] [24] digit recognition, [25] medical imaging and spam filtering. [26]

In 2020, it was discovered that, owing to their similar physical natures, transfer learning is possible between electromyographic (EMG) signals from the muscles and electroencephalographic (EEG) brainwaves, that is, between the gesture recognition domain and the mental state recognition domain. The relationship was noted to work in both directions, showing that EEG can likewise be used to classify EMG. [27] The experiments noted that the accuracy of neural networks and convolutional neural networks was improved [28] through transfer learning both prior to any learning (compared to a standard random weight distribution) and at the end of the learning process (the asymptote). That is, results are improved by exposure to another domain. Moreover, the end user of a pre-trained model can change the structure of its fully connected layers to improve performance. [29]
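One common way to reuse a pre-trained model is to keep its earlier layers fixed and train only a new final layer for the target task. The sketch below illustrates that pattern under loud assumptions: `backbone` is a hypothetical stand-in for frozen pre-trained layers (here just a fixed feature map), and `fit_head` trains a fresh linear output layer on top of it. It is not the method of the studies cited above.

```python
def backbone(x):
    """Hypothetical frozen 'pre-trained' layers: map an input to features."""
    return [x, x * x]


def predict(head, x):
    """Apply the trainable linear head to the frozen features."""
    return sum(w * f for w, f in zip(head, backbone(x)))


def fit_head(data, steps=2000, lr=0.01):
    """Train only the new head with gradient descent; the backbone never changes."""
    head = [0.0, 0.0]
    for _ in range(steps):
        grads = [0.0, 0.0]
        for x, y in data:
            err = predict(head, x) - y
            for i, f in enumerate(backbone(x)):
                grads[i] += 2 * err * f / len(data)
        head = [w - lr * g for w, g in zip(head, grads)]
    return head


# Hypothetical target task that the frozen features can express: y = 3x^2 - x.
target = [(x, 3.0 * x * x - x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
head = fit_head(target)
print("learned head weights:", head)
```

Only two numbers are trained here, which is the point of the design: when the reused features already suit the new task, a small replacement head can be fitted with little data and computation.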

Software

Transfer learning and domain adaptation

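Several compilations of transfer learning and domain adaptation algorithms have been implemented, including ADAPT (Awesome Domain Adaptation Python Toolbox), [30] Transfer-learning-library [31] and Domain adaptation toolbox. [32]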

References

  1. West, Jeremy; Ventura, Dan; Warnick, Sean (2007). "Spring Research Presentation: A Theoretical Foundation for Inductive Transfer". Brigham Young University, College of Physical and Mathematical Sciences. Archived from the original on 2007-08-01. Retrieved 2007-08-05.
  2. George Karimpanal, Thommen; Bouffanais, Roland (2019). "Self-organizing maps for storage and transfer of knowledge in reinforcement learning". Adaptive Behavior. 27 (2): 111–126. arXiv:1811.08318. doi:10.1177/1059712318818568. ISSN 1059-7123. S2CID 53774629.
  3. Cost-Sensitive Machine Learning. (2011). USA: CRC Press, Page 63, https://books.google.de/books?id=8TrNBQAAQBAJ&pg=PA63
  4. Stevo Bozinovski and Ante Fulgosi (1976). "The influence of pattern similarity and transfer learning upon the training of a base perceptron B2." (original in Croatian) Proceedings of Symposium Informatica 3-121-5, Bled.
  5. Stevo Bozinovski (2020) "Reminder of the first paper on transfer learning in neural networks, 1976". Informatica 44: 291–302.
  6. S. Bozinovski (1981). "Teaching space: A representation concept for adaptive pattern classification." COINS Technical Report, the University of Massachusetts at Amherst, No 81-28 [available online: UM-CS-1981-028.pdf]
  7. Pratt, L. Y. (1992). "Discriminability-based transfer between neural networks" (PDF). NIPS Conference: Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers. pp. 204–211.
  8. Pratt, L. Y.; Thrun, Sebastian (July 1997). "Machine Learning - Special Issue on Inductive Transfer". link.springer.com. Springer. Retrieved 2017-08-10.
  9. Caruana, R., "Multitask Learning", pp. 95-134 in Thrun & Pratt 2012
  10. Baxter, J., "Theoretical Models of Learning to Learn", pp. 71-95 in Thrun & Pratt 2012
  11. Thrun & Pratt 2012.
  12. Pratt, L. (1996). "Special Issue: Reuse of Neural Networks through Transfer". Connection Science. 8 (2). Retrieved 2017-08-10.
  13. NIPS 2016 tutorial: "Nuts and bolts of building AI applications using Deep Learning" by Andrew Ng, archived from the original on 2021-12-19, retrieved 2019-12-28
  14. "NIPS 2016 Schedule". nips.cc. Retrieved 2019-12-28.
  15. Nuts and bolts of building AI applications using Deep Learning, slides
  16. Zoph, Barret (2020). "Rethinking pre-training and self-training" (PDF). Advances in Neural Information Processing Systems. 33: 3833–3845. arXiv:2006.06882. Retrieved 2022-12-20.
  17. Mihalkova, Lilyana; Huynh, Tuyen; Mooney, Raymond J. (July 2007), "Mapping and Revising Markov Logic Networks for Transfer" (PDF), Learning Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-2007), Vancouver, BC, pp. 608–614, retrieved 2007-08-05
  18. Niculescu-Mizil, Alexandru; Caruana, Rich (March 21–24, 2007), "Inductive Transfer for Bayesian Network Structure Learning" (PDF), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), retrieved 2007-08-05
  19. Hajiramezanali, E.; Dadaneh, S. Z.; Karbalayghareh, A.; Zhou, Z.; Qian, X. "Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data". 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada. arXiv:1810.09433
  20. Arief-Ang, I.B.; Salim, F.D.; Hamilton, M. (2017-11-08). DA-HOC: semi-supervised domain adaptation for room occupancy prediction using CO2 sensor data. 4th ACM International Conference on Systems for Energy-Efficient Built Environments (BuildSys). Delft, Netherlands. pp. 1–10. doi:10.1145/3137133.3137146. ISBN 978-1-4503-5544-5.
  21. Arief-Ang, I.B.; Hamilton, M.; Salim, F.D. (2018-12-01). "A Scalable Room Occupancy Prediction with Transferable Time Series Decomposition of CO2 Sensor Data". ACM Transactions on Sensor Networks. 14 (3–4): 21:1–21:28. doi:10.1145/3217214. S2CID 54066723.
  22. Banerjee, Bikramjit, and Peter Stone. "General Game Learning Using Knowledge Transfer." IJCAI. 2007.
  23. Do, Chuong B.; Ng, Andrew Y. (2005). "Transfer learning for text classification". Neural Information Processing Systems Foundation, NIPS*2005 (PDF). Retrieved 2007-08-05.
  24. Rajat, Raina; Ng, Andrew Y.; Koller, Daphne (2006). "Constructing Informative Priors using Transfer Learning". Twenty-third International Conference on Machine Learning (PDF). Retrieved 2007-08-05.
  25. Maitra, D. S.; Bhattacharya, U.; Parui, S. K. (August 2015). "CNN based common approach to handwritten character recognition of multiple scripts". 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 1021–1025. doi:10.1109/ICDAR.2015.7333916. ISBN 978-1-4799-1805-8. S2CID 25739012.
  26. Bickel, Steffen (2006). "ECML-PKDD Discovery Challenge 2006 Overview". ECML-PKDD Discovery Challenge Workshop (PDF). Retrieved 2007-08-05.
  27. Bird, Jordan J.; Kobylarz, Jhonatan; Faria, Diego R.; Ekart, Aniko; Ribeiro, Eduardo P. (2020). "Cross-Domain MLP and CNN Transfer Learning for Biological Signal Processing: EEG and EMG". IEEE Access. 8. Institute of Electrical and Electronics Engineers (IEEE): 54789–54801. doi:10.1109/access.2020.2979074. ISSN 2169-3536.
  28. Maitra, Durjoy Sen; Bhattacharya, Ujjwal; Parui, Swapan K. (August 2015). "CNN based common approach to handwritten character recognition of multiple scripts". 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 1021–1025. doi:10.1109/ICDAR.2015.7333916. ISBN 978-1-4799-1805-8. S2CID 25739012.
  29. Kabir, H. M. Dipu; Abdar, Moloud; Jalali, Seyed Mohammad Jafar; Khosravi, Abbas; Atiya, Amir F.; Nahavandi, Saeid; Srinivasan, Dipti (January 7, 2022). "SpinalNet: Deep Neural Network with Gradual Input". IEEE Transactions on Artificial Intelligence: 1–13. arXiv:2007.03347. doi:10.1109/TAI.2022.3185179. S2CID 220381239.
  30. de Mathelin, Antoine and Deheeger, François and Richard, Guillaume and Mougeot, Mathilde and Vayatis, Nicolas (2020) "ADAPT: Awesome Domain Adaptation Python Toolbox"
  31. Mingsheng Long Junguang Jiang, Bo Fu. (2020) "Transfer-learning-library"
  32. Ke Yan. (2016) "Domain adaptation toolbox"

Sources