Error-driven learning

In reinforcement learning, error-driven learning is a method for adjusting a model's (intelligent agent's) parameters based on the difference between its predicted output and the ground truth. These models stand out because they depend on environmental feedback rather than explicit labels or categories. [1] They are based on the idea that language acquisition involves the minimization of prediction error. [2] By leveraging these prediction errors, the models continually refine their expectations and decrease computational complexity. Typically, these algorithms are implemented using the GeneRec algorithm. [3]

Error-driven learning has widespread applications in cognitive sciences and computer vision. These methods have also found successful application in natural language processing (NLP), including areas like part-of-speech tagging, [4] parsing, [4] named entity recognition (NER), [5] machine translation (MT), [6] speech recognition (SR), [4] and dialogue systems. [7]

Formal definition

Error-driven learning models rely on the feedback of prediction errors to adjust a model's expectations or parameters. The key components of error-driven learning are a prediction (the model's expected output for a given input), an observed outcome (the actual output or ground truth), an error signal measuring the discrepancy between the two, and an update rule that adjusts the model's parameters in proportion to that error.
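As a concrete illustration of this cycle (not drawn from the cited sources), the following minimal Python sketch applies a delta-rule update to a linear predictor: the model makes a prediction, the prediction error is measured against the observed outcome, and the parameters are adjusted in proportion to that error. The function name, learning rate, and toy data are illustrative assumptions.

    import numpy as np

    def delta_rule_update(weights, features, target, learning_rate=0.1):
        """One error-driven update: predict, measure the error, adjust parameters."""
        prediction = weights @ features                            # the model's expectation
        error = target - prediction                                # prediction error (feedback signal)
        new_weights = weights + learning_rate * error * features   # adjust in proportion to the error
        return new_weights, error

    # Toy usage: repeated updates steadily shrink the prediction error on one example.
    rng = np.random.default_rng(0)
    w = rng.normal(size=3)
    x = np.array([1.0, 0.5, -0.2])
    for step in range(5):
        w, e = delta_rule_update(w, x, target=1.0)
        print(f"step {step}: error = {e:+.4f}")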

Algorithms

Error-driven learning algorithms refer to a category of reinforcement learning algorithms that leverage the disparity between the real output and the expected output of a system to regulate the system's parameters. Typically applied in supervised learning, these algorithms are provided with a collection of input-output pairs to facilitate the process of generalization. [2]

A widely cited error-driven learning algorithm is GeneRec (generalized recirculation), which approximates error backpropagation using only locally available differences in unit activations, making it biologically plausible. Many other error-driven learning algorithms can be derived as variants of GeneRec. [3]
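The core GeneRec weight update, as described by O'Reilly (1996), compares unit activations between a "minus" phase (the network's own expectation) and a "plus" phase (with the target clamped) and changes each weight using only these locally available differences. The sketch below shows that update rule in isolation; the network settling process that produces the two sets of activations is assumed to happen elsewhere and is omitted here.

    import numpy as np

    def generec_update(weights, pre_minus, post_minus, post_plus, lrate=0.1):
        """Simplified GeneRec-style weight update (sketch, settling dynamics omitted).

        pre_minus  -- sending-layer activations in the minus (expectation) phase
        post_minus -- receiving-layer activations in the minus phase
        post_plus  -- receiving-layer activations in the plus (outcome) phase,
                      i.e. with the target clamped
        The change to weights[i, j] is proportional to
        pre_minus[i] * (post_plus[j] - post_minus[j]).
        """
        delta_w = lrate * np.outer(pre_minus, post_plus - post_minus)
        return weights + delta_w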

Applications

Cognitive science

Even simple error-driven learning models can effectively capture complex human cognitive phenomena and anticipate behaviors that are otherwise difficult to predict. They provide a flexible mechanism for modeling the brain's learning process, encompassing perception, attention, memory, and decision-making. By using errors as guiding signals, these algorithms adeptly adapt to changing environmental demands and objectives, capturing statistical regularities and structure. [2]

Furthermore, cognitive science has led to the creation of new error-driven learning algorithms that are both biologically plausible and computationally efficient. These algorithms, including deep belief networks, spiking neural networks, and reservoir computing, follow the principles and constraints of the brain and nervous system. Their primary aim is to capture the emergent properties and dynamics of neural circuits and systems. [2] [8]

Computer vision

Computer vision is a complex task that involves understanding and interpreting visual data, such as images or videos. [9]

In the context of error-driven learning, the computer vision model learns from the mistakes it makes during the interpretation process. When an error is encountered, the model updates its internal parameters to avoid making the same mistake in the future. This repeated process of learning from errors helps improve the model’s performance over time. [9]
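A minimal sketch of such an error-driven loop for a toy image classifier is shown below, using plain Python/NumPy rather than any particular vision library; the random "images", label set, and learning rate are placeholder assumptions, not a real dataset or model.

    import numpy as np

    rng = np.random.default_rng(0)
    n_pixels, n_classes = 64, 3                     # toy 8x8 "images" with 3 classes
    images = rng.normal(size=(100, n_pixels))       # placeholder data, not a real dataset
    labels = rng.integers(0, n_classes, size=100)
    W = np.zeros((n_pixels, n_classes))             # the model's internal parameters

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    for epoch in range(10):
        probs = softmax(images @ W)                  # the model's current interpretation
        onehot = np.eye(n_classes)[labels]
        error = probs - onehot                       # prediction error on every image
        W -= 0.01 * images.T @ error / len(images)   # update parameters to reduce that error
        accuracy = (probs.argmax(axis=1) == labels).mean()
        print(f"epoch {epoch}: training accuracy = {accuracy:.2f}")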

Computer vision models that perform well at this task typically employ deep learning techniques. This form of computer vision is sometimes called neural computer vision (NCV), since it makes use of neural networks. NCV therefore interprets visual data based on a statistical, trial-and-error approach and can handle context and other subtleties of visual data. [9]

Natural language processing

Part-of-speech tagging

Part-of-speech (POS) tagging is a crucial component in natural language processing (NLP). It helps resolve ambiguity in human language at different levels of analysis. In addition, its output (tagged data) can be used in various NLP applications such as information extraction, information retrieval, question answering, speech recognition, text-to-speech conversion, partial parsing, and grammar correction. [4]
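One simple error-driven approach to POS tagging (a generic illustration, not necessarily the method used in the cited work) is a perceptron-style tagger that changes its weights only when the tag it predicts for a word differs from the gold tag. The feature template below is a deliberately minimal assumption.

    from collections import defaultdict

    def features(words, i):
        """Deliberately minimal feature template: the word and its neighbours."""
        prev_word = words[i - 1] if i > 0 else "<s>"
        next_word = words[i + 1] if i < len(words) - 1 else "</s>"
        return [f"w={words[i]}", f"prev={prev_word}", f"next={next_word}"]

    weights = defaultdict(float)      # maps (feature, tag) -> weight

    def predict(words, i, tagset):
        scores = {t: sum(weights[(f, t)] for f in features(words, i)) for t in tagset}
        return max(scores, key=scores.get)

    def train(sentences, tagset, epochs=5):
        """Perceptron-style error-driven training: weights change only on mistakes."""
        for _ in range(epochs):
            for words, gold_tags in sentences:
                for i, gold in enumerate(gold_tags):
                    guess = predict(words, i, tagset)
                    if guess != gold:                     # an error was encountered
                        for f in features(words, i):
                            weights[(f, gold)] += 1.0     # reinforce the correct tag
                            weights[(f, guess)] -= 1.0    # penalise the wrong one

    train([(["the", "dog", "barks"], ["DET", "NOUN", "VERB"])], tagset=["DET", "NOUN", "VERB"])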

Parsing

Parsing in NLP involves breaking down a text into smaller pieces (phrases) based on grammar rules. If a sentence cannot be parsed, it may contain grammatical errors.

In the context of error-driven learning, the parser learns from the mistakes it makes during the parsing process. When an error is encountered, the parser updates its internal model to avoid making the same mistake in the future. This iterative process of learning from errors helps improve the parser’s performance over time. [4]
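The sketch below illustrates this idea schematically for a transition-based parser: whenever the highest-scoring transition disagrees with the oracle (correct) transition, the weights are nudged toward the oracle choice and away from the mistake. The transition set and hand-written features are illustrative assumptions rather than a complete parser.

    from collections import defaultdict

    TRANSITIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC"]   # illustrative transition set
    weights = defaultdict(float)                        # maps (feature, transition) -> weight

    def score(feats, transition):
        return sum(weights[(f, transition)] for f in feats)

    def update_on_error(feats, oracle):
        """Error-driven step: if the best-scoring transition is wrong,
        move the weights toward the oracle transition."""
        predicted = max(TRANSITIONS, key=lambda t: score(feats, t))
        if predicted != oracle:                         # the parser made a mistake
            for f in feats:
                weights[(f, oracle)] += 1.0
                weights[(f, predicted)] -= 1.0
        return predicted

    # One parser configuration, described by hand-written features.
    update_on_error(["stack_top=dog", "buffer_front=barks"], oracle="LEFT-ARC")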

In conclusion, error-driven learning plays a crucial role in improving the accuracy and efficiency of NLP parsers by allowing them to learn from their mistakes and adapt their internal models accordingly.

Named entity recognition (NER)

NER is the task of identifying and classifying entities (such as persons, locations, organizations, etc.) in a text. Error-driven learning can help a model learn from its false positives and false negatives and improve its precision and recall on NER. [5]
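As a concrete illustration of learning from false positives and false negatives, the sketch below compares predicted entity spans with gold spans; the resulting error sets are the feedback an error-driven NER model would use to adjust its parameters. The (start, end, label) span format is an assumption made for the example.

    def entity_errors(predicted, gold):
        """Compare predicted and gold entities, each a set of (start, end, label) spans."""
        false_positives = predicted - gold   # spurious entities the model invented
        false_negatives = gold - predicted   # real entities the model missed
        return false_positives, false_negatives

    gold_spans = {(0, 2, "PER"), (5, 6, "LOC")}
    predicted_spans = {(0, 2, "PER"), (3, 4, "ORG")}
    fp, fn = entity_errors(predicted_spans, gold_spans)
    print("false positives:", fp)   # {(3, 4, 'ORG')}
    print("false negatives:", fn)   # {(5, 6, 'LOC')}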

In the context of error-driven learning, the significance of NER is quite profound. Traditional sequence labeling methods identify nested entities layer by layer. If an error occurs in the recognition of an inner entity, it can lead to incorrect identification of the outer entity, leading to a problem known as error propagation of nested entities. [10] [11]

This is where the role of NER becomes crucial in error-driven learning. By accurately recognizing and classifying entities, it can help minimize these errors and improve the overall accuracy of the learning process. Furthermore, deep learning-based NER methods have been shown to be more accurate, as they compose word representations in a way that better captures the semantic and syntactic relationships between words. [10] [11]

Machine translation

Machine translation is a complex task that involves converting text from one language to another. [6] In the context of error-driven learning, the machine translation model learns from the mistakes it makes during the translation process. When an error is encountered, the model updates its internal parameters to avoid making the same mistake in the future. This iterative process of learning from errors helps improve the model’s performance over time. [12]
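In neural machine translation, the per-token error signal is commonly the cross-entropy between the model's predicted distribution over the next target-language token and the reference token; gradient updates then reduce exactly this error. The tiny vocabulary and probabilities below are made up for illustration.

    import numpy as np

    def token_error(predicted_probs, reference_index):
        """Cross-entropy of the reference token under the model's prediction."""
        return -np.log(predicted_probs[reference_index])

    vocab = ["<eos>", "le", "chat", "chien"]        # made-up target vocabulary
    probs = np.array([0.05, 0.60, 0.30, 0.05])      # model's distribution for the next token
    print(token_error(probs, vocab.index("chat")))  # ~1.20: the model under-predicted "chat"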

Speech recognition

Speech recognition is a complex task that involves converting spoken language into written text. In the context of error-driven learning, the speech recognition model learns from the mistakes it makes during the recognition process. When an error is encountered, the model updates its internal parameters to avoid making the same mistake in the future. This iterative process of learning from errors helps improve the model’s performance over time. [13]

Dialogue systems

Dialogue systems are a popular NLP task, as they have promising real-life applications. They are also complex tasks, since they combine many underlying NLP tasks that are themselves active areas of study.

In the context of error-driven learning, the dialogue system learns from the mistakes it makes during the dialogue process. When an error is encountered, the model updates its internal parameters to avoid making the same mistake in the future. This iterative process of learning from errors helps improve the model’s performance over time. [7]

Advantages

Error-driven learning has several advantages over other types of machine learning algorithms:

Limitations

Although error-driven learning has its advantages, these algorithms also have the following limitations:

See also

References

  1. Sadre, Ramin; Pras, Aiko (2009-06-19). Scalability of Networks and Services: Third International Conference on Autonomous Infrastructure, Management and Security, AIMS 2009, Enschede, The Netherlands, June 30 – July 2, 2009, Proceedings. Springer. ISBN 978-3-642-02627-0.
  2. Hoppe, Dorothée B.; Hendriks, Petra; Ramscar, Michael; van Rij, Jacolien (2022-10-01). "An exploration of error-driven learning in simple two-layer networks from a discriminative learning perspective". Behavior Research Methods. 54 (5): 2221–2251. doi:10.3758/s13428-021-01711-5. ISSN 1554-3528. PMC 9579095. PMID 35032022.
  3. O'Reilly, Randall C. (1996-07-01). "Biologically Plausible Error-Driven Learning Using Local Activation Differences: The Generalized Recirculation Algorithm". Neural Computation. 8 (5): 895–938. doi:10.1162/neco.1996.8.5.895. ISSN 0899-7667.
  4. Mohammad, Saif; Pedersen, Ted (2004). "Combining lexical and syntactic features for supervised word sense disambiguation". Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004.
  5. Florian, Radu; et al. (2003). "Named entity recognition through classifier combination". Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
  6. Rozovskaya, Alla; Roth, Dan (2016). "Grammatical error correction: Machine translation and classifiers". Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  7. Iosif, Elias; Klasinas, Ioannis; Athanasopoulou, Georgia; Palogiannidi, Elisavet; Georgiladakis, Spiros; Louka, Katerina; Potamianos, Alexandros (2018-01-01). "Speech understanding for spoken dialogue systems: From corpus harvesting to grammar rule induction". Computer Speech & Language. 47: 272–297. doi:10.1016/j.csl.2017.08.002. ISSN 0885-2308.
  8. Bengio, Y. (2009). "Learning deep architectures for AI". Foundations and Trends in Machine Learning. 2 (1): 1–127.
  9. Voulodimos, Athanasios; Doulamis, Nikolaos; Doulamis, Anastasios; Protopapadakis, Eftychios (2018-02-01). "Deep Learning for Computer Vision: A Brief Review". Computational Intelligence and Neuroscience. 2018: e7068349. doi:10.1155/2018/7068349. ISSN 1687-5265. PMC 5816885. PMID 29487619.
  10. Chang, Haw-Shiuan; Vembu, Shankar; Mohan, Sunil; Uppaal, Rheeya; McCallum, Andrew (2020-09-01). "Using error decay prediction to overcome practical issues of deep active learning for named entity recognition". Machine Learning. 109 (9): 1749–1778. arXiv:1911.07335. doi:10.1007/s10994-020-05897-1. ISSN 1573-0565.
  11. Gao, Wenchao; Li, Yu; Guan, Xiaole; Chen, Shiyu; Zhao, Shanshan (2022-08-25). "Research on Named Entity Recognition Based on Multi-Task Learning and Biaffine Mechanism". Computational Intelligence and Neuroscience. 2022: e2687615. doi:10.1155/2022/2687615. ISSN 1687-5265. PMC 9436550. PMID 36059424.
  12. Tan, Zhixing; Wang, Shuo; Yang, Zonghan; Chen, Gang; Huang, Xuancheng; Sun, Maosong; Liu, Yang (2020-01-01). "Neural machine translation: A review of methods, resources, and tools". AI Open. 1: 5–21. arXiv:2012.15515. doi:10.1016/j.aiopen.2020.11.001. ISSN 2666-6510.
  13. Thakur, A.; Ahuja, L.; Vashisth, R.; Simon, R. (2023). "NLP & AI Speech Recognition: An Analytical Review". 2023 10th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India. pp. 1390–1396.
  14. Ajila, Samuel A.; Lung, Chung-Horng; Das, Anurag (2022-06-01). "Analysis of error-based machine learning algorithms in network anomaly detection and categorization". Annals of Telecommunications. 77 (5): 359–370. doi:10.1007/s12243-021-00836-0. ISSN 1958-9395.