Margin-infused relaxed algorithm

Last updated July 04, 2024

Margin-infused relaxed algorithm (MIRA)^[1] is a machine learning algorithm, an online algorithm for multiclass classification problems. It is designed to learn a set of parameters (vector or matrix) by processing all the given training examples one-by-one and updating the parameters according to each training example, so that the current training example is classified correctly with a margin against incorrect classifications at least as large as their loss.^[2] The change of the parameters is kept as small as possible.

A two-class version called binary MIRA^[1] simplifies the algorithm by not requiring the solution of a quadratic programming problem (see below). When used in a one-vs-all configuration, binary MIRA can be extended to a multiclass learner that approximates full MIRA, but may be faster to train.

The flow of the algorithm^[3]^[4] looks as follows:

Algorithm MIRA   Input: Training examples  $T=\{x_{i},y_{i}\}$ Output: Set of parameters  $w$

 $i$  ← 0,  $w^{(0)}$  ← 0   for $n$  ← 1 to $N$ for $t$  ← 1 to $|T|$  $w^{(i+1)}$  ← update  $w^{(i)}$  according to  $\{x_{t},y_{t}\}$  $i$  ←  $i+1$ end forend forreturn ${\frac {\sum _{j=1}^{N\times |T|}w^{(j)}}{N\times |T|}}$

"←" denotes assignment. For instance, "largest←item" means that the value of largest changes to the value of item.
"return" terminates the algorithm and outputs the following value.

The update step is then formalized as a quadratic programming ^[2] problem: Find $min\|w^{(i+1)}-w^{(i)}\|$ , so that $score(x_{t},y_{t})-score(x_{t},y')\geq L(y_{t},y')\ \forall y'$ , i.e. the score of the current correct training $y$ must be greater than the score of any other possible $y'$ by at least the loss (number of errors) of that $y'$ in comparison to $y$ .

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use. 5–12–23

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

When classification is performed by a computer, statistical methods are normally used to develop the algorithm.

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM). It was invented by John Platt in 1998 at Microsoft Research. SMO is widely used for training support vector machines and is implemented by the popular LIBSVM tool. The publication of the SMO algorithm in 1998 has generated a lot of excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party QP solvers.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involves predicting structured objects, rather than scalar discrete or real values.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).

Jubatus is an open-source online machine learning and distributed computing framework developed at Nippon Telegraph and Telephone and Preferred Infrastructure. Its features include classification, recommendation, regression, anomaly detection and graph mining. It supports many client languages, including C++, Java, Ruby and Python. It uses Iterative Parameter Mixture for distributed machine learning.

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into ensembles.

Domain adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning a model from a source data distribution and applying that model on a different target data distribution. For instance, one of the tasks of the common spam filtering problem consists in adapting a model from one user to a new user who receives significantly different emails. Domain adaptation has also been shown to be beneficial to learning unrelated sources. Note that, when more than one source distribution is available the problem is referred to as multi-source domain adaptation.

Weak supervision is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, it can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.

References

1 2 Crammer, Koby; Singer, Yoram (2003). "Ultraconservative Online Algorithms for Multiclass Problems". Journal of Machine Learning Research . 3: 951–991.
1 2 McDonald, Ryan; Crammer, Koby; Pereira, Fernando (2005). "Online Large-Margin Training of Dependency Parsers" (PDF). Proceedings of the 43rd Annual Meeting of the ACL. Association for Computational Linguistics. pp. 91–98.
↑ Watanabe, T. et al (2007): "Online Large Margin Training for Statistical Machine Translation". In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 764–773.
↑ Bohnet, B. (2009): Efficient Parsing of Syntactic and Semantic Dependency Structures. Proceedings of Conference on Natural Language Learning (CoNLL), Boulder, 67–72.

External links

adMIRAble – MIRA implementation in C++
Miralium – MIRA implementation in Java
MIRA implementation for Mahout in Hadoop

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[crammer-singer2003-1] 1 2 Crammer, Koby; Singer, Yoram (2003). "Ultraconservative Online Algorithms for Multiclass Problems". Journal of Machine Learning Research . 3: 951–991.

[mcdonald-etal2005-2] 1 2 McDonald, Ryan; Crammer, Koby; Pereira, Fernando (2005). "Online Large-Margin Training of Dependency Parsers" (PDF). Proceedings of the 43rd Annual Meeting of the ACL. Association for Computational Linguistics. pp. 91–98.

[3] Watanabe, T. et al (2007): "Online Large Margin Training for Statistical Machine Translation". In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 764–773.

[4] Bohnet, B. (2009): Efficient Parsing of Syntactic and Semantic Dependency Structures. Proceedings of Conference on Natural Language Learning (CoNLL), Boulder, 67–72.

[1]

[2]

[3]

[4]