One-class classification

Last updated

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to identify objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, [1] although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, [2] [3] [4] motor failure prediction, [5] or the operational status of a nuclear plant as 'normal': [6] In this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known.

Contents

While many of the above approaches focus on the case of removing a small number of outliers or anomalies, one can also learn the other extreme, where the single class covers a small coherent subset of the data, using an information bottleneck approach. [7]

Overview

The term one-class classification (OCC) was coined by Moya & Hush (1996) [8] and many applications can be found in scientific literature, for example outlier detection, anomaly detection, novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes. [9]

Introduction

The hypersphere containing the target data having center c and radius r. Objects on the boundary are support vectors, and two objects lie outside the boundary having slack greater than 0. One-class data description TAX.png
The hypersphere containing the target data having center c and radius r. Objects on the boundary are support vectors, and two objects lie outside the boundary having slack greater than 0.

SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r, and center c) consisting of all the data points. [10] This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form,

However, the above formulation is highly restrictive, and is sensitive to the presence of outliers. Therefore, a flexible formulation, that allow for the presence of outliers is formulated as shown below,

From Karush-Kuhn-Tucker (KKT) optimality conditions, we get

where the 's are the solution to the following optimization problem:

subject to,

The introduction of kernel function provide additional flexibility to the One-class SVM (OSVM) algorithm. [11]

PU (Positive Unlabeled) learning

A similar problem is PU learning, in which a binary classifier is learned in a semi-supervised way from only positive and unlabeled sample points. [12]

In PU learning, two sets of examples are assumed to be available for training: the positive set and a mixed set, which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semisupervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text, [13] [14] [15] time series, [16] bioinformatics tasks, [17] [18] and Remote-Sensing Data. [19]

Approaches

Several approaches have been proposed to solve one-class classification (OCC). The approaches can be distinguished into three main categories, density estimation, boundary methods, and reconstruction methods. [6]

Density estimation methods

Density estimation methods rely on estimating the density of the data points, and set the threshold. These methods rely on assuming distributions, such as Gaussian, or a Poisson distribution. Following which discordancy tests can be used to test the new objects. These methods are robust to scale variance.

Gaussian model [20] is one of the simplest methods to create one-class classifiers. Due to Central Limit Theorem (CLT), [21] these methods work best when large number of samples are present, and they are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by:

Where, is the mean and is the covariance matrix. Computing the inverse of covariance matrix () is the costliest operation, and in the cases where the data is not scaled properly, or data has singular directions pseudo-inverse is used to approximate the inverse, and is calculated as . [22]

Boundary methods

Boundary methods focus on setting boundaries around a few set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance. K-centers method, NN-d, and SVDD are some of the key examples.

K-centers

In K-center algorithm, [23] small balls with equal radius are placed to minimize the maximum distance of all minimum distances between training objects and the centers. Formally, the following error is minimized,

The algorithm uses forward search method with random initialization, where the radius is determined by the maximum distance of the object, any given ball should capture. After the centers are determined, for any given test object the distance can be calculated as,

Reconstruction methods

Reconstruction methods use prior knowledge and generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are, k-means clustering, learning vector quantization, self-organizing maps, etc.

Applications

Document classification

The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples, however studies have shown there are many valid reasons for using only positive examples. When the SVM algorithm is modified to only use positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a web browser's sites of interest based only off of the user's browsing history.

Biomedical studies

One-class classification can be particularly useful in biomedical studies where often data from other classes can be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). [24] The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. [25] To apply typicality to one-class classification for biomedical studies, each new observation, , is compared to the target class, , and identified as an outlier or a member of the target class. [24]

Unsupervised Concept Drift Detection

One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept is referred to as the fixed probability distribution which data is drawn from. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initial concept or not. However, unsupervised drift detection monitors the flow of data, and signals a drift if there is a significant amount of change or anomalies. Unsupervised concept drift detection can be identified as the continuous form of one-class classification. [26] One-class classifiers are used for detecting concept drifts. [27]

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Machine learning task

Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labeled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning algorithms is learning a function that maps feature vectors (inputs) to labels (output), based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

<span class="mw-page-title-main">Support vector machine</span> Set of methods for supervised statistical learning

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

<span class="mw-page-title-main">Naive Bayes classifier</span> Probabilistic classification algorithm

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels.

<span class="mw-page-title-main">Outlier</span> Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are sometimes excluded from the data set. An outlier can be an indication of exciting possibility, but can also cause serious problems in statistical analyses.

In statistics, a studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. It is a form of a Student's t-statistic, with the estimate of error varying between points.

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals on the real line.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

<span class="mw-page-title-main">Kernel method</span> Class of algorithms for pattern analysis

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). Kernel methods are types of algorithms that are used for pattern analysis. These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the Representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

The constellation model is a probabilistic, generative model for category-level object recognition in computer vision. Like other part-based models, the constellation model attempts to represent an object class by a set of N parts under mutual geometric constraints. Because it considers the geometric relationship between different parts, the constellation model differs significantly from appearance-only, or "bag-of-words" representation models, which explicitly disregard the location of image features.

Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM). It was invented by John Platt in 1998 at Microsoft Research. SMO is widely used for training support vector machines and is implemented by the popular LIBSVM tool. The publication of the SMO algorithm in 1998 has generated a lot of excitement in the SVM community, as previously available methods for SVM training were much more complex and required expensive third-party QP solvers.

Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.

In machine learning and data mining, a string kernel is a kernel function that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the similarity of pairs of strings: the more similar two strings a and b are, the higher the value of a string kernel K(a, b) will be.

In machine learning, a Ranking SVM is a variant of the support vector machine algorithm, which is used to solve certain ranking problems. The ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that Ranking SVM also can be used to solve other problems such as Rank SIFT.

Geometric feature learning is a technique combining machine learning and computer vision to solve visual tasks. The main goal of this method is to find a set of representative features of geometric form to represent an object by collecting geometric features from images and learning them using efficient machine learning methods. Humans solve visual tasks and can give fast response to the environment by extracting perceptual information from what they see. Researchers simulate humans' ability of recognizing objects to solve computer vision problems. For example, M. Mata et al.(2002) applied feature learning techniques to the mobile robot navigation tasks in order to avoid obstacles. They used genetic algorithms for learning features and recognizing objects (figures). Geometric feature learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually some extracting features of images. Through learning, some hypothesis of the next action are given and according to the probability of each hypothesis give a most probable action. This technique is widely used in the area of artificial intelligence.

<span class="mw-page-title-main">Foreground detection</span>

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.

<span class="mw-page-title-main">Multiple kernel learning</span> Set of machine learning methods

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

<span class="mw-page-title-main">Manifold regularization</span>

In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

<span class="mw-page-title-main">Weak supervision</span> Branch of machine learning

Weak supervision, also called semi-supervised learning, is a branch of machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning and supervised learning. Semi-supervised learning aims to alleviate the issue of having limited amounts of labeled data available for training.

References

  1. Oliveri P (August 2017). "Class-modelling in food analytical chemistry: Development, sampling, optimisation and validation issues - A tutorial". Analytica Chimica Acta. 982: 9–19. doi:10.1016/j.aca.2017.05.013. hdl: 11567/881059 . PMID   28734370.
  2. Japkowicz N, Myers C, Gluck M (1995). "A Novelty Detection Approach to Classification". pp. 518–523. CiteSeerX   10.1.1.40.3663 .
  3. Japkowicz N (1999). Concept-Learning in the Absence of Counter-Examples:An Autoassociation-Based Approach to Classification (Thesis). Rutgers University.
  4. Japkowicz N (2001). "Supervised Versus Unsupervised Binary-Learning by Feedforward Neural Networks" (PDF). Machine Learning. 42: 97–122. doi: 10.1023/A:1007660820062 . S2CID   7298189.
  5. Petsche T, Marcantonio A, Darken C, Hanson S, Kuhn G, Santoso I (1996). "A Neural Network Autoassociator for Induction Motor Failure Prediction" (PDF). NIPS.
  6. 1 2 Tax D (2001). One-class classification: Concept-learning in the absence of counter-examples (PDF) (Ph.D. thesis). The Netherlands: University of Delft.
  7. Crammer, Koby (2004). "A needle in a haystack: local one-class optimization". ICML Proceedings of the Twenty-First International Conference on Machine Learning: 26. doi:10.1145/1015330.1015399. ISBN   978-1-58113-838-2. S2CID   8736254.
  8. Moya, M.; Hush, D. (1996). "Network constraints and multi- objective optimization for one-class classification". Neural Networks. 9 (3): 463–474. doi:10.1016/0893-6080(95)00120-4.
  9. Rodionova OY, Oliveri P, Pomerantsev AL (2016-12-15). "Rigorous and compliant approaches to one-class classification". Chemometrics and Intelligent Laboratory Systems. 159: 89–96. doi:10.1016/j.chemolab.2016.10.002. hdl: 11567/864539 .
  10. Zineb, Noumir; Honeine, Paul; Richard, Cedue (2012). "On simple one-class classification methods". IEEE International Symposium on Information Theory Proceedings. IEEE, 2012.
  11. Khan, Shehroz S.; Madden, Michael G. (2010). Coyle, Lorcan; Freyne, Jill (eds.). "A Survey of Recent Trends in One Class Classification". Artificial Intelligence and Cognitive Science. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 6206: 188–197. doi:10.1007/978-3-642-17080-5_21. hdl: 10379/1472 . ISBN   978-3-642-17080-5. S2CID   36784649.
  12. Liu, Bing (2007). Web Data Mining. Springer. pp. 165–178.
  13. Bing Liu; Wee Sun Lee; Philip S. Yu & Xiao-Li Li (2002). Partially supervised classification of text documents. ICML. pp. 8–12.
  14. Hwanjo Yu; Jiawei Han; Kevin Chen-Chuan Chang (2002). PEBL: positive example based learning for web page classification using SVM. ACM SIGKDD.
  15. Xiao-Li Li & Bing Liu (2003). Learning to classify text using positive and unlabeled data. IJCAI.
  16. Minh Nhut Nguyen; Xiao-Li Li & See-Kiong Ng (2011). Positive Unlabeled Learning for Time Series Classification. IJCAI.
  17. Peng Yang; Xiao-Li Li; Jian-Ping Mei; Chee-Keong Kwoh & See-Kiong Ng (2012). Positive-Unlabeled Learning for Disease Gene Identification. Bioinformatics, Vol 28(20).
  18. Bugnon, L. A.; Yones, C.; Milone, D. H. & Stegmayer, G. (2020). "Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning". Oxford Bioinformatics. 22 (3). doi:10.1093/bib/bbaa184. PMID   32814347.
  19. Li, W.; Guo, Q.; Elkan, C. (February 2011). "A Positive and Unlabeled Learning Algorithm for One-Class Classification of Remote-Sensing Data". IEEE Transactions on Geoscience and Remote Sensing. 49 (2): 717–725. Bibcode:2011ITGRS..49..717L. doi:10.1109/TGRS.2010.2058578. ISSN   0196-2892. S2CID   267120.
  20. Bishop, Christopher M.; Bishop, Professor of Neural Computing Christopher M. (1995-11-23). Neural Networks for Pattern Recognition. Clarendon Press. ISBN   978-0-19-853864-6.
  21. Ullman, Neil R (2017-01-01). Elementary statistics.[ dead link ]
  22. "Introduction to Applied Mathematics". SIAM Bookstore. Retrieved 2019-04-29.
  23. Ypma, Alexander; Duin, Robert P. W. (1998). Niklasson, Lars; Bodén, Mikael; Ziemke, Tom (eds.). "Support objects for domain approximation". Icann 98. Perspectives in Neural Computing. Springer London: 719–724. doi:10.1007/978-1-4471-1599-1_110. ISBN   978-1-4471-1599-1.
  24. 1 2 Irigoien I, Sierra B, Arenas C (2014). "Towards application of one-class classification methods to medical data". TheScientificWorldJournal. 2014: 730712. doi: 10.1155/2014/730712 . PMC   3980920 . PMID   24778600.
  25. Irigoien I, Arenas C (July 2008). "INCA: new statistic for estimating the number of clusters and identifying atypical units". Statistics in Medicine. 27 (15): 2948–73. doi:10.1002/sim.3143. PMID   18050154. S2CID   24791212.
  26. Gözüaçık, Ömer; Can, Fazli (November 2020). "Concept learning using one-class classifiers for implicit drift detection in evolving data streams". Artificial Intelligence Review. 54 (5): 3725–3747. doi:10.1007/s10462-020-09939-x. hdl: 11693/77042 . S2CID   229506136.
  27. Krawczyk, Bartosz; Woźniak, Michał (2015). "One-class classifiers with incremental learning and forgetting for data streams with concept drift". Soft Computing. 19 (12): 3387–3400. doi: 10.1007/s00500-014-1492-5 . S2CID   207011971.