Multiple instance learning

Last updated

In machine learning, multiple-instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept.

Contents

Babenko (2008) [1] gives a simple example for MIL. Imagine several people, and each of them has a key chain that contains few keys. Some of these people are able to enter a certain room, and some aren't. The task is then to predict whether a certain key or a certain key chain can get you into that room. To solve this problem we need to find the exact key that is common for all the "positive" key chains. If we can correctly identify this key, we can also correctly classify an entire key chain - positive if it contains the required key, or negative if it doesn't.

Machine learning

Depending on the type and variation in training data, machine learning can be roughly categorized into three frameworks: supervised learning, unsupervised learning, and reinforcement learning. Multiple instance learning (MIL) falls under the supervised learning framework, where every training instance has a label, either discrete or real valued. MIL deals with problems with incomplete knowledge of labels in training sets. More precisely, in multiple-instance learning, the training set consists of labeled "bags", each of which is a collection of unlabeled instances. A bag is positively labeled if at least one instance in it is positive, and is negatively labeled if all instances in it are negative. The goal of the MIL is to predict the labels of new, unseen bags.

History

Keeler et al., [2] in his work in the early 1990s was the first one to explore the area of MIL. The actual term multi-instance learning was introduced in the middle of the 1990s, by Dietterich et al. while they were investigating the problem of drug activity prediction. [3] They tried to create a learning system that could predict whether new molecule was qualified to make some drug, or not, through analyzing a collection of known molecules. Molecules can have many alternative low-energy states, but only one, or some of them, are qualified to make a drug. The problem arose because scientists could only determine if molecule is qualified, or not, but they couldn't say exactly which of its low-energy shapes are responsible for that.

One of the proposed ways to solve this problem was to use supervised learning, and regard all the low-energy shapes of the qualified molecule as positive training instances, while all of the low-energy shapes of unqualified molecules as negative instances. Dietterich et al. showed that such method would have a high false positive noise, from all low-energy shapes that are mislabeled as positive, and thus wasn't really useful. [3] Their approach was to regard each molecule as a labeled bag, and all the alternative low-energy shapes of that molecule as instances in the bag, without individual labels. Thus formulating multiple-instance learning.

Solution to the multiple instance learning problem that Dietterich et al. proposed is the axis-parallel rectangle (APR) algorithm. [3] It attempts to search for appropriate axis-parallel rectangles constructed by the conjunction of the features. They tested the algorithm on Musk dataset, [4] [5] [ dubious discuss ] which is a concrete test data of drug activity prediction and the most popularly used benchmark in multiple-instance learning. APR algorithm achieved the best result, but APR was designed with Musk data in mind.

Problem of multi-instance learning is not unique to drug finding. In 1998, Maron and Ratan found another application of multiple instance learning to scene classification in machine vision, and devised Diverse Density framework. [6] Given an image, an instance is taken to be one or more fixed-size subimages, and the bag of instances is taken to be the entire image. An image is labeled positive if it contains the target scene - a waterfall, for example - and negative otherwise. Multiple instance learning can be used to learn the properties of the subimages which characterize the target scene. From there on, these frameworks have been applied to a wide spectrum of applications, ranging from image concept learning and text categorization, to stock market prediction.

Examples

Take image classification for example Amores (2013). Given an image, we want to know its target class based on its visual content. For instance, the target class might be "beach", where the image contains both "sand" and "water". In MIL terms, the image is described as a bag, where each is the feature vector (called instance) extracted from the corresponding -th region in the image and is the total regions (instances) partitioning the image. The bag is labeled positive ("beach") if it contains both "sand" region instances and "water" region instances.

Examples of where MIL is applied are:

Numerous researchers have worked on adapting classical classification techniques, such as support vector machines or boosting, to work within the context of multiple-instance learning.

Definitions

If the space of instances is , then the set of bags is the set of functions , which is isomorphic to the set of multi-subsets of . For each bag and each instance , is viewed as the number of times occurs in . [8] Let be the space of labels, then a "multiple instance concept" is a map . The goal of MIL is to learn such a concept. The remainder of the article will focus on binary classification, where .

Assumptions

Most of the work on multiple instance learning, including Dietterich et al. (1997) and Maron & Lozano-Pérez (1997) early papers, [3] [9] make the assumption regarding the relationship between the instances within a bag and the class label of the bag. Because of its importance, that assumption is often called standard MI assumption.

Standard assumption

The standard assumption takes each instance to have an associated label which is hidden to the learner. The pair is called an "instance-level concept". A bag is now viewed as a multiset of instance-level concepts, and is labeled positive if at least one of its instances has a positive label, and negative if all of its instances have negative labels. Formally, let be a bag. The label of is then . Standard MI assumption is asymmetric, which means that if the positive and negative labels are reversed, the assumption has a different meaning. Because of that, when we use this assumption, we need to be clear which label should be the positive one.

Standard assumption might be viewed as too strict, and therefore in the recent years, researchers tried to relax that position, which gave rise to other more loose assumptions. [10] Reason for this is the belief that standard MI assumption is appropriate for the Musk dataset, but since MIL can be applied to numerous other problems, some different assumptions could probably be more appropriate. Guided by that idea, Weidmann [11] formulated a hierarchy of generalized instance-based assumptions for MIL. It consists of the standard MI assumption and three types of generalized MI assumptions, each more general than the last, in the sense that the former can be obtained as a specific choice of parameters of the latter, standard presence-based threshold-based count-based, with the count-based assumption being the most general and the standard assumption being the least general. (Note however, that any bag meeting the count-based assumption meets the threshold-based assumption which in turn meets the presence-based assumption which, again in turn, meet the standard assumption. In that sense it is also correct to state that the standard assumption is the weakest, hence most general, and the count-based assumption is the strongest, hence least general.) One would expect an algorithm which performs well under one of these assumptions to perform at least as well under the less general assumptions.

Presence-, threshold-, and count-based assumptions

The presence-based assumption is a generalization of the standard assumption, wherein a bag must contain all instances that belong to a set of required instance-level concepts in order to be labeled positive. Formally, let be the set of required instance-level concepts, and let denote the number of times the instance-level concept occurs in the bag . Then for all . Note that, by taking to contain only one instance-level concept, the presence-based assumption reduces to the standard assumption.

A further generalization comes with the threshold-based assumption, where each required instance-level concept must occur not only once in a bag, but some minimum (threshold) number of times in order for the bag to be labeled positive. With the notation above, to each required instance-level concept is associated a threshold . For a bag , for all .

The count-based assumption is a final generalization which enforces both lower and upper bounds for the number of times a required concept can occur in a positively labeled bag. Each required instance-level concept has a lower threshold and upper threshold with . A bag is labeled according to for all .

GMIL assumption

Scott, Zhang, and Brown (2005) [12] describe another generalization of the standard model, which they call "generalized multiple instance learning" (GMIL). The GMIL assumption specifies a set of required instances . A bag is labeled positive if it contains instances which are sufficiently close to at least of the required instances . [12] Under only this condition, the GMIL assumption is equivalent to the presence-based assumption. [8] However, Scott et al. describe a further generalization in which there is a set of attraction points and a set of repulsion points . A bag is labeled positive if and only if it contains instances which are sufficiently close to at least of the attraction points and are sufficiently close to at most of the repulsion points. [12] This condition is strictly more general than the presence-based, though it does not fall within the above hierarchy.

Collective assumption

In contrast to the previous assumptions where the bags were viewed as fixed, the collective assumption views a bag as a distribution over instances , and similarly view labels as a distribution over instances. The goal of an algorithm operating under the collective assumption is then to model the distribution .

Since is typically considered fixed but unknown, algorithms instead focus on computing the empirical version: , where is the number of instances in bag . Since is also typically taken to be fixed but unknown, most collective-assumption based methods focus on learning this distribution, as in the single-instance version. [8] [10]

While the collective assumption weights every instance with equal importance, Foulds extended the collective assumption to incorporate instance weights. The weighted collective assumption is then that , where is a weight function over instances and . [8]

Algorithms

MIL Framework Mildiag.jpg
MIL Framework

There are two major flavors of algorithms for Multiple Instance Learning: instance-based and metadata-based, or embedding-based algorithms. The term "instance-based" denotes that the algorithm attempts to find a set of representative instances based on an MI assumption and classify future bags from these representatives. By contrast, metadata-based algorithms make no assumptions about the relationship between instances and bag labels, and instead try to extract instance-independent information (or metadata) about the bags in order to learn the concept. [10] For a survey of some of the modern MI algorithms see Foulds and Frank. [8]

Instance-based algorithms

The earliest proposed MI algorithms were a set of "iterated-discrimination" algorithms developed by Dietterich et al., and Diverse Density developed by Maron and Lozano-Pérez. [3] [9] Both of these algorithms operated under the standard assumption.

Iterated-discrimination

Broadly, all of the iterated-discrimination algorithms consist of two phases. The first phase is to grow an axis parallel rectangle (APR) which contains at least one instance from each positive bag and no instances from any negative bags. This is done iteratively: starting from a random instance in a positive bag, the APR is expanded to the smallest APR covering any instance in a new positive bag . This process is repeated until the APR covers at least one instance from each positive bag. Then, each instance contained in the APR is given a "relevance", corresponding to how many negative points it excludes from the APR if removed. The algorithm then selects candidate representative instances in order of decreasing relevance, until no instance contained in a negative bag is also contained in the APR. The algorithm repeats these growth and representative selection steps until convergence, where APR size at each iteration is taken to be only along candidate representatives.

After the first phase, the APR is thought to tightly contain only the representative attributes. The second phase expands this tight APR as follows: a Gaussian distribution is centered at each attribute and a looser APR is drawn such that positive instances will fall outside the tight APR with fixed probability. [4] Though iterated discrimination techniques work well with the standard assumption, they do not generalize well to other MI assumptions. [8]

Diverse Density

In its simplest form, Diverse Density (DD) assumes a single representative instance as the concept. This representative instance must be "dense" in that it is much closer to instances from positive bags than from negative bags, as well as "diverse" in that it is close to at least one instance from each positive bag.

Let be the set of positively labeled bags and let be the set of negatively labeled bags, then the best candidate for the representative instance is given by , where the diverse density under the assumption that bags are independently distributed given the concept . Letting denote the jth instance of bag i, the noisy-or model gives:

is taken to be the scaled distance where is the scaling vector. This way, if every positive bag has an instance close to , then will be high for each , but if any negative bag has an instance close to , will be low. Hence, is high only if every positive bag has an instance close to and no negative bags have an instance close to . The candidate concept can be obtained through gradient methods. Classification of new bags can then be done by evaluating proximity to . [9] Though Diverse Density was originally proposed by Maron et al. in 1998, more recent MIL algorithms use the DD framework, such as EM-DD in 2001 [13] and DD-SVM in 2004, [14] and MILES in 2006 [8]

A number of single-instance algorithms have also been adapted to a multiple-instance context under the standard assumption, including

Post 2000, there was a movement away from the standard assumption and the development of algorithms designed to tackle the more general assumptions listed above. [10]

  • Weidmann [11] proposes a Two-Level Classification (TLC) algorithm to learn concepts under the count-based assumption. The first step tries to learn instance-level concepts by building a decision tree from each instance in each bag of the training set. Each bag is then mapped to a feature vector based on the counts in the decision tree. In the second step, a single-instance algorithm is run on the feature vectors to learn the concept
  • Scott et al. [12] proposed an algorithm, GMIL-1, to learn concepts under the GMIL assumption in 2005. GMIL-1 enumerates all axis-parallel rectangles in the original space of instances, and defines a new feature space of Boolean vectors. A bag is mapped to a vector in this new feature space, where if APR covers , and otherwise. A single-instance algorithm can then be applied to learn the concept in this new feature space.

Because of the high dimensionality of the new feature space and the cost of explicitly enumerating all APRs of the original instance space, GMIL-1 is inefficient both in terms of computation and memory. GMIL-2 was developed as a refinement of GMIL-1 in an effort to improve efficiency. GMIL-2 pre-processes the instances to find a set of candidate representative instances. GMIL-2 then maps each bag to a Boolean vector, as in GMIL-1, but only considers APRs corresponding to unique subsets of the candidate representative instances. This significantly reduces the memory and computational requirements. [8]

  • Xu (2003) [10] proposed several algorithms based on logistic regression and boosting methods to learn concepts under the collective assumption.

Metadata-based (or embedding-based) algorithms

By mapping each bag to a feature vector of metadata, metadata-based algorithms allow the flexibility of using an arbitrary single-instance algorithm to perform the actual classification task. Future bags are simply mapped (embedded) into the feature space of metadata and labeled by the chosen classifier. Therefore, much of the focus for metadata-based algorithms is on what features or what type of embedding leads to effective classification. Note that some of the previously mentioned algorithms, such as TLC and GMIL could be considered metadata-based.

They define two variations of kNN, Bayesian-kNN and citation-kNN, as adaptations of the traditional nearest-neighbor problem to the multiple-instance setting.

Generalizations

So far this article has considered multiple instance learning exclusively in the context of binary classifiers. However, the generalizations of single-instance binary classifiers can carry over to the multiple-instance case.

See also

Related Research Articles

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

Description logics (DL) are a family of formal knowledge representation languages. Many DLs are more expressive than propositional logic but less expressive than first-order logic. In contrast to the latter, the core reasoning problems for DLs are (usually) decidable, and efficient decision procedures have been designed and implemented for these problems. There are general, spatial, temporal, spatiotemporal, and fuzzy description logics, and each description logic features a different balance between expressive power and reasoning complexity by supporting different sets of mathematical constructors.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. Inherently, Multi-task learning is a multi-objective optimization problem having trade-offs between different tasks. Early versions of MTL were called "hints".

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics.

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees. Random forests correct for decision trees' habit of overfitting to their training set.

Empirical risk minimization is a principle in statistical learning theory which defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large numbers; more specifically, we cannot know exactly how well a predictive algorithm will work in practice because we do not know the true distribution of the data, but we can instead estimate and optimize the performance of the algorithm on a known set of training data. The performance over the known set of training data is referred to as the "empirical risk".

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the Representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting.

The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function.

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may be an over-complete spanning set. This problem setup also allows the dimensionality of the signals being represented to be higher than the one of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.

In statistical learning theory, a learnable function class is a set of functions for which an algorithm can be devised to asymptotically minimize the expected risk, uniformly over all probability distributions. The concept of learnable classes are closely related to regularization in machine learning, and provides large sample justifications for certain learning algorithms.

Weak supervision is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, it can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.

References

  1. Babenko, Boris. "Multiple instance learning: algorithms and applications." View Article PubMed/NCBI Google Scholar (2008).
  2. Keeler, James D., David E. Rumelhart, and Wee-Kheng Leow. Integrated Segmentation and Recognition of Hand-Printed Numerals. Microelectronics and Computer Technology Corporation, 1991.
  3. 1 2 3 4 5 Dietterich, Thomas G., Richard H. Lathrop, and Tomás Lozano-Pérez. "Solving the multiple instance problem with axis-parallel rectangles." Artificial intelligence 89.1 (1997): 31-71.
  4. 1 2 C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases , Department of Information and Computer Science, University of California, Irvine, CA, 1998.
  5. Wang, Wei-Hong; Du, Yan-ye; Li, Qu; Fang, Zhao-lin (2011). "Credit Evaluation Based on Gene Expression Programming and Clonal Selection". Procedia Engineering. 15: 3759–3763. doi: 10.1016/j.proeng.2011.08.704 .
  6. O. Maron and A.L. Ratan. Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, pp.341–349, 1998.
  7. Minhas, F. u. A. A; Ben-Hur, A (2012). "Multiple instance learning of Calmodulin binding sites". Bioinformatics. 28 (18): i416–i422. doi:10.1093/bioinformatics/bts416. PMC   3436843 . PMID   22962461.
  8. 1 2 3 4 5 6 7 8 9 10 11 Foulds, James, and Eibe Frank. "A review of multi-instance learning assumptions." The Knowledge Engineering Review 25.01 (2010): 1-25.
  9. 1 2 3 Maron, Oded, and Tomás Lozano-Pérez. "A framework for multiple-instance learning." Advances in neural information processing systems (1998): 570-576
  10. 1 2 3 4 5 Xu, X. Statistical learning in multiple instance problems. Master's thesis, University of Waikato (2003).
  11. 1 2 Weidmann, Nils B. "Two-level classification for generalized multi-instance data." Diss. Albert-Ludwigs-Universität, 2003.
  12. 1 2 3 4 Scott, Stephen, Jun Zhang, and Joshua Brown. "On generalized multiple-instance learning." International Journal of Computational Intelligence and Applications 5.01 (2005): 21-35.
  13. Zhang, Qi, and Sally A. Goldman. "EM-DD: An improved multiple-instance learning technique." Advances in neural information processing systems. (2001): 1073 - 80
  14. Chen, Yixin, and James Z. Wang. "Image categorization by learning and reasoning with regions." The Journal of Machine Learning Research 5 (2004): 913-939
  15. Andrews, Stuart, Ioannis Tsochantaridis, and Thomas Hofmann. "Support vector machines for multiple-instance learning." Advances in neural information processing systems (2003). pp 561 - 658
  16. Zhou, Zhi-Hua, and Min-Ling Zhang. "Neural networks for multi-instance learning." Proceedings of the International Conference on Intelligent Information Technology, Beijing, China. (2002). pp 455 - 459
  17. Blockeel, Hendrik, David Page, and Ashwin Srinivasan. "Multi-instance tree learning." Proceedings of the 22nd international conference on Machine learning. ACM, 2005. pp 57- 64
  18. Auer, Peter, and Ronald Ortner. "A boosting approach to multiple instance learning." Machine Learning: ECML 2004. Springer Berlin Heidelberg, 2004. 63-74.
  19. Chen, Yixin; Bi, Jinbo; Wang, J. Z. (2006-12-01). "MILES: Multiple-Instance Learning via Embedded Instance Selection". IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (12): 1931–1947. doi:10.1109/TPAMI.2006.248. ISSN   0162-8828. PMID   17108368. S2CID   18137821.
  20. Cheplygina, Veronika; Tax, David M. J.; Loog, Marco (2015-01-01). "Multiple instance learning with bag dissimilarities". Pattern Recognition. 48 (1): 264–275. arXiv: 1309.5643 . Bibcode:2015PatRe..48..264C. doi:10.1016/j.patcog.2014.07.022. S2CID   17606924.
  21. Wang, Jun, and Jean-Daniel Zucker. "Solving multiple-instance problem: A lazy learning approach." ICML (2000): 1119-25
  22. Zhou, Zhi-Hua, and Min-Ling Zhang. "Multi-instance multi-label learning with application to scene classification." Advances in Neural Information Processing Systems. 2006. pp 1609 - 16
  23. Ray, Soumya, and David Page. "Multiple instance regression." ICML. Vol. 1. 2001. pp 425 - 32

Further reading

Recent reviews of the MIL literature include: