Contrast set learning

Last updated

Contrast set learning is a form of association rule learning that seeks to identify meaningful differences between separate groups by reverse-engineering the key predictors that identify for each particular group. For example, given a set of attributes for a pool of students (labeled by degree type), a contrast set learner would identify the contrasting features between students seeking bachelor's degrees and those working toward PhD degrees.

Contents

Overview

A common practice in data mining is to classify, to look at the attributes of an object or situation and make a guess at what category the observed item belongs to. As new evidence is examined (typically by feeding a training set to a learning algorithm), these guesses are refined and improved. Contrast set learning works in the opposite direction. While classifiers read a collection of data and collect information that is used to place new data into a series of discrete categories, contrast set learning takes the category that an item belongs to and attempts to reverse engineer the statistical evidence that identifies an item as a member of a class. That is, contrast set learners seek rules associating attribute values with changes to the class distribution. [1] They seek to identify the key predictors that contrast one classification from another.

For example, an aerospace engineer might record data on test launches of a new rocket. Measurements would be taken at regular intervals throughout the launch, noting factors such as the trajectory of the rocket, operating temperatures, external pressures, and so on. If the rocket launch fails after a number of successful tests, the engineer could use contrast set learning to distinguish between the successful and failed tests. A contrast set learner will produce a set of association rules that, when applied, will indicate the key predictors of each failed tests versus the successful ones (the temperature was too high, the wind pressure was too high, etc.).

Contrast set learning is a form of association rule learning. [2] Association rule learners typically offer rules linking attributes commonly occurring together in a training set (for instance, people who are enrolled in four-year programs and take a full course load tend to also live near campus). Instead of finding rules that describe the current situation, contrast set learners seek rules that differ meaningfully in their distribution across groups (and thus, can be used as predictors for those groups). [3] For example, a contrast set learner could ask, “What are the key identifiers of a person with a bachelor's degree or a person with a PhD, and how do people with PhD's and bachelor’s degrees differ?”

Standard classifier algorithms, such as C4.5, have no concept of class importance (that is, they do not know if a class is "good" or "bad"). Such learners cannot bias or filter their predictions towards certain desired classes. As the goal of contrast set learning is to discover meaningful differences between groups, it is useful to be able to target the learned rules towards certain classifications. Several contrast set learners, such as MINWAL [4] or the family of TAR algorithms, [5] [6] [7] assign weights to each class in order to focus the learned theories toward outcomes that are of interest to a particular audience. Thus, contrast set learning can be thought of as a form of weighted class learning. [8]

Example: Supermarket Purchases

The differences between standard classification, association rule learning, and contrast set learning can be illustrated with a simple supermarket metaphor. In the following small dataset, each row is a supermarket transaction and each "1" indicates that the item was purchased (a "0" indicates that the item was not purchased):

HamburgerPotatoesFoie GrasOnionsChampagnePurpose of Purchases
11010Cookout
11010Cookout
00101Anniversary
11010Cookout
11001Frat Party

Given this data,

Treatment learning


Treatment learning is a form of weighted contrast-set learning that takes a single desirable group and contrasts it against the remaining undesirable groups (the level of desirability is represented by weighted classes). [5] The resulting "treatment" suggests a set of rules that, when applied, will lead to the desired outcome.

Treatment learning differs from standard contrast set learning through the following constraints:

This focus on simplicity is an important goal for treatment learners. Treatment learning seeks the smallest change that has the greatest impact on the class distribution. [8]

Conceptually, treatment learners explore all possible subsets of the range of values for all attributes. Such a search is often infeasible in practice, so treatment learning often focuses instead on quickly pruning and ignoring attribute ranges that, when applied, lead to a class distribution where the desired class is in the minority. [7]

Example: Boston housing data

The following example demonstrates the output of the treatment learner TAR3 on a dataset of housing data from the city of Boston (a nontrivial public dataset with over 500 examples). In this dataset, a number of factors are collected for each house, and each house is classified according to its quality (low, medium-low, medium-high, and high). The desired class is set to "high", and all other classes are lumped together as undesirable.

The output of the treatment learner is as follows:

Baseline class distribution: low: 29% medlow: 29% medhigh: 21% high: 21%  Suggested Treatment: [PTRATIO=[12.6..16), RM=[6.7..9.78)]  New class distribution: low: 0% medlow: 0% medhigh: 3% high: 97%


With no applied treatments (rules), the desired class represents only 21% of the class distribution. However, if one filters the data set for houses with 6.7 to 9.78 rooms and a neighborhood parent-teacher ratio of 12.6 to 16, then 97% of the remaining examples fall into the desired class (high-quality houses).

Algorithms

There are a number of algorithms that perform contrast set learning. The following subsections describe two examples.

STUCCO

The STUCCO contrast set learner [1] [3] treats the task of learning from contrast sets as a tree search problem where the root node of the tree is an empty contrast set. Children are added by specializing the set with additional items picked through a canonical ordering of attributes (to avoid visiting the same nodes twice). Children are formed by appending terms that follow all existing terms in a given ordering. The formed tree is searched in a breadth-first manner. Given the nodes at each level, the dataset is scanned and the support is counted for each group. Each node is then examined to determine if it is significant and large, if it should be pruned, and if new children should be generated. After all significant contrast sets are located, a post-processor selects a subset to show to the user - the low order, simpler results are shown first, followed by the higher order results which are "surprising and significantly different. [3] "

The support calculation comes from testing a null hypothesis that the contrast set support is equal across all groups (i.e., that contrast set support is independent of group membership). The support count for each group is a frequency value that can be analyzed in a contingency table where each row represents the truth value of the contrast set and each column variable indicates the group membership frequency. If there is a difference in proportions between the contrast set frequencies and those of the null hypothesis, the algorithm must then determine if the differences in proportions represent a relation between variables or if it can be attributed to random causes. This can be determined through a chi-square test comparing the observed frequency count to the expected count.

Nodes are pruned from the tree when all specializations of the node can never lead to a significant and large contrast set. The decision to prune is based on:

TAR3

The TAR3 [6] [9] weighted contrast set learner is based on two fundamental concepts - the lift and support of a rule set.

The lift of a set of rules is the change that some decision makes to a set of examples after imposing that decision (i.e., how the class distribution shifts in response to the imposition of a rule). TAR3 seeks the smallest set of rules which induces the biggest changes in the sum of the weights attached to each class multiplied by the frequency at which each class occurs. The lift is calculated by dividing the score of the set in which the set of rules is imposed by the score of the baseline set (i.e., no rules are applied). Note that by reversing the lift scoring function, the TAR3 learner can also select for the remaining classes and reject the target class.

It is problematic to rely on the lift of a rule set alone. Incorrect or misleading data noise, if correlated with failing examples, may result in an overfitted rule set. Such an overfitted model may have a large lift score, but it does not accurately reflect the prevailing conditions within the dataset. To avoid overfitting, TAR3 utilizes a support threshold and rejects all rules that fall on the wrong side of this threshold. Given a target class, the support threshold is a user-supplied value (usually 0.2) which is compared to the ratio of the frequency of the target class when the rule set has been applied to the frequency of that class in the overall dataset. TAR3 rejects all sets of rules with support lower than this threshold.

By requiring both a high lift and a high support value, TAR3 not only returns ideal rule sets, but also favors smaller sets of rules. The fewer rules adopted, the more evidence that will exist supporting those rules.

The TAR3 algorithm only builds sets of rules from attribute value ranges with a high heuristic value. The algorithm determines which ranges to use by first determining the lift score of each attribute’s value ranges. These individual scores are then sorted and converted into a cumulative probability distribution. TAR3 randomly selects values from this distribution, meaning that low-scoring ranges are unlikely to be selected. To build a candidate rule set, several ranges are selected and combined. These candidate rule sets are then scored and sorted. If no improvement is seen after a user-defined number of rounds, the algorithm terminates and returns the top-scoring rule sets.

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> A paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

<span class="mw-page-title-main">Overfitting</span> Flaw in mathematical modelling

In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitted model is a mathematical model that contains more parameters than can be justified by the data. In a mathematical sense, these parameters represent the degree of a polynomial. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.

<span class="mw-page-title-main">Decision tree</span> Decision support tool

A decision tree is a decision support hierarchical model that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.

Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

<span class="mw-page-title-main">Learning classifier system</span> Paradigm of rule-based machine learning methods

Learning classifier systems, or LCS, are a paradigm of rule-based machine learning methods that combine a discovery component with a learning component. Learning classifier systems seek to identify a set of context-dependent rules that collectively store and apply knowledge in a piecewise manner in order to make predictions. This approach allows complex solution spaces to be broken up into smaller, simpler parts.

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

<span class="mw-page-title-main">ID3 algorithm</span> Decision tree algorithm

In decision tree learning, ID3 is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domain

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".

In information theory and machine learning, information gain is a synonym for Kullback–Leibler divergence; the amount of information gained about a random variable or signal from observing another random variable. However, in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.

In statistics, the frequency or absolute frequency of an event is the number of times the observation has occurred/recorded in an experiment or study. These frequencies are often depicted graphically or in tabular form.

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of several classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.

In data mining and association rule learning, lift is a measure of the performance of a targeting model at predicting or classifying cases as having an enhanced response, measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the baseline average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response. Mathematically,

An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

Active learning is a special case of machine learning in which a learning algorithm can interactively query a user to label new data points with the desired outputs. In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle.

Isolation Forest is an algorithm for data anomaly detection initially developed by Fei Tony Liu in 2008. Isolation Forest detects anomalies using binary trees. The algorithm has a linear time complexity and a low memory requirement, which works well with high-volume data. In essence, the algorithm relies upon the characteristics of anomalies, i.e., being few and different, in order to detect anomalies. No density estimation is performed in the algorithm. The algorithm is different from decision tree algorithms in that only the path-length measure or approximation is being used to generate the anomaly score, no leaf node statistics on class distribution or target value is needed.

References

  1. 1 2 Stephen Bay; Michael Pazzani (2001). "Detecting group differences: Mining contrast sets" (PDF). Data Mining and Knowledge Discovery. 5 (3): 213–246. doi:10.1023/A:1011429418057. S2CID   2941550.
  2. G.I. Webb; S. Butler; D. Newlands (2003). On Detecting Differences Between Groups. KDD'03 Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  3. 1 2 3 Stephen Bay; Michael Pazzani (1999). Detecting change in categorical data: mining contrast sets. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.
  4. C.H. Cai; A.W.C. Fu; C.H. Cheng; W.W. Kwong (1998). Mining association rules with weighted items (PDF). Proceedings of International Database Engineering and Applications Symposium (IDEAS 98).
  5. 1 2 Y. Hu (2003). Treatment learning: Implementation and application (Master's thesis). Department of Electrical Engineering, University of British Columbia.
  6. 1 2 K. Gundy-Burlet; J. Schumann; T. Barrett; T. Menzies (2007). Parametric analysis of ANTARES re-entry guidance algorithms using advanced test generation and data analysis. In 9th International Symposium on Artificial Intelligence, Robotics and Automation in Space.
  7. 1 2 Gregory Gay; Tim Menzies; Misty Davies; Karen Gundy-Burlet (2010). "Automatically Finding the Control Variables for Complex System Behavior" (PDF). Automated Software Engineering. 17 (4).
  8. 1 2 T. Menzies; Y. Hu (2003). "Data Mining for Very Busy People" (PDF). IEEE Computer. 36 (11): 22–29. doi:10.1109/mc.2003.1244531.
  9. J. Schumann; K. Gundy-Burlet; C. Pasareanu; T. Menzies; A. Barrett (2009). Software V&V support by parametric analysis of large software simulation systems. Proceedings of the 2009 IEEE Aerospace Conference.