Information gain ratio

Last updated

In decision tree learning, information gain ratio is a ratio of information gain to the intrinsic information. It was proposed by Ross Quinlan, [1] to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute. [2]

Contents

Information gain is also known as mutual information. [3]

The image shows the information gain of a variable called "year" and shows the result of choosing a year 1 through 12. The information gain would favor this variable as the results would either be definitely positive or negative while also creating multiple leaf nodes, however, the problem is that none of these years will occur again. The next input would be year 13, but there is no branch to year 13 and that is a problem that can be solved with information gain ratio. Information gain ratio will normalize the data using the entropy value of that variable to remove the bias of multi-variable data and variables with multiple nodes compared to variables with a smaller set of nodes. This would remove the odds of the tree in the image from being created.
Information Gain.png

Information gain calculation

Information gain is the reduction in entropy produced from partitioning a set with attributes and finding the optimal candidate that produces the highest value:

where is a random variable and is the entropy of given the value of attribute .

The information gain is equal to the total entropy for an attribute if for each of the attribute values a unique classification can be made for the result attribute. In this case the relative entropies subtracted from the total entropy are 0.

Split information calculation

The split information value for a test is defined as follows:

where is a discrete random variable with possible values and being the number of times that occurs divided by the total count of events where is the set of events.

The split information value is a positive number that describes the potential worth of splitting a branch from a node. This in turn is the intrinsic value that the random variable possesses and will be used to remove the bias in the information gain ratio calculation.

Information gain ratio calculation

The information gain ratio is the ratio between the information gain and the split information value:

Example

Using weather data published by Fordham University, [4] the table was created below:

WEKA weather data
OutlookTemperatureHumidityWindPlay
SunnyHotHighFalseNo
SunnyHotHighTrueNo
OvercastHotHighFalseYes
RainyMildHighFalseYes
RainyCoolNormalFalseYes
RainyCoolNormalTrueNo
OvercastCoolNormalTrueYes
SunnyMildHighFalseNo
SunnyCoolNormalFalseYes
RainyMildNormalFalseYes
SunnyMildNormalFalseYes
OvercastMildHighTrueYes
OvercastHotNormalFalseYes
RainyMildHighTrueNo

Using the table above, one can find the entropy, information gain, split information, and information gain ratio for each variable (outlook, temperature, humidity, and wind). These calculations are shown in the tables below:

Outlook table
OutlookYesNoCount of each groupEntropy
sunny2350.971
overcast4040.000
rainy3250.971
ResultsValues
Information0.694
Overall entropy0.940
Information gain0.247
Split information1.577
Gain ratio0.156
Temperature table
TemperatureYesNoCount of each groupEntropy
hot2241.000
mild4260.918
cool3140.811
ResultsValues
Information0.911
Overall entropy0.940
Information gain0.029
Split information1.557
Gain ratio0.019
Wind table
WindYesNoCount of each groupEntropy
False6280.811
True3361.000
ResultsValues
Information0.892
Overall entropy0.940
Information gain0.048
Split information0.985
Gain ratio0.049
Humidity table
HumidityYesNoCount of each groupEntropy
High3470.985
Normal6170.592
ResultsValues
Information0.788
Overall entropy0.940
Information gain0.152
Split information1.000
Gain ratio0.152

Using the above tables, one can deduce that Outlook has the highest information gain ratio. Next, one must find the statistics for the sub-groups of the Outlook variable (sunny, overcast, and rainy), for this example one will only build the sunny branch (as shown in the table below):

Outlook table
OutlookTemperatureHumidityWindPlay
SunnyHotHighFalseNo
SunnyHotHighTrueNo
SunnyMildHighFalseNo
SunnyCoolNormalFalseYes
SunnyMildNormalTrueYes

One can find the following statistics for the other variables (temperature, humidity, and wind) to see which have the greatest effect on the sunny element of the outlook variable:

Temperature table
TemperatureYesNoCount of each groupEntropy
Hot0220.000
Mild1121.000
Cool1010.000
ResultsValues
Information0.400
Overall entropy0.971
Gain0.571
Split information1.522
Gain ratio0.375
Wind table
WindYesNoCount of each groupEntropy
False1230.918
True1121.000
ResultsValues
Information0.951
Overall entropy0.971
Gain0.020
Split information0.971
Gain ratio0.021
Humidity table
HumidityYesNoCount of each groupEntropy
High0330.000
Normal2020.000
ResultsValues
Information0.000
Overall entropy0.971
Gain0.971
Split information0.971
Gain ratio1.000

Humidity was found to have the highest information gain ratio. One will repeat the same steps as before and find the statistics for the events of the Humidity variable (high and normal):

Humidity-high Table
HumidityWindPlay
HighFalseNo
HighTrueNo
HighFalseNo
Humidity-normal Table
HumidityWindPlay
NormalFalseYes
NormalTrueYes

Since the play values are either all "No" or "Yes", the information gain ratio value will be equal to 1. Also, now that one has reached the end of the variable chain with Wind being the last variable left, they can build an entire root to leaf node branch line of a decision tree.

Outlook Sunny branch decision tree.drawio.png

Once finished with reaching this leaf node, one would follow the same procedure for the rest of the elements that have yet to be split in the decision tree. This set of data was relatively small, however, if a larger set was used, the advantages of using the information gain ratio as the splitting factor of a decision tree can be seen more.

Advantages

Information gain ratio biases the decision tree against considering attributes with a large number of distinct values.

For example, suppose that we are building a decision tree for some data describing a business's customers. Information gain ratio is used to decide which of the attributes are the most relevant. These will be tested near the root of the tree. One of the input attributes might be the customer's telephone number. This attribute has a high information gain, because it uniquely identifies each customer. Due to its high amount of distinct values, this will not be chosen to be tested near the root.

Disadvantages

Although information gain ratio solves the key problem of information gain, it creates another problem. If one is considering an amount of attributes that have a high number of distinct values, these will never be above one that has a lower number of distinct values.

Difference from information gain

Situational differences between information gain and information gain ratio
Information gainInformation gain ratio
Will not favor any attributes by number of distinct valuesWill favor attribute that have a lower number of distinct values
When applied to attributes that can take on a large number of distinct values, this technique might learn the training set too wellUser will struggle if required to find attributes requiring a high number of distinct values

See also

Related Research Articles

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to , the entropy is

<span class="mw-page-title-main">Beta distribution</span> Probability distribution

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.

<span class="mw-page-title-main">Helmholtz free energy</span> Thermodynamic potential

In thermodynamics, the Helmholtz free energy is a thermodynamic potential that measures the useful work obtainable from a closed thermodynamic system at a constant temperature (isothermal). The change in the Helmholtz energy during a process is equal to the maximum amount of work that the system can perform in a thermodynamic process in which temperature is held constant. At constant temperature, the Helmholtz free energy is minimized at equilibrium.

<span class="mw-page-title-main">Quantization (signal processing)</span> Process of mapping a continuous set to a countable set

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. Quantization is involved to some degree in nearly all digital signal processing, as the process of representing a signal in digital form ordinarily involves rounding. Quantization also forms the core of essentially all lossy compression algorithms.

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Quantum statistical mechanics is statistical mechanics applied to quantum mechanical systems. In quantum mechanics a statistical ensemble is described by a density operator S, which is a non-negative, self-adjoint, trace-class operator of trace 1 on the Hilbert space H describing the quantum system. This can be shown under various mathematical formalisms for quantum mechanics.

<span class="mw-page-title-main">Conditional entropy</span> Measure of relative information in probability theory

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known. Here, information is measured in shannons, nats, or hartleys. The entropy of conditioned on is written as .

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.

<span class="mw-page-title-main">ID3 algorithm</span> Decision tree algorithm

In decision tree learning, ID3 is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domain

In information theory and machine learning, information gain is a synonym for Kullback–Leibler divergence; the amount of information gained about a random variable or signal from observing another random variable. However, in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

<span class="mw-page-title-main">Binary entropy function</span>

In information theory, the binary entropy function, denoted or , is defined as the entropy of a Bernoulli process with probability of one of two values. It is a special case of , the entropy function. Mathematically, the Bernoulli trial is modelled as a random variable that can take on only two values: 0 and 1, which are mutually exclusive and exhaustive.

<span class="mw-page-title-main">Fundamental thermodynamic relation</span>

In thermodynamics, the fundamental thermodynamic relation are four fundamental equations which demonstrate how four important thermodynamic quantities depend on variables that can be controlled and measured experimentally. Thus, they are essentially equations of state, and using the fundamental equations, experimental data can be used to determine sought-after quantities like G or H (enthalpy). The relation is generally expressed as a microscopic change in internal energy in terms of microscopic changes in entropy, and volume for a closed system in thermal equilibrium in the following way.

<span class="mw-page-title-main">Quantities of information</span>

The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data-processing to refer to a binary value or stream regardless of its entropy Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.

In information theory, the limiting density of discrete points is an adjustment to the formula of Claude Shannon for differential entropy.

References

  1. Quinlan, J. Ross. "Induction of decision trees." Machine learning 1.1 (1986): 81-106.
  2. http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf [ bare URL PDF ]
  3. "Information gain, mutual information and related measures".
  4. https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff