Category utility

Last updated

Category utility is a measure of "category goodness" defined in Gluck & Corter (1985) and Corter & Gluck (1992). It attempts to maximize both the probability that two objects in the same category have attribute values in common, and the probability that objects from different categories have different attribute values. It was intended to supersede more limited measures of category goodness such as "cue validity" (Reed 1972; Rosch & Mervis 1975) and "collocation index" ( Jones 1983 ). It provides a normative information-theoretic measure of the predictive advantage gained by the observer who possesses knowledge of the given category structure (i.e., the class labels of instances) over the observer who does not possess knowledge of the category structure. In this sense the motivation for the category utility measure is similar to the information gain metric used in decision tree learning. In certain presentations, it is also formally equivalent to the mutual information, as discussed below. A review of category utility in its probabilistic incarnation, with applications to machine learning, is provided in Witten & Frank (2005 , pp. 260–262).

Contents

Probability-theoretic definition of category utility

The probability-theoretic definition of category utility given in Fisher (1987) and Witten & Frank (2005) is as follows:

where is a size- set of -ary features, and is a set of categories. The term designates the marginal probability that feature takes on value , and the term designates the category-conditional probability that feature takes on value given that the object in question belongs to category .

The motivation and development of this expression for category utility, and the role of the multiplicand as a crude overfitting control, is given in the above sources. Loosely ( Fisher 1987 ), the term is the expected number of attribute values that can be correctly guessed by an observer using a probability-matching strategy together with knowledge of the category labels, while is the expected number of attribute values that can be correctly guessed by an observer the same strategy but without any knowledge of the category labels. Their difference therefore reflects the relative advantage accruing to the observer by having knowledge of the category structure.

Information-theoretic definition of category utility

The information-theoretic definition of category utility for a set of entities with size- binary feature set , and a binary category is given in Gluck & Corter (1985) as follows:

where is the prior probability of an entity belonging to the positive category (in the absence of any feature information), is the conditional probability of an entity having feature given that the entity belongs to category , is likewise the conditional probability of an entity having feature given that the entity belongs to category , and is the prior probability of an entity possessing feature (in the absence of any category information).

The intuition behind the above expression is as follows: The term represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to category . Similarly, the term represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to category . The sum of these two terms in the brackets is therefore the weighted average of these two costs. The final term, , represents the cost (in bits) of optimally encoding (or transmitting) feature information when no category information is available. The value of the category utility will, in the above formulation, be negative (???).

Category utility and mutual information

Gluck & Corter (1985) and Corter & Gluck (1992) mention that the category utility is equivalent to the mutual information. Here is a simple demonstration of the nature of this equivalence. Assume a set of entities each having the same features, i.e., feature set , with each feature variable having cardinality . That is, each feature has the capacity to adopt any of distinct values (which need not be ordered; all variables can be nominal); for the special case these features would be considered binary, but more generally, for any , the features are simply m-ary. For the purposes of this demonstration, without loss of generality, feature set can be replaced with a single aggregate variable that has cardinality , and adopts a unique value corresponding to each feature combination in the Cartesian product . (Ordinality does not matter, because the mutual information is not sensitive to ordinality.) In what follows, a term such as or simply refers to the probability with which adopts the particular value . (Using the aggregate feature variable replaces multiple summations, and simplifies the presentation to follow.)

For this demonstration, also assume a single category variable , which has cardinality . This is equivalent to a classification system in which there are non-intersecting categories. In the special case of there are the two-category case discussed above. From the definition of mutual information for discrete variables, the mutual information between the aggregate feature variable and the category variable is given by:

where is the prior probability of feature variable adopting value , is the marginal probability of category variable adopting value , and is the joint probability of variables and simultaneously adopting those respective values. In terms of the conditional probabilities this can be re-written (or defined) as

If the original definition of the category utility from above is rewritten with ,

This equation clearly has the same form as the (blue) equation expressing the mutual information between the feature set and the category variable; the difference is that the sum in the category utility equation runs over independent binary variables , whereas the sum in the mutual information runs over values of the single -ary variable . The two measures are actually equivalent then only when the features , are independent (and assuming that terms in the sum corresponding to are also added).

Insensitivity of category utility to ordinality

Like the mutual information, the category utility is not sensitive to any ordering in the feature or category variable values. That is, as far as the category utility is concerned, the category set {small,medium,large,jumbo} is not qualitatively different from the category set {desk,fish,tree,mop} since the formulation of the category utility does not account for any ordering of the class variable. Similarly, a feature variable adopting values {1,2,3,4,5} is not qualitatively different from a feature variable adopting values {fred,joe,bob,sue,elaine}. As far as the category utility or mutual information are concerned, all category and feature variables are nominal variables. For this reason, category utility does not reflect any gestalt aspects of "category goodness" that might be based on such ordering effects. One possible adjustment for this insensitivity to ordinality is given by the weighting scheme described in the article for mutual information.

Category "goodness": models and philosophy

This section provides some background on the origins of, and need for, formal measures of "category goodness" such as the category utility, and some of the history that lead to the development of this particular metric.

What makes a good category?

At least since the time of Aristotle there has been a tremendous fascination in philosophy with the nature of concepts and universals. What kind of entity is a concept such as "horse"? Such abstractions do not designate any particular individual in the world, and yet we can scarcely imagine being able to comprehend the world without their use. Does the concept "horse" therefore have an independent existence outside of the mind? If it does, then what is the locus of this independent existence? The question of locus was an important issue on which the classical schools of Plato and Aristotle famously differed. However, they remained in agreement that universals did indeed have a mind-independent existence. There was, therefore, always a fact to the matter about which concepts and universals exist in the world.

In the late Middle Ages (perhaps beginning with Occam, although Porphyry also makes a much earlier remark indicating a certain discomfort with the status quo), however, the certainty that existed on this issue began to erode, and it became acceptable among the so-called nominalists and empiricists to consider concepts and universals as strictly mental entities or conventions of language. On this view of concepts—that they are purely representational constructs—a new question then comes to the fore: "Why do we possess one set of concepts rather than another?" What makes one set of concepts "good" and another set of concepts "bad"? This is a question that modern philosophers, and subsequently machine learning theorists and cognitive scientists, have struggled with for many decades.

What purpose do concepts serve?

One approach to answering such questions is to investigate the "role" or "purpose" of concepts in cognition. Thus the answer to "What are concepts good for in the first place?" by Mill (1843 , p. 425) and many others is that classification (conception) is a precursor to induction : By imposing a particular categorization on the universe, an organism gains the ability to deal with physically non-identical objects or situations in an identical fashion, thereby gaining substantial predictive leverage (Smith & Medin 1981; Harnad 2005). As J.S. Mill puts it ( Mill 1843 , pp. 466–468),

The general problem of classification... [is] to provide that things shall be thought of in such groups, and those groups in such an order, as will best conduce to the remembrance and to the ascertainment of their laws... [and] one of the uses of such a classification that by drawing attention to the properties on which it is founded, and which, if the classification be good, are marks of many others, it facilitates the discovery of those others.

From this base, Mill reaches the following conclusion, which foreshadows much subsequent thinking about category goodness, including the notion of category utility:

The ends of scientific classification are best answered when the objects are formed into groups respecting which a greater number of general propositions can be made, and those propositions more important, than could be made respecting any other groups into which the same things could be distributed. The properties, therefore, according to which objects are classified should, if possible, be those which are causes of many other properties; or, at any rate, which are sure marks of them.

One may compare this to the "category utility hypothesis" proposed by Corter & Gluck (1992): "A category is useful to the extent that it can be expected to improve the ability of a person to accurately predict the features of instances of that category." Mill here seems to be suggesting that the best category structure is one in which object features (properties) are maximally informative about the object's class, and, simultaneously, the object class is maximally informative about the object's features. In other words, a useful classification scheme is one in which category knowledge can be used to accurately infer object properties, and property knowledge can be used to accurately infer object classes. One may also compare this idea to Aristotle's criterion of counter-predication for definitional predicates, as well as to the notion of concepts described in formal concept analysis.

Attempts at formalization

A variety of different measures have been suggested with an aim of formally capturing this notion of "category goodness," the best known of which is probably the "cue validity". Cue validity of a feature with respect to category is defined as the conditional probability of the category given the feature (Reed 1972;Rosch & Mervis 1975;Rosch 1978), , or as the deviation of the conditional probability from the category base rate (Edgell 1993;Kruschke & Johansen 1999), . Clearly, these measures quantify only inference from feature to category (i.e., cue validity), but not from category to feature, i.e., the category validity. Also, while the cue validity was originally intended to account for the demonstrable appearance of basic categories in human cognition—categories of a particular level of generality that are evidently preferred by human learners—a number of major flaws in the cue validity quickly emerged in this regard (Jones 1983;Murphy 1982;Corter & Gluck 1992, and others).

One attempt to address both problems by simultaneously maximizing both feature validity and category validity was made by Jones (1983) in defining the "collocation index" as the product , but this construction was fairly ad hoc (see Corter & Gluck 1992). The category utility was introduced as a more sophisticated refinement of the cue validity, which attempts to more rigorously quantify the full inferential power of a class structure. As shown above, on a certain view the category utility is equivalent to the mutual information between the feature variable and the category variable. It has been suggested that categories having the greatest overall category utility are those that are not only those "best" in a normative sense, but also those human learners prefer to use, e.g., "basic" categories ( Corter & Gluck 1992 ). Other related measures of category goodness are "cohesion" (Hanson & Bauer 1989;Gennari, Langley & Fisher 1989) and "salience" ( Gennari 1989 ).

Applications

See also

Related Research Articles

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to :

In probability theory, the central limit theorem (CLT) establishes that, in many situations, for identically distributed independent samples, the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

In mathematics, a power series is an infinite series of the form

A mathematical symbol is a figure or a combination of figures that is used to represent a mathematical object, an action on mathematical objects, a relation between mathematical objects, or for structuring the other symbols that occur in a formula. As formulas are entirely constituted with symbols of various types, many symbols are needed for expressing all mathematics.

<span class="mw-page-title-main">Law of large numbers</span> Averages of repeated trials converge to the expected value

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and tends to become closer to the expected value as more trials are performed.

<span class="mw-page-title-main">Hypergeometric distribution</span> Discrete probability distribution

In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the probability of successes in draws, without replacement, from a finite population of size that contains exactly objects with that feature, wherein each draw is either a success or a failure. In contrast, the binomial distribution describes the probability of successes in draws with replacement.

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

The information bottleneck method is a technique in information theory introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. It is designed for finding the best tradeoff between accuracy and complexity (compression) when summarizing a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y - and self-described as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning".

<span class="mw-page-title-main">Dirichlet distribution</span> Probability distribution

In probability and statistics, the Dirichlet distribution (after Peter Gustav Lejeune Dirichlet), often denoted , is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

The algebra of random variables in statistics, provides rules for the symbolic manipulation of random variables, while avoiding delving too deeply into the mathematically sophisticated ideas of probability theory. Its symbolism allows the treatment of sums, products, ratios and general functions of random variables, as well as dealing with operations such as finding the probability distributions and the expectations, variances and covariances of such combinations.

<span class="mw-page-title-main">Markov random field</span>

In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model.

Lottery mathematics is used to calculate probabilities of winning or losing a lottery game. It is based primarily on combinatorics, particularly the twelvefold way and combinations without replacement.

In probability theory and statistics, a categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution,. The K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

The partition function or configuration integral, as used in probability theory, information theory and dynamical systems, is a generalization of the definition of a partition function in statistical mechanics. It is a special case of a normalizing constant in probability theory, for the Boltzmann distribution. The partition function occurs in many problems of probability theory because, in situations where there is a natural symmetry, its associated probability measure, the Gibbs measure, has the Markov property. This means that the partition function occurs not only in physical systems with translation symmetry, but also in such varied settings as neural networks, and applications such as genomics, corpus linguistics and artificial intelligence, which employ Markov networks, and Markov logic networks. The Gibbs measure is also the unique measure that has the property of maximizing the entropy for a fixed expectation value of the energy; this underlies the appearance of the partition function in maximum entropy methods and the algorithms derived therefrom.

In probability theory and statistics, the Poisson binomial distribution is the discrete probability distribution of a sum of independent Bernoulli trials that are not necessarily identically distributed. The concept is named after Siméon Denis Poisson.

Geometric feature learning is a technique combining machine learning and computer vision to solve visual tasks. The main goal of this method is to find a set of representative features of geometric form to represent an object by collecting geometric features from images and learning them using efficient machine learning methods. Humans solve visual tasks and can give fast response to the environment by extracting perceptual information from what they see. Researchers simulate humans' ability of recognizing objects to solve computer vision problems. For example, M. Mata et al.(2002) applied feature learning techniques to the mobile robot navigation tasks in order to avoid obstacles. They used genetic algorithms for learning features and recognizing objects (figures). Geometric feature learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually some extracting features of images. Through learning, some hypothesis of the next action are given and according to the probability of each hypothesis give a most probable action. This technique is widely used in the area of artificial intelligence.

The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant.

References