The ugly duckling theorem is an argument showing that classification is not really possible without some sort of bias. More particularly, it assumes finitely many properties combinable by logical connectives, and finitely many objects; it asserts that any two different objects share the same number of (extensional) properties. The theorem is named after Hans Christian Andersen's 1843 story "The Ugly Duckling", because it shows that a duckling is just as similar to a swan as two swans are to each other. It was derived by Satosi Watanabe in 1969.[1]: 376–377
Suppose there are n things in the universe, and one wants to put them into classes or categories. One has no preconceived ideas or biases about what sorts of categories are "natural" or "normal" and what are not. So one has to consider all the possible classes that could be, all the possible ways of making a set out of the n objects. There are 2^n such ways, the size of the power set of n objects. One might try to use this to measure the similarity between two objects by counting how many classes they share. However, this does not work: if any possible class can be formed, any two objects have exactly the same number of classes in common, namely 2^(n−1) (half the total number of classes). To see that this is so, imagine each class represented by an n-bit string (or binary encoded integer), with a zero for each element not in the class and a one for each element in the class; there are 2^n such strings.
Since all possible combinations of zeros and ones occur, any two bit positions will agree exactly half the time. Pick two elements and reorder the bits so they occupy the first two positions, and imagine the 2^n numbers sorted lexicographically. The first 2^(n−1) numbers will have bit #1 set to zero, and the second 2^(n−1) will have it set to one. Within each of those blocks, the top 2^(n−2) will have bit #2 set to zero and the remaining 2^(n−2) will have it set to one, so the two bits agree on two blocks of 2^(n−2) strings each, i.e. on half of all cases, no matter which two elements are picked. So if we have no preconceived bias about which categories are better, everything is equally similar (or equally dissimilar): the number of predicates simultaneously satisfied by two non-identical elements is constant over all such pairs. Thus, some kind of inductive bias is needed to prefer certain categories over others.
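This counting argument can be checked by brute force for small n. The sketch below (with a hypothetical helper, not from any library) enumerates all 2^n classes as bit masks and confirms that every pair of distinct objects agrees, i.e. is jointly included or jointly excluded, on exactly 2^(n−1) of them:

```python
from itertools import combinations

def shared_class_count(n, a, b):
    """Count the classes (subsets of {0, ..., n-1}, encoded as bit masks)
    on which objects a and b agree: both members or both non-members."""
    count = 0
    for mask in range(2 ** n):
        in_a = (mask >> a) & 1
        in_b = (mask >> b) & 1
        if in_a == in_b:
            count += 1
    return count

n = 5
# Every pair of distinct objects agrees on exactly 2**(n - 1) = 16 of the 2**n = 32 classes.
print({(a, b): shared_class_count(n, a, b) for a, b in combinations(range(n), 2)})
```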
Let x_1, x_2, ..., x_n be a set of n vectors of k booleans each. The ugly duckling is the vector which is least like the others. Given the booleans, this can be computed using Hamming distance.
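For instance, a minimal Python sketch (the function names here are illustrative) sums the Hamming distances from each vector to all the others and returns the one farthest from the rest:

```python
def hamming(u, v):
    """Number of positions at which two boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def ugly_duckling(vectors):
    """Index of the vector with the largest total Hamming distance
    to the others, i.e. the vector least like the rest."""
    totals = [sum(hamming(v, w) for w in vectors) for v in vectors]
    return max(range(len(vectors)), key=lambda i: totals[i])

# Three similar "swans" and one outlier described by k = 3 boolean features.
birds = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 1, 1)]
print(ugly_duckling(birds))  # 3: the last vector differs most from the others
```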
However, the choice of boolean features to consider could have been somewhat arbitrary. Perhaps there were features derivable from the original features that were important for identifying the ugly duckling. The set of booleans in the vector can be extended with new features computed as boolean functions of the k original features. The only canonical way to do this is to extend it with all possible Boolean functions. The resulting completed vectors have 2^(2^k) features. The ugly duckling theorem states that there is no ugly duckling because any two completed vectors will either be equal or differ in exactly half of the features.
Proof. Let x and y be two vectors. If they are the same, then their completed vectors must also be the same, because any Boolean function of x agrees with the same Boolean function of y. If x and y are different, then there exists a coordinate i where the i-th coordinate of x differs from the i-th coordinate of y. Now the completed features contain every Boolean function on k Boolean variables, each exactly once. Viewing these Boolean functions as polynomials in k variables over GF(2), segregate the functions into pairs (f, g) where f contains the i-th variable as a linear term and g is f without that linear term. Then f(x) = g(x) + x_i and f(y) = g(y) + y_i over GF(2); since x_i ≠ y_i, the vectors x and y agree on f precisely when they disagree on g, so for every such pair (f, g) they agree on exactly one of the two functions and disagree on the other. (This proof is believed to be due to Watanabe.)
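Both the statement and the proof can be verified exhaustively for small k. The following sketch (the helper names are assumptions of this example, not a standard API) represents each Boolean function by its truth table over the 2^k possible inputs, builds the completed vector of all 2^(2^k) function values, and checks that any two completed vectors are either equal or differ in exactly half of the features:

```python
from itertools import product

def complete(x):
    """Extend a k-bit vector x to the values of all 2**(2**k) Boolean
    functions of k variables, each represented by its truth table."""
    k = len(x)
    inputs = list(product((0, 1), repeat=k))   # all 2**k possible inputs
    index = inputs.index(tuple(x))             # position of x among the inputs
    return [table[index] for table in product((0, 1), repeat=2 ** k)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

k = 3
half = 2 ** (2 ** k) // 2
for x in product((0, 1), repeat=k):
    for y in product((0, 1), repeat=k):
        d = hamming(complete(x), complete(y))
        assert d == (0 if x == y else half)
print("verified: completed vectors are equal or differ in exactly", half, "features")
```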
A possible way around the ugly duckling theorem would be to introduce a constraint on how similarity is measured by limiting the properties involved in classification, for instance, between A and B. However, Medin et al. (1993) point out that this does not actually resolve the arbitrariness or bias problem, since in what respects A is similar to B "varies with the stimulus context and task, so that there is no unique answer to the question of how similar is one object to another".[3][5] For example, "a barberpole and a zebra would be more similar than a horse and a zebra if the feature striped had sufficient weight. Of course, if these feature weights were fixed, then these similarity relations would be constrained". Yet the property "striped" as a weight 'fix' or constraint is itself arbitrary, meaning: "unless one can specify such criteria, then the claim that categorization is based on attribute matching is almost entirely vacuous".
Stamos (2003) remarked that some judgments of overall similarity are non-arbitrary in the sense that they are useful:
"Presumably, people's perceptual and conceptual processes have evolved that information that matters to human needs and goals can be roughly approximated by a similarity heuristic... If you are in the jungle and you see a tiger but you decide not to stereotype (perhaps because you believe that similarity is a false friend), then you will probably be eaten. In other words, in the biological world stereotyping based on veridical judgments of overall similarity statistically results in greater survival and reproductive success." [6]
Unless some properties are considered more salient, or 'weighted' as more important than others, everything will appear equally similar; hence Watanabe (1986) wrote: "any objects, in so far as they are distinguishable, are equally similar".[7]
In a weaker setting that assumes infinitely many properties, Murphy and Medin (1985) give an example of two putatively classified things, plums and lawnmowers:
"Suppose that one is to list the attributes that plums and lawnmowers have in common in order to judge their similarity. It is easy to see that the list could be infinite: Both weigh less than 10,000 kg (and less than 10,001 kg), both did not exist 10,000,000 years ago (and 10,000,001 years ago), both cannot hear well, both can be dropped, both take up space, and so on. Likewise, the list of differences could be infinite… any two entities can be arbitrarily similar or dissimilar by changing the criterion of what counts as a relevant attribute." [8]
According to Woodward, [9] the ugly duckling theorem is related to Schaffer's Conservation Law for Generalization Performance, which states that all algorithms for learning of boolean functions from input/output examples have the same overall generalization performance as random guessing. [10] The latter result is generalized by Woodward to functions on countably infinite domains. [11]