Gambling and information theory

Statistical inference might be thought of as gambling theory applied to the world around us. The myriad applications for logarithmic information measures tell us precisely how to take the best guess in the face of partial information. [1] In that sense, information theory might be considered a formal expression of the theory of gambling. It is no surprise, therefore, that information theory has applications to games of chance. [2]

Kelly betting

Kelly betting or proportional betting is an application of information theory to investing and gambling. Its discoverer was John Larry Kelly, Jr.

Part of Kelly's insight was to have the gambler maximize the expectation of the logarithm of his capital, rather than the expected profit from each bet. This is important, since in the latter case, one would be led to gamble all he had when presented with a favorable bet, and if he lost, would have no capital with which to place subsequent bets. Kelly realized that it was the logarithm of the gambler's capital which is additive in sequential bets, and "to which the law of large numbers applies."
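To make the contrast concrete, here is a minimal simulation sketch; the even-money bet with a 60% win probability, the bet fractions, and the number of rounds are illustrative assumptions, not from the original text.

```python
import random

def simulate(bet_fraction, p_win=0.6, rounds=1000, seed=0):
    """Repeatedly stake a fixed fraction of capital on an even-money bet."""
    rng = random.Random(seed)
    capital = 1.0
    for _ in range(rounds):
        stake = bet_fraction * capital
        capital += stake if rng.random() < p_win else -stake
        if capital <= 0:
            return 0.0
    return capital

# Staking the whole bankroll maximizes the expected profit of a single bet,
# but the first loss wipes the gambler out; the Kelly fraction for even odds
# (2p - 1 = 0.2 here) maximizes the expected logarithm of capital instead.
print(simulate(bet_fraction=1.0))   # almost surely 0 after many rounds
print(simulate(bet_fraction=0.2))   # compounds at roughly 0.03 bits per bet
```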

Side information

A bit is the amount of entropy in a bettable event with two possible outcomes and even odds. Obviously we could double our money if we knew beforehand for certain what the outcome of that event would be. Kelly's insight was that no matter how complicated the betting scenario is, we can use an optimum betting strategy, called the Kelly criterion, to make our money grow exponentially with whatever side information we are able to obtain. The value of this "illicit" side information is measured as mutual information relative to the outcome of the bettable event:

$$\mathbb{E}_{Y}\!\left[ D_{\mathrm{KL}}\big(p(x \mid y)\,\big\|\,p(x \mid I)\big) \right] = \sum_{y} p(y) \sum_{x} p(x \mid y) \log_2 \frac{p(x \mid y)}{p(x \mid I)}$$

where Y is the side information, X is the outcome of the bettable event, and I is the state of the bookmaker's knowledge. This is the average Kullback–Leibler divergence, or information gain, of the a posteriori probability distribution of X given the value of Y relative to the a priori distribution, or stated odds, on X. Notice that the expectation is taken over Y rather than X: we need to evaluate how accurate, in the long term, our side information Y is before we start betting real money on X. This is a straightforward application of Bayesian inference. Note that the side information Y might affect not just our knowledge of the event X but also the event itself. For example, Y might be a horse that had too many oats or not enough water. The same mathematics applies in this case, because from the bookmaker's point of view, the occasional race fixing is already taken into account when he makes his odds.
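As a rough illustration, the following sketch computes this expected Kullback–Leibler divergence for a hypothetical two-horse race; the bookmaker's 50/50 prior and the 75%-accurate tip are made-up numbers.

```python
import math

def expected_kl(prior, posteriors, p_y):
    """Average KL divergence (in bits) of the posterior p(x|y) from the prior p(x),
    weighted by how often each value of the side information Y occurs."""
    total = 0.0
    for py, posterior in zip(p_y, posteriors):
        kl = sum(px_y * math.log2(px_y / px)
                 for px_y, px in zip(posterior, prior) if px_y > 0)
        total += py * kl
    return total

# Hypothetical two-horse race: the bookmaker's odds imply a 50/50 prior,
# while a tip Y that is right 75% of the time shifts our posterior.
prior = [0.5, 0.5]
posteriors = [[0.75, 0.25],   # Y says "horse 1"
              [0.25, 0.75]]   # Y says "horse 2"
p_y = [0.5, 0.5]
print(expected_kl(prior, posteriors, p_y))  # ~0.189 bits of usable side information
```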

The nature of side information is extremely finicky. We have already seen that it can affect the actual event as well as our knowledge of the outcome. Suppose we have an informer, who tells us that a certain horse is going to win. We certainly do not want to bet all our money on that horse just upon a rumor: that informer may be betting on another horse, and may be spreading rumors just so he can get better odds himself. Instead, as we have indicated, we need to evaluate our side information in the long term to see how it correlates with the outcomes of the races. This way we can determine exactly how reliable our informer is, and place our bets precisely to maximize the expected logarithm of our capital according to the Kelly criterion. Even if our informer is lying to us, we can still profit from his lies if we can find some reverse correlation between his tips and the actual race results.

Doubling rate

Doubling rate in gambling on a horse race is [3]

$$W(b, p) = \sum_{i=1}^{m} p_i \log_2(b_i o_i)$$

where there are $m$ horses, the probability of the $i$th horse winning being $p_i$, the proportion of wealth bet on the $i$th horse being $b_i$, and the odds (payoff) being $o_i$ (e.g., $o_i = 2$ if the $i$th horse winning pays double the amount bet). This quantity is maximized by proportional (Kelly) gambling:

$$b = p$$

for which

$$\max_b W(b, p) = \sum_{i} p_i \log_2 o_i - H(p)$$

where $H(p)$ is the information entropy of the distribution $p$.
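A small numerical check of these formulas, using made-up win probabilities and payoffs for a three-horse race:

```python
import math

def doubling_rate(b, p, o):
    """W(b, p) = sum_i p_i * log2(b_i * o_i): expected log2 wealth growth per race."""
    return sum(pi * math.log2(bi * oi) for pi, bi, oi in zip(p, b, o))

# Illustrative three-horse race: true win probabilities p and bookmaker payoffs o.
p = [0.5, 0.3, 0.2]
o = [2.0, 4.0, 8.0]   # made-up payoffs per unit bet
entropy = -sum(pi * math.log2(pi) for pi in p)

print(doubling_rate(p, p, o))                                      # Kelly (b = p): ~0.215
print(sum(pi * math.log2(oi) for pi, oi in zip(p, o)) - entropy)   # same value, via the max formula
print(doubling_rate([1/3, 1/3, 1/3], p, o))                        # any other b is lower: ~0.115
```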

Expected gains

An important but simple relation exists between the amount of side information a gambler obtains and the expected exponential growth of his capital (Kelly):

$$\mathbb{E}\!\left[\log_2 K_t\right] = \log_2 K_0 + \sum_{i=1}^{t} H_i$$

for an optimal betting strategy, where $K_0$ is the initial capital, $K_t$ is the capital after the $t$th bet, and $H_i$ is the amount of side information obtained concerning the $i$th bet (in particular, the mutual information relative to the outcome of each bettable event). This equation applies in the absence of any transaction costs or minimum bets. When these constraints apply (as they invariably do in real life), another important gambling concept comes into play: the gambler (or unscrupulous investor) must face a certain probability of ultimate ruin, which is known as the gambler's ruin scenario. Note that even food, clothing, and shelter can be considered fixed transaction costs and thus contribute to the gambler's probability of ultimate ruin.
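The relation can be sanity-checked by simulation. The sketch below assumes an even-odds bet on a fair binary event together with a tip that is correct 75% of the time, so the side information per bet is 1 − H_b(0.75) ≈ 0.19 bits; the numbers are illustrative.

```python
import math, random

def binary_entropy(q):
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def simulate_log_growth(tip_accuracy=0.75, bets=10_000, seed=1):
    """Kelly-bet a fair, even-odds binary event using a tip that is right with
    probability tip_accuracy; return the average log2 capital growth per bet."""
    rng = random.Random(seed)
    log_capital = 0.0   # log2(K_0) with K_0 = 1
    for _ in range(bets):
        tip_correct = rng.random() < tip_accuracy
        # Proportional (Kelly) betting: split the bankroll as b = p(x|y), i.e. put
        # tip_accuracy on the tipped side; the winning side pays 2-for-1, so the
        # capital is multiplied by 2*tip_accuracy or 2*(1 - tip_accuracy).
        factor = 2 * (tip_accuracy if tip_correct else 1 - tip_accuracy)
        log_capital += math.log2(factor)
    return log_capital / bets

# The long-run growth per bet matches the side information 1 - H_b(0.75).
print(simulate_log_growth())      # ~0.19 bits per bet
print(1 - binary_entropy(0.75))   # 0.1887... bits
```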

This equation was the first application of Shannon's theory of information outside its prevailing paradigm of data communications (Pierce).

Applications for self-information

Figure: Surprisal and evidence in bits, as logarithmic measures of probability and odds respectively.

The logarithmic probability measure self-information or surprisal, [4] whose average is information entropy/uncertainty and whose average difference is KL-divergence, has applications to odds-analysis all by itself. Its two primary strengths are that surprisals: (i) reduce minuscule probabilities to numbers of manageable size, and (ii) add whenever probabilities multiply.

For example, one might say that "the number of states equals two to the number of bits", i.e. #states = 2^#bits. Here the quantity that's measured in bits is the logarithmic information measure mentioned above. Hence there are N bits of surprisal in landing all heads on one's first toss of N coins.
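In code, with the usual definition of surprisal as −log2 of the probability (the example values are arbitrary):

```python
import math

def surprisal_bits(p):
    """Self-information in bits: s = -log2(p), so #states = 2**#bits."""
    return -math.log2(p)

# N fair coins have 2**N equally likely outcomes, so "all heads" carries N bits.
N = 10
print(surprisal_bits(0.5 ** N))   # 10.0 bits
```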

The additive nature of surprisals, and one's ability to get a feel for their meaning with a handful of coins, can help one put improbable events (like winning the lottery, or having an accident) into context. For example if one out of 17 million tickets is a winner, then the surprisal of winning from a single random selection is about 24 bits. Tossing 24 coins a few times might give you a feel for the surprisal of getting all heads on the first try.
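The lottery figure is just a base-2 logarithm:

```python
import math

# One winning ticket among 17 million: surprisal of a single random selection.
print(math.log2(17_000_000))   # ~24.02 bits, comparable to 24 heads in a row
```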

The additive nature of this measure also comes in handy when weighing alternatives. For example, imagine that the surprisal of harm from a vaccination is 20 bits. If the surprisal of catching the disease without it is 16 bits, but the surprisal of harm from the disease if you do catch it is 2 bits, then the surprisal of harm from NOT getting the vaccination is only 16 + 2 = 18 bits. Whether or not you decide to get the vaccination (the monetary cost of it, for example, is not included in this discussion), you can in that way at least take responsibility for a decision informed by the fact that not getting the vaccination involves more than one bit of additional risk.
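The bookkeeping in that paragraph, using the same hypothetical surprisal values, looks like this:

```python
# All surprisal values (in bits) are the hypothetical figures from the text.
harm_from_vaccine    = 20
catch_disease        = 16
harm_given_disease   = 2
harm_without_vaccine = catch_disease + harm_given_disease   # 18 bits

# Two fewer bits of surprisal means four times the probability of harm.
print(2.0 ** -harm_from_vaccine, 2.0 ** -harm_without_vaccine)
print(2.0 ** (harm_from_vaccine - harm_without_vaccine))    # 4.0
```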

More generally, one can relate probability p to bits of surprisal sbits as probability = 1/2^sbits. As suggested above, this is mainly useful with small probabilities. However, Jaynes pointed out that with true-false assertions one can also define bits of evidence ebits as the surprisal against minus the surprisal for. This evidence in bits relates simply to the odds ratio = p/(1-p) = 2^ebits, and has advantages similar to those of self-information itself.
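Both conversions are one-liners; a small sketch (the probabilities below are arbitrary examples):

```python
import math

def sbits(p):
    """Bits of surprisal: probability = 1 / 2**sbits."""
    return -math.log2(p)

def ebits(p):
    """Bits of evidence (Jaynes): surprisal against minus surprisal for,
    i.e. the base-2 log of the odds ratio p / (1 - p)."""
    return math.log2(p / (1 - p))

for p in (0.5, 0.9, 0.99):
    print(p, round(sbits(p), 3), round(ebits(p), 3))
# p = 0.5 gives 0 ebits (even odds); p = 0.99 gives ~6.63 ebits but only ~0.014 sbits.
```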

Applications in games of chance

Information theory can be thought of as a way of quantifying information so as to make the best decision in the face of imperfect information. That is, how to make the best decision using only the information you have available. The point of betting is to rationally assess all relevant variables of an uncertain game/race/match, compare them to the bookmaker's assessments, which usually come in the form of odds or spreads, and place the proper bet if the assessments differ sufficiently. [5] The area of gambling where this has the most use is sports betting. Sports handicapping lends itself to information theory extremely well because of the availability of statistics. For many years noted economists have tested different mathematical theories using sports as their laboratory, with vastly differing results.
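As a sketch of that comparison (the probability estimate and the decimal odds below are hypothetical), the Kelly criterion turns the difference between our assessment and the bookmaker's into a stake size:

```python
def kelly_fraction(p, decimal_odds):
    """Fraction of bankroll to stake when we assign probability p to an outcome
    paying decimal_odds per unit staked (net odds b = decimal_odds - 1)."""
    b = decimal_odds - 1
    return (p * b - (1 - p)) / b

p_ours = 0.55   # our assessment of the outcome
odds = 2.10     # bookmaker's decimal odds, implying a probability of ~0.476
f = kelly_fraction(p_ours, odds)
print(f)        # ~0.14 of the bankroll; stake nothing if the edge is negative
```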

One theory regarding sports betting is that it is a random walk: new information, prices, and returns fluctuate by chance, which is part of the efficient-market hypothesis. The underlying belief of the efficient-market hypothesis is that the market always adjusts for any new information; therefore no one can beat the market, because everyone is trading on the same information to which the market has already adjusted. However, according to Fama, [6] three conditions need to be met for a market to be efficient:

1. There are no transaction costs in trading securities.
2. All available information is costlessly available to all market participants.
3. All participants agree on the implications of current information for the current price and the distribution of future prices of each security.

Statisticians have shown that it is the third condition which allows information theory to be useful in sports handicapping: when participants do not agree on how information will affect the outcome of the event, differing opinions, and hence differing assessments of the odds, arise.


References

  1. Jaynes, E.T. (1998/2003) Probability Theory: The Logic of Science (Cambridge U. Press, New York).
  2. Kelly, J. L. (1956). "A New Interpretation of Information Rate" (PDF). Bell System Technical Journal . 35 (4): 917–926. doi:10.1002/j.1538-7305.1956.tb03809.x. Archived from the original (PDF) on 2019-04-27. Retrieved 2019-09-05.
  3. Thomas M. Cover, Joy A. Thomas. Elements of information theory, 1st Edition. New York: Wiley-Interscience, 1991. ISBN   0-471-06259-6, Chapter 6.
  4. Tribus, Myron (1961) Thermodynamics and Thermostatics: An Introduction to Energy, Information and States of Matter, with Engineering Applications (D. Van Nostrand Company Inc., 24 West 40 Street, New York 18, New York, U.S.A) ASIN: B000ARSH5S.
  5. Hansen, Kristen Brinch (2006). Sports Betting from a Behavioral Finance Point of View Archived 2018-09-20 at the Wayback Machine (Aarhus School of Business).
  6. Fama, E. F. (1970). "Efficient Capital Markets: A Review of Theory and Empirical Work", Journal of Finance, 25 (2), 383–417.