TrueSkill is a skill-based ranking system developed by Microsoft for use with video game matchmaking on the Xbox network. Unlike the popular Elo rating system, which was initially designed for chess, TrueSkill is designed to support games with more than two players. [1] [2] [3] In 2018, Microsoft published details about an extended version of TrueSkill, named TrueSkill2. [4]
It is based on a Thurstonian model with a Gaussian score distribution. It does not satisfy Luce's Choice Axiom. [5]
TrueSkill was developed by researchers at Microsoft Research as a Bayesian skill rating system designed for online games in which players often compete in teams and matches may involve more than two players or teams. [2] In their first published description of the system, Ralf Herbrich, Tom Minka and Thore Graepel evaluated the method using match data collected by Bungie Studios during the beta test of Halo 2 and described its operation in the Xbox Live service. [2]
Christopher Bishop later wrote that TrueSkill was deployed on Xbox Live in 2005 and has operated continuously since then, processing millions of game outcomes per day. [6] Herbrich and colleagues reported that, as of September 2005, Xbox Live had over 2 million subscribed users and that the Xbox 360 Live service used TrueSkill for automatic player rating and matchmaking, processing hundreds of thousands of games per day. [2]
In a survey of ranking systems for games and assessments, Educational Testing Service researchers described TrueSkill as a generalization of Elo-style ratings for multiplayer and team competitions and noted its use in Microsoft's Xbox online games. [7] Later research from Microsoft proposed revisions and extensions to the original model, including TrueSkill2. [4]
In the model described by Herbrich, Minka and Graepel, each player's skill is treated as an unobserved (latent) quantity, and the system maintains a probability distribution over that skill. [2] The prior over player skills is assumed to factorise across players, with each individual skill modelled as a Gaussian distribution with mean and variance representing the system's uncertainty about that player's skill. [2]
Match outcomes are modelled as arising from noisy performances. For a given game, each player generates a latent performance value that is normally distributed around their skill , with a fixed performance variance parameter : [2] The parameter controls how strongly a single game's result is expected to vary around underlying skill (smaller implies less randomness in outcomes). [2]
For games with teams, the model assumes an additive team-performance structure: the latent performance of team is the sum of the performances of its members, , where is the set of players assigned to that team. [2] A match result is represented as a ranking of the teams (with ties possible), and the likelihood of the observed ranking is defined in terms of inequalities between the corresponding team-performance variables. [2]
Draws are incorporated by introducing a draw margin . For two adjacent teams in the ranking, a win is modelled by requiring the higher-ranked team to exceed the lower-ranked team by more than , while a draw is modelled by requiring their performance difference to lie within . [2] Herbrich and colleagues related to an assumed or empirically estimated draw probability, allowing the draw rate to be calibrated for a given game mode. [2]
TrueSkill represents the joint model of skills, performances and observed outcomes as a factor graph. [2] Inference is then framed as the computation of approximate single-variable marginals by message passing using the sum-product algorithm. [2] For a single match, most of the messages in the graph can be represented as one-dimensional Gaussians, which makes the update computationally efficient. [2]
A key difficulty is that the outcome constraints (wins and draws) introduce non-Gaussian terms. In the factor graph, these appear as comparison factors that impose inequalities on performance differences (and, for draws, on an interval around zero). [2] The corresponding exact messages are non-Gaussian; the TrueSkill paper applies expectation propagation (EP) to approximate these messages by moment matching, replacing the intractable truncated-Gaussian forms with Gaussian approximations that have the same mean and variance. [2] [8] Because the comparison messages are approximate, the algorithm iterates message updates along paths connecting the affected variables until the approximate marginals stabilise. [2]
After inference for a match, each player's updated marginal remains Gaussian and can be summarised by an updated mean and standard deviation for use in later matches. [2] The original system also includes a dynamics variance parameter (often written ) to model skill changes over time between games. [2]
TrueSkill maintains a Gaussian belief distribution for each player's skill, but practical deployments must also choose a numerical scale and values for model parameters such as the performance noise and the rate at which skills are allowed to drift over time. [2] Herbrich and colleagues reported that Xbox Live used an initial prior scale with and , corresponding to a prior in which negative skills are unlikely. [2]
In the same deployment description, the variance of a single-game performance was set relative to the prior uncertainty as , and the between-game dynamics variance (modelling skill changes over time) was set to . [2]
For player-facing displays such as leaderboards, Xbox Live displayed a conservative estimate rather than the posterior mean alone. The TrueSkill paper described displaying a player's skill as the 1% lower quantile of the belief distribution, . [2] Herbrich and colleagues wrote that this choice was intended to ensure that top positions in leaderboards were held by players who were both highly rated and estimated with high certainty, and noted that the prior implied an initial displayed value of . [2]
The following worked example illustrates how TrueSkill updates ratings after a three-player free-for-all match, using the model and win-update equations described by Herbrich, Minka and Graepel. [2] In TrueSkill, each player has a skill belief distribution , where is the mean skill estimate and is the uncertainty.
After a match, the system combines the prior belief with the match result to obtain a new belief distribution (the posterior). The posterior mean is the mean of that updated distribution, i.e., the system's best single-number estimate of the player's skill given the evidence so far. [2]
In the original Xbox Live scale described in the TrueSkill paper, a common choice for the performance noise is . [2] This example uses that value and assumes no draws.
In some deployments, a conservative skill estimate is displayed rather than the posterior mean. The TrueSkill paper describes displaying the 1% lower quantile of the belief distribution, which for a Gaussian distribution is approximately . [2] In plain terms, this is a value that the system expects the player's true skill to exceed about 99% of the time, given the current uncertainty. As a result, a player with a high mean but large uncertainty (large ) will have a lower displayed rating than a similarly rated player whose skill is estimated more confidently. [2]
Three players enter a match with the following prior ratings:
| Player | (before) | (before) | Displayed (before) |
|---|---|---|---|
| Alice | 30 | 6 | 12 |
| Ben | 25 | 7 | 4 |
| Chloe | 20 | 8 | −4 |
The game finishes with Chloe in first place, Alice in second place, and Ben in third place.
For a strict ranking with no ties, the factor-graph formulation represents the result using adjacent comparison constraints (Chloe beats Alice; Alice beats Ben). [2] There is no separate "Chloe beats Ben" factor in this minimal representation because it is implied by the ranking: if Chloe is above Alice and Alice is above Ben, then Chloe is above Ben.
In the full TrueSkill algorithm, message passing in the factor graph uses both adjacent constraints together to update all three players in a single inference problem (and may iterate). [2] The step-by-step calculations below apply the standard two-player win update to each adjacent comparison in sequence to make the mechanics transparent; Chloe's advantage over Ben is still reflected through the chain, because Chloe's win pushes Alice down, and Alice's win pushes Ben down.
For a winner and loser , define
Let and be the standard normal probability density and cumulative distribution functions. For a win (no draw margin), define
The mean updates are and the uncertainty decreases according to
Using Chloe as winner and Alice as loser, with :
Applying the update gives:
The second comparison uses Alice's updated rating from step 1, with Ben unchanged so far. Alice beats Ben :
Applying the update gives:
| Player | (before) | (after) | (before) | (after) | Displayed (before) | Displayed (after) |
|---|---|---|---|---|---|---|
| Alice | 30 | 27.66 | 6 | 4.89 | 12 | 12.97 |
| Ben | 25 | 21.48 | 7 | 5.97 | 4 | 3.56 |
| Chloe | 20 | 27.80 | 8 | 6.34 | −4 | 8.79 |
In this example, Chloe's first-place finish requires her performance to exceed Alice's, and Alice's to exceed Ben's, so the inferred skill distributions shift accordingly. Chloe's large increase in reflects that her win over a substantially higher-rated opponent was unexpected under the prior ratings, while the decreases in reflect that the match provides additional information about the players' relative skills.
In 2018, Microsoft researchers published TrueSkill2 as an extension of the original TrueSkill model that incorporates additional information available in some online games, beyond the final win–loss ordering. [4] The paper describes using signals such as player experience, squad membership, individual statistics (for example kill and death counts), quitting behaviour, and performance in other game modes, with the aim of improving the accuracy of inferred skills for matchmaking. [4]
The authors described TrueSkill2 as retaining the same interface as classic TrueSkill, a single-number skill rating intended to remain compatible with existing matchmaking systems, while changing how skills are inferred from historical data. [4] They described automatic parameter estimation from batches of historical matches, and two operating modes: an online mode that propagates skills forward in time and a batch mode that infers parameters and skills over all times (referred to as TrueSkill Through Time). [4]
Compared with classic TrueSkill, the paper lists model changes including: incorporating individual player statistics alongside team win/loss; treating mid-game quitting as a surrender for rating purposes; borrowing information across game modes by modelling skills as correlated between modes; and modelling skill evolution as biased towards improvement, especially in a player's early matches in a mode. [4] It adds an explicit squad effect, modelling players who queue together as performing better than they would individually. [4]
On match data from Halo 5, the authors reported that TrueSkill2 improved prediction of historical match outcomes, giving 68% accuracy on their evaluation compared with 52% for the original TrueSkill system. [4]
TrueSkill is part of a broader family of statistical rating systems used to infer relative strength from match outcomes. The Elo rating system, originally developed for chess, models win probabilities from rating differences and updates ratings after games based on the difference between observed and expected results. [9]
Glickman later proposed dynamic paired-comparison models that incorporate uncertainty in a player's estimated strength and allow that uncertainty to change over time, motivated in part by limitations of Elo-style updates for large competitive populations. [10] In a survey of ranking systems used in games and assessments, ETS researchers described TrueSkill as an Elo-style generalisation for multiplayer and team competitions that maintains uncertainty estimates alongside point ratings. [7]
Some related probabilistic models instead treat rankings as samples from a distribution that satisfies Luce's choice axiom. Guiver and Snelson note that a Thurstonian model with independent score variables induces a Plackett–Luce distribution if and only if the scores follow a Gumbel distribution, and they contrast this with TrueSkill, which uses Gaussian score variables. [5] They state that TrueSkill's Gaussian-score model does not satisfy Luce's choice axiom, while noting its use in large-scale online rating systems. [5]
Several research extensions build on TrueSkill's probabilistic formulation. Dangauthier, Herbrich, Minka and Graepel proposed TrueSkill Through Time, which replaces the original filtering-style updates with smoothing over a time series of player skills, allowing retrospective estimation of skill trajectories while retaining uncertainty estimates and an explicit draw model. [11]
TrueSkill is patented, [12] and the name is trademarked, so it is limited to Microsoft projects and commercial projects that obtain a license to use the algorithm. [13] The patent is scheduled to expire on April 9th, 2029. [14]
{{cite conference}}: CS1 maint: url-status (link){{cite web}}: CS1 maint: url-status (link){{cite tech report}}: CS1 maint: url-status (link){{cite conference}}: CS1 maint: url-status (link){{cite conference}}: CS1 maint: url-status (link){{cite book}}: CS1 maint: url-status (link){{cite journal}}: CS1 maint: url-status (link){{cite conference}}: CS1 maint: url-status (link)