After studying various games' MMR (Matchmaking Rating) systems for many hours, here are the conclusions that I've come to. I'm writing this up as much for myself as for others, both to understand things and to find any possible flaws before moving forward. I'm by no means an expert on stats. I got an A in a college statistics class when I was 16 or 17 and that's the absolute limit of my qualifications here. I won't resent any objections or pointing out of flaws, but please keep it civil and on-topic with the idea of moving this forward.
--
While many factors can be included, any successful modern MMR system needs two key elements, and I propose we use both: a rating, and an uncertainty (or its inverse, confidence). For our purposes, we can use a simplification of the first Glicko system, which is itself a more dynamic version of Elo.
Wondering how Elo is calculated? Here are a few of the more straightforward and less technical explanations I've found:
https://www.omnicalculator.com/sports/elo (Continue scrolling for explanation)
https://mattmazzola.medium.com/imple...m-a085f178e065
Ratings
Ratings are spread out according to a scale factor, which is tied to standard deviations of skill. In Elo, the scale is 400, representing two standard deviations.
Elo is used for chess. If one chess player is 1600 and another is 1200, there's a difference in skill of 400, representing the fact that the player at 1600 is expected to win (or draw) about 91% of the time over the player at 1200. (For 1400 and 1200, the 1400 player is expected to win/draw about 76% of the time.)
For what we're doing, it helps to have increased granularity. So I propose starting from a rating of 3000. Many games use a higher number like this, and it allows for a wider range of ratings. I'd also recommend using a scale of 800, or double Elo's scale, as we're using double the base rating. So a difference of 800 means 91% win probability for the higher-rated player, a difference of 400 means 76% probability, etc.
One other bonus to using 3000: it might possibly map somewhat cleanly to stars. (MMR * 2 / 1000) So 3000 starts you out at 6 stars. I think we'll see a few players above 5000, though, meaning they might be rated more like 11* or even 12*.
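Here's a tiny sketch of that star mapping, just to make the arithmetic concrete. The function name is my own invention, and whether stars would be rounded or shown with decimals is left open.

```python
def mmr_to_stars(mmr: float) -> float:
    """Map an MMR value to a star rating using MMR * 2 / 1000."""
    return mmr * 2 / 1000


# A few sample ratings; 3000 is the proposed starting MMR.
for mmr in (3000, 5000, 5500, 6000):
    print(mmr, "->", mmr_to_stars(mmr), "stars")
```

So 11* and 12* correspond to 5500 and 6000 MMR respectively under this mapping.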
Uncertainty
Uncertainty represents how accurate the system believes a player's rating to be. The system estimates that there's a 95% chance that the player's actual skill lies within their rating plus or minus this value. (In combination with a base ratings change factor, this replaces Elo's somewhat arbitrary K-factor, which is simply the maximum number of points a person can win or lose in a given match.)
Uncertainty usually starts out somewhere around two standard deviations in both directions. I'm proposing to go a bit further than that: about 2.5 standard deviations, or 1000. Starting out slightly more uncertain is just fine. We'll tighten as we go, but start out by using larger rating changes in quals. Starting from 3000, this means our system with maximum uncertainty of 1000 estimates with about 99% certainty that our player falls within the skill range of 2000 to 4000. (If skill is roughly normally distributed, plus or minus 2.5 standard deviations covers about 98.8%.) This will change as performance is evaluated. Basically, only about the best 0.6% and the worst 0.6% of the zone are expected to fall outside this range.
The idea of uncertainty is to continue to tighten up the range and grow more certain of the rating. So, with a rating of 3000 and an uncertainty value of 500, the system would estimate that the player's actual skill level is somewhere between 2500 and 3500, with (hopefully) 95% confidence. (This would happen if a player consistently won and lost about 50% of the time against teams with an average rating of 3000. The starting rating of 3000 was found to be accurate, and after more games are played, we keep confirming that to be true.)
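To make the ranges above easy to check, here's a minimal sketch. It assumes skill is normally distributed (my assumption, not something established in this post), and both function names are made up for illustration.

```python
import math


def normal_coverage(k: float) -> float:
    """Share of a normal distribution lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


def rating_range(rating: float, uncertainty: float) -> tuple[float, float]:
    """Range the system believes the true skill lies in: rating +/- uncertainty."""
    return rating - uncertainty, rating + uncertainty


print(rating_range(3000, 1000))        # (2000, 4000)
print(rating_range(3000, 500))         # (2500, 3500)
print(round(normal_coverage(2.0), 4))  # 0.9545
print(round(normal_coverage(2.5), 4))  # 0.9876
```

Under that normal assumption, two standard deviations covers about 95% and 2.5 standard deviations covers a bit under 99%.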
Lowering uncertainty narrows the range we believe a player's actual skill to be in. Uncertainty lowers as matches are played. In systems more advanced than what we're initially planning, it also lowers more when the predicted outcome of the match is accurate. (Even matches where the prediction was wrong will still generally lower uncertainty, because we're changing ratings in the proper direction reflecting the outcome of the match.) Changes in rating after each match become smaller over time. It might help to think of ratings changes as moving the center point of the curve, and then tightening the curve up with each iteration/match by lowering uncertainty.
Ratings changes
Four main factors influence ratings changes: the difference in rating between the two teams; the uncertainty of each player's rating; a base factor for ratings changes; and of course, whether or not it was a win or loss for the player.
Ratings difference and win/loss
The ratings difference can be represented simply enough using the standard Elo formula. The ratings of the players on each team are averaged to get a team strength, and the two team strengths are then run through the formula.
Example:
Code:
Team A: 3400
Team B: 3000
Ratings diff = Team A - Team B = 400
Scale factor: 800 (our chosen constant, equal to two standard deviations)
probA: The probability that Team A will win. 0.5 (50%) is even odds, which is what two teams with exactly the same rating get.
probB: The probability that Team B will win.

probA = 1 / (1 + 10^(-diff / scalefactor))
probA = 1 / (1 + 10^(-400 / 800))
probA = 1 / (1 + 10^(-0.5))
probA = 1 / (1 + 1/3.162)
probA = 1 / (1 + 0.3162)
probA = 0.7597

probB = 1 - probA
probB = 0.2403
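The same calculation as a short Python sketch, in case anyone wants to plug in their own numbers. The rosters are hypothetical and chosen so the team averages come out to 3400 and 3000; the function names are mine.

```python
from statistics import mean

SCALE = 800  # scale factor proposed in this post


def team_rating(player_ratings):
    """Team strength as the plain average of its players' ratings."""
    return mean(player_ratings)


def win_probability(rating_a, rating_b, scale=SCALE):
    """Elo expected score of side A against side B."""
    diff = rating_a - rating_b
    return 1 / (1 + 10 ** (-diff / scale))


team_a = [3600, 3400, 3200]  # hypothetical roster, averages 3400
team_b = [3100, 3000, 2900]  # hypothetical roster, averages 3000
prob_a = win_probability(team_rating(team_a), team_rating(team_b))
prob_b = 1 - prob_a
print(round(prob_a, 4), round(prob_b, 4))  # 0.7597 0.2403
```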
Uncertainty
Uncertainty starts high, and has a big impact on how much each rating changes. The idea is that ratings change more at the start of a player's season, reflecting the system's attempts to rapidly but still effectively place them. This is similar to how the Elo system uses arbitrary K-factors of 40 for new chess players (high uncertainty) and 10 for chess pros (low uncertainty). The main difference between Elo and a modern MMR system is that uncertainty lowers automatically rather than being manually assigned.
Changing ratings based on both ratings differences and uncertainty
In standard Elo, how much your rating changes after a match is the K-factor multiplied by the difference between the actual result (1 for a win, 0 for a loss) and your predicted probability of winning (probA or probB). K therefore represents the maximum amount your rating can change in a single match. In TWD, K=50, which is why you see the most unexpected victories/losses in TWD give/take close to 50 points. TWD uses classic Elo with a high K-factor. (K is the same for everyone in TWD, which is one flaw of the system.)
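For reference, the textbook Elo update is K x (actual score - expected score), where the actual score is 1 for a win and 0 for a loss. A minimal sketch with K = 50 as the default; the function name is just for illustration.

```python
def elo_change(win_prob: float, won: bool, k: float = 50) -> float:
    """Standard Elo update: K * (actual score - expected score)."""
    actual = 1.0 if won else 0.0
    return k * (actual - win_prob)


# With K = 50 (the value this post gives for TWD):
print(round(elo_change(0.76, won=True), 1))   # 12.0   favorite wins, small gain
print(round(elo_change(0.76, won=False), 1))  # -38.0  favorite loses, large loss
print(round(elo_change(0.24, won=True), 1))   # 38.0   underdog wins, large gain
```

The closer a result is to a complete upset, the closer the change gets to the full K.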
In a simple MMR system, the K-factor is instead determined by uncertainty multiplied by a constant or base factor.
Base factor
We need a base factor, then. Elo uses a K-factor of 40 for new players in a system that starts at 1500, so in a system that starts at 3000 and uses self-adjusting uncertainty, we can use something above 2 x 40 (perhaps 100-250). This base should be larger because we'll be reducing uncertainty over time, whereas in Elo, a K-factor of 40 is used for a player's first 30 matches. So we need to keep the final calculation (base factor x uncertainty) well above 80 for a good number of matches, at least 20 or so. Quals tend to use very high values to rapidly place a player, even if the placement is initially flawed. Note that quals don't tell a player what their MMR is until they've finished, because the rating fluctuates so much and is highly inaccurate during this time.
There are a few reasons I selected 1000 as our starting uncertainty value. It's within the range of expected deviation (~2.5x standard deviations), matches what other systems have done in attempting to qualify players quickly, and is a nice round number that makes calculations easy. For instance, 1000 = uncertainty of 1, or full uncertainty in a player's rating.
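Here's a rough sketch of how the base factor and uncertainty could combine into an effective K. The base factor of 150 is purely a placeholder I picked from inside the 100-250 range; nothing has been decided yet.

```python
MAX_UNCERTAINTY = 1000  # starting uncertainty proposed in this post
BASE_FACTOR = 150       # hypothetical placeholder; only "100-250" is suggested


def effective_k(uncertainty, base_factor=BASE_FACTOR):
    """Base factor scaled by normalized uncertainty (1000 -> 1.0)."""
    normalized = uncertainty / MAX_UNCERTAINTY
    return base_factor * normalized


for u in (1000, 500, 200):
    print(u, effective_k(u))  # 150.0, 75.0, 30.0
```

With this placeholder, the effective K drops below 80 once uncertainty falls under roughly 533, which is the kind of thing we'd need to check against how fast uncertainty decays.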
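Ratings change calculation (WIP)
Code:
ratings change = (result - probability of win) * (base factor * player's uncertainty)
result = 1 for a win, 0 for a loss, so the change is positive for a win and negative for a loss.
player's uncertainty is on the 0-1 scale, where 1000 = 1.
(Here, again, base factor * uncertainty is effectively replacing K.)
As uncertainty will be different for different members of a team, players on the same team will have different ratings adjustments.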
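Putting the pieces together, here's a sketch of per-player changes on one team. It assumes the Elo-style form (result - win probability), so that an expected win moves ratings less than an upset does. The base factor of 150, the player names, and the class layout are all hypothetical.

```python
from dataclasses import dataclass

MAX_UNCERTAINTY = 1000  # starting uncertainty from this post
BASE_FACTOR = 150       # hypothetical placeholder value
SCALE = 800             # scale factor from this post


@dataclass
class Player:
    name: str
    rating: float
    uncertainty: float


def win_probability(rating_a, rating_b, scale=SCALE):
    """Elo expected score of side A against side B."""
    return 1 / (1 + 10 ** (-(rating_a - rating_b) / scale))


def rating_change(player, win_prob, won, base_factor=BASE_FACTOR):
    """(result - win probability) * (base factor * normalized uncertainty)."""
    k = base_factor * player.uncertainty / MAX_UNCERTAINTY
    actual = 1.0 if won else 0.0
    return k * (actual - win_prob)


# Two teammates with the same rating but different uncertainty.
team_a = [Player("newcomer", 3400, 1000), Player("veteran", 3400, 300)]
prob_a = win_probability(3400, 3000)  # team averages: 3400 vs 3000
for p in team_a:
    print(p.name, round(rating_change(p, prob_a, won=True), 1))
# newcomer 36.0
# veteran 10.8
```

Same match, same result, but the newcomer's rating moves more than three times as far as the veteran's.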
Further work needed
The main two questions that still need to be answered to get a solid, working system are linked. One decides the other and vice versa.
- What is the base factor for ratings changes?
- How quickly does uncertainty lower based on games played?
I need to find a good formula for this, maybe a logarithm. We need to have it high enough at the start for quals to rapidly place a player. Then it needs to level off and decrease slowly, especially with many matches played, where it should hardly change at all.
Considering Elo's really basic method of making huge changes to K based on differences in play level/number of matches played, it's actually not that important to get this perfect. Elo still works. Because of this, I'm considering just using a lookup table with verified calculations to make sure it works well. If anyone's a math guy and is interested in finding a more elegant solution, let me know. Glicko and Glicko2 offer some options here but are a bit complicated for what we're trying to do.
I'm trying to keep it simple in order to get it operating decently from the start. We can then refine with time. This is how pretty much every game has developed their MM algo. (As an aside for the bored, I just learned today that the lead dev on the first Sonic game changed a key element of the game two weeks before the master release was finalized, so that even if the player had one single, solitary ring, hits wouldn't be fatal. Can you imagine how the game would have changed if you had to have 20+ rings to avoid a death? The point is, iteration is key. Everything is a work in progress in gamedev.)
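To give the uncertainty curve something concrete to argue about, here's one possible shape: exponential decay toward a floor. To be clear, this is an illustrative choice of mine, not a settled formula. The floor of 200 and the decay constant of 15 games are invented numbers that would need tuning.

```python
import math

START_UNCERTAINTY = 1000  # from this post
FLOOR_UNCERTAINTY = 200   # hypothetical: uncertainty never drops below this
DECAY_GAMES = 15          # hypothetical: how many games it takes to settle


def uncertainty_after(games_played: int) -> float:
    """Exponential decay from START toward FLOOR (one illustrative choice)."""
    span = START_UNCERTAINTY - FLOOR_UNCERTAINTY
    return FLOOR_UNCERTAINTY + span * math.exp(-games_played / DECAY_GAMES)


# A lookup table can be precomputed from any curve like this one.
LOOKUP = {n: round(uncertainty_after(n)) for n in (0, 5, 10, 20, 50, 100)}
print(LOOKUP)
```

It drops quickly during the first matches and barely moves after many games, which is the behavior described above. Swapping in a different curve only means changing one function, and the lookup table idea still works.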
Stretch goals
- Adjust the change to uncertainty by how far off the prediction was. If we see a big upset, such as a team winning a game they were predicted to have a 10% chance of winning, we shouldn't be as confident of the new ratings as we would be when the prediction was accurate.
- Allow increases of uncertainty in some cases where uncertainty is already low, such as if a player is consistently performing outside of their skill rating (either higher or lower), demonstrating that we may need to allow more substantial ratings changes until they once again level off and we can be certain they're placed where they should be. This is standard in many advanced systems.
- Increase uncertainty slowly for inactive accounts. (Glicko) This is done because the longer a player rusts up, the less reliable past ratings are.
- Use uncertainty of opponents as a factor in ratings adjustments. If we're uncertain of an opponent's rating, it makes sense to be more cautious about ratings changes. If we're very certain of an opponent's rating, obviously we can be much more certain that we're making a meaningful adjustment in the right direction.
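Two of these stretch goals are easy to sketch. The inactivity growth follows the shape Glicko uses, sqrt(u^2 + c^2 * t) capped at a maximum. The upset-aware shrink rule is something I made up for illustration. The constants c = 60 and base_shrink = 0.05 are hypothetical.

```python
import math

MAX_UNCERTAINTY = 1000  # starting (maximum) uncertainty from this post


def inactivity_uncertainty(uncertainty, periods_inactive, c=60):
    """Glicko-style growth: sqrt(u^2 + c^2 * t), capped at the maximum.

    c is a hypothetical tuning constant; t counts inactive periods.
    """
    grown = math.sqrt(uncertainty ** 2 + c ** 2 * periods_inactive)
    return min(grown, MAX_UNCERTAINTY)


def shrink_uncertainty(uncertainty, win_prob, won, base_shrink=0.05):
    """Shrink uncertainty less when the result was a surprise (made-up rule)."""
    actual = 1.0 if won else 0.0
    surprise = abs(actual - win_prob)  # 0 = fully expected, 1 = total upset
    return uncertainty * (1 - base_shrink * (1 - surprise))


print(round(inactivity_uncertainty(300, 10), 1))       # grows with time away
print(round(shrink_uncertainty(500, 0.9, won=True), 1))  # expected win: 477.5
print(round(shrink_uncertainty(500, 0.1, won=True), 1))  # big upset: 497.5
```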
That's it for now. Any and all comments and questions appreciated. It's taken a good bit to figure all this out and there are bound to be issues here and there. Getting the base factor and the change in uncertainty over time right is especially important.
If you're a hardcore nerd who loves this kind of stuff, or if you have statistical experience in general and would like to help make things work, reach out. We have a Slack channel dedicated to busting MMR/Elo for TW wide open and could use your input.