What if we ranked teams with a different system?
May 25, 2016 by Cody Mills in Analysis
Anybody who has even casually followed USA Ultimate’s college and club series since 2011 has surely heard somebody complain about the algorithm that USAU uses to rank teams and distribute bids. Opinions have ranged from neutrality (“just win all your games”) to tacit endorsement (“at least they don’t use the previous year’s nationals results anymore”) to harsh criticism, only occasionally coupled with an alternate proposal (shout out to probabilistic models).
But for all of its controversy, the USAU algorithm is generally competent at evaluating the relative strength of teams using only their box score results; the (rightful) point of contention is in how that algorithm, and its weaknesses, are leveraged as a part of the nationals bid distribution system. However, while there is enough fuel in the prior statement to power a series of articles, this piece will focus more on applying another common ranking tool — one used in many other competitions — to rank college ultimate teams: Elo Ratings.
Chess master and physicist Arpad Elo originally created his eponymous algorithm to improve how the US Chess Federation ranked its wide pool of competitive players. The ratings were designed for the task of ranking the relative skill levels of players in situations where particular players may never meet head to head.
Elo ratings are still used to rank chess players today, and the algorithm has found its way to the desks of many other sports statisticians, producing Elo ratings for world football and the NFL, as well as applications ranking other large groups of competitors in games like Go and League of Legends.
The concept of the algorithm is quite simple: a player’s rating is re-evaluated after every contest (or collection of contests) based on the rating of his opponent and the result of the match. All matches are evaluated chronologically so that the rating change derived from the match is a function of what the players’ ratings were at the time rather than what they might end up being later (a notable difference from USAU). Further, the rating change that occurs after a game is a function of the following:
- Relative rating of the teams. Based on the initial ratings of the players, Elo ratings calculate a win expectation percentage. The rating points awarded or subtracted are relative to the result of the game vs. the expected outcome. A victorious team that had a 10% chance of winning will earn more points than a team that won as a 90% favorite. The win expectation formula can be tuned, but a common choice makes a 400-point rating differential correspond to roughly a 91% win expectation (ten-to-one odds) for the favorite.
- Margin of victory (in many variations). In competitions where margin of victory (MOV) is quantifiable (ultimate, but not chess), the point change will be correlated with MOV. Different sports have different implementations of their MOV multiplier. Many implementations will award a high margin of victory with a high multiplier, but then lower it again if the winning team was rated much higher than the losing team (a big win by a favorite gets less weight than a big win in an expected toss-up).
- The update parameter (or k-value). The k-value represents how much influence the latest result has on one’s ranking change. High k’s mean that the latest game will have a larger effect on a player’s rating, while a low k lightens the influence.
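Putting those three ingredients together, a single-game Elo update can be sketched in a few lines of Python. This is a generic sketch, not USAU's or any official implementation; the 400-point scale is the conventional chess choice, and the MOV multiplier is left as a plug-in parameter:

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Win probability for side A under the logistic Elo curve.

    With the conventional scale of 400, a 400-point favorite is
    expected to win roughly 91% of the time (ten-to-one odds).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def update(rating_w, rating_l, k=20.0, mov_multiplier=1.0):
    """Return new (winner, loser) ratings after one game.

    k controls update speed; mov_multiplier scales the change by
    margin of victory (1.0 when MOV is ignored, as in chess).
    """
    exp_w = expected_score(rating_w, rating_l)
    # Winner's actual score is 1; the change is proportional to the surprise.
    delta = k * mov_multiplier * (1.0 - exp_w)
    return rating_w + delta, rating_l - delta
```

With k = 20, a 10% underdog that wins gains 20 × 0.9 = 18 points, while a 90% favorite that wins gains only 2: the reward scales with how surprising the result was.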
Properties of the Model
- Each player's performance in a match is a normally distributed random variable centered on their true skill level.
- New players enter the rankings at a fixed "average" rating (generally 1500).
- A player can never hurt their rating by winning. The size of the benefit may be affected by margin of victory and the relative rating difference between the players, but the rating will never decrease. This differs from the USAU model, where a close loss may help or hurt a ranking.
- In the same way, a loss can never help a player's rating. The size of the penalty may vary with the margin of loss and the relative rating difference between the players, but the rating will never increase.
- Ratings are state functions. A player's future rating is independent of past results: it is conditioned only on the current rating and the performance in the next match. This contrasts with the USA Ultimate rankings, where the future rating is recalculated from the new result plus all past results.
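The monotonicity properties follow directly from the shape of the update: because the win expectation is strictly between 0 and 1, the winner's change is strictly positive and the loser's strictly negative, no matter the matchup. A small sketch (generic Elo formulas, illustrative ratings only):

```python
def expected_score(r_a, r_b, scale=400.0):
    # Logistic Elo win expectation for side A.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def delta_for_winner(r_w, r_l, k=20.0):
    # Rating change applied to the winner; the loser receives the negation.
    # Since 0 < expected_score < 1, this is always strictly positive.
    return k * (1.0 - expected_score(r_w, r_l))

# A heavy favorite still gains a sliver of rating by winning...
print(delta_for_winner(2100, 1500))  # small but strictly positive
# ...while a huge underdog that wins gains nearly the full k:
print(delta_for_winner(1500, 2100))
```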
Weaknesses of the Model
- The performance curve. The assumption of a normal distribution of performance, while likely accurate in the limit, introduces bias into the system.
- Provisional rankings. When a player first enters the system, they are given the "average" rating. If the player's true skill is much higher or lower than average, then for the player's first n games their opponents are disproportionately rewarded or penalized.
- Tuning the k-factor. There is no definitive guide for how to tune the update speed of the algorithm. While in general high-sample sports like baseball lend themselves to low values, and lower-sample sports like American football favor faster updates, much of the guidance on update speed is simply heuristic analysis by the investigator.
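The k-factor tradeoff is easy to see directly: feed the same game sequence through the update with two different k values, and the high-k rating swings much further from its starting point. A toy illustration (generic Elo formulas; the opponents, k values, and win/loss streak are made up):

```python
def expected_score(r_a, r_b, scale=400.0):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def run_season(k, results, start=1500.0):
    """Apply a win/loss sequence, each game against a fixed 1500-rated opponent."""
    r = start
    for won in results:
        exp = expected_score(r, 1500.0)
        r += k * ((1.0 if won else 0.0) - exp)
    return r

streak = [True, True, True, False]  # three wins, then a loss
print(run_season(k=10, results=streak))   # moves gently
print(run_season(k=40, results=streak))   # same games, bigger swings
```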
Adaptations for Ultimate
- In order to avoid the bias of provisional rankings, teams were given initial Elo ratings based on the prior year's USAU ranking. The top team in the 2015 USAU rankings (Pittsburgh) was initialized to 1700, with each subsequent team initialized one point lower. Teams that were not ranked in 2015 received the average rating of 1500 when they entered the system. The women's division received a similar initialization, but starting at 1650, so that the average initialized rating was also 1500.[1][2][3]
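That seeding scheme amounts to a simple mapping from the prior year's ordered ranking list to initial ratings. A sketch, with the 1700 top seed and one-point step taken from the text and hypothetical team names standing in for the real list:

```python
def initial_elos(prior_year_ranking, top=1700.0, step=1.0, default=1500.0):
    """Seed each returning team from its prior-year rank; newcomers get 1500."""
    seeds = {team: top - i * step for i, team in enumerate(prior_year_ranking)}
    def rating_for(team):
        return seeds.get(team, default)
    return rating_for

# "Team B" and "Unranked U" are placeholders, not real 2015 rankings.
rating_for = initial_elos(["Pittsburgh", "Team B", "Team C"])
rating_for("Pittsburgh")  # 1700.0
rating_for("Team B")      # 1699.0
rating_for("Unranked U")  # 1500.0
```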
Below are the Elo rankings generated for the 2016 season, including Conferences and Regionals.
| Rank | Team | Rating |
| --- | --- | --- |
| 9 | Case Western Reserve | 1888.41 |
| 34 | Lewis & Clark | 1795.67 |
[1] The win expectation formula was tweaked so that a 300-point differential corresponds to roughly a 90% expectation.
[2] The MOV multiplier used was log(point_diff + 1) * (2.2 / (elo_diff * 0.001 + 2.2)).
[3] The k-factor used was 20, which is comparable to the factor used in models for sports that play roughly 20 games per season.
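Taken together, the footnotes specify a concrete update rule. A sketch of that tuned update, under stated assumptions: a divisor of 300 on the expectation curve, the log-based MOV multiplier with the winner's rating edge as elo_diff, k = 20, and a natural log (the footnote's log base is not specified):

```python
import math

def expected_score(r_a, r_b, scale=300.0):
    # Divisor tuned so a 300-point favorite has roughly a 90% expectation.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def mov_multiplier(point_diff, winner_elo_diff):
    # log(point_diff + 1) * 2.2 / (elo_diff * 0.001 + 2.2), per the footnote.
    # The denominator shrinks the multiplier when a big favorite blows out
    # a weaker team, as described in the MOV discussion above.
    return math.log(point_diff + 1) * 2.2 / (winner_elo_diff * 0.001 + 2.2)

def update(rating_w, rating_l, point_diff, k=20.0):
    exp_w = expected_score(rating_w, rating_l)
    mult = mov_multiplier(point_diff, rating_w - rating_l)
    delta = k * mult * (1.0 - exp_w)
    return rating_w + delta, rating_l - delta
```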