The USA Ultimate algorithm isn't perfect; here's another way.
September 3, 2015 by Wally Kwong in Analysis with 28 comments
Among other sports, I’m a fan of college hockey. And, surprisingly, college hockey shares many of the same issues as ultimate when determining postseason bids: each conference receives one automatic bid, and mathematical formulas1 are used to determine the other 12 teams.2
College hockey also suffers from one of the main problems that plagues ultimate algorithms: a lack of interconnectivity between teams. So I decided to look at the formulas and ranking system adopted in NCAA hockey and seeing if they can be used to improve ultimate rankings.
One of the college hockey formulas, the Bradley-Terry system, is based on a stipulation that the probability of a team with rating A defeating a team with rating B is given by A / (A + B).
For example, NationalsTeam has a rating of 4.2. SectionalsTeam has a rating of 0.8. The likelihood of NationalsTeam beating SectionalsTeam is 4.2 / (4.2 + 0.8) = 0.84, or 84%.
But how do we figure out a rating for a single team? Let’s say we have two teams with “known” ratings: RegionalsTeam rated 1.7 and SectionalsTeam rated 0.8. We are trying to figure out the rating for UnknownTeam, which has a 2-3 record against RegionalsTeam and a 2-1 record against SectionalsTeam. We’ll guess an initial rating of 1.0 for UnknownTeam. Using this rating with the Bradley-Terry equation, in the 5 games against RegionalsTeam and 3 games against SectionalsTeam we would expect a win total of:
5 * 1.0 / (1.0 + 1.7) + 3 * 1.0 / (1.0 + 0.8)
which comes out to 3.52 games.
The fact that, in reality, UnknownTeam won a total of 4 games (instead of 3.52) means the rating for UnknownTeam needs to be adjusted up a little bit.3 Using the new adjusted rating, we can again calculate the expected wins of UnknownTeam, and then adjust the rating again.
This enables us to calculate the rating for a single team, but we aren’t adjusting ratings for RegionalsTeam and SectionalsTeam — in the example above, those ratings are held constant. To simultaneously generate ratings for all teams playing during the regular season, I followed this process:
- Give every team an initial estimated rating of 1.0.
- Using current estimated team ratings, calculate adjustments for each team using their win-loss record.4
- Apply the calculated ratings adjustments for each team to generate new estimated ratings.
- Repeat steps 2-3 until the maximum adjustment is tiny5
My initial attempt generated rankings which made no sense: undefeated teams were rewarded way too much, even when using methods to account for this weakness.
However, ultimate has an aspect that can be leveraged to generate a larger data set. Each point of a game can be considered its own “mini-game.” So if a game finished with a score of 15-8, this would count as 23 “mini-games.” Now we’ve got a far larger data set to evaluate. A team’s rating under this system is no longer related to the likelihood of winning an entire ultimate game, but the likelihood of them winning any given point in a game.
Using this “per-point” method, the 2015 USAU college regular season6 gives the following ratings (Men’s on left, Women’s on right):
Using the above ratings to determine bids to 2015 College Nationals would result in fairly similar results to the actual allotted bids under the USAU algorithm.
On the Men’s side, the sole difference is the strength bid awarded to Maryland and the Atlantic Coast region under the current USAU algorithm is instead awarded to Minnesota and the North Central region. Cincinnati would still have earned its strength bid for the Ohio Valley region.
On the Women’s side, one of the strength bids awarded by the current USAU algorithm to the South Central region is shifted to the Northeast region via Northeastern. The question of which South Central team “lost” the bid is up for debate; in the USAU rankings, Colorado College is ranked above Kansas, while the opposite is true if the above rankings are used.
Just for kicks, I’ve included the ratings using the above algorithm as applied to the 2015 USAU club season (using scores posted as of August 24, 2015). The bids would be as follows.
1 bid: Northeast
2 bids: Great Lakes, Mid-Atlantic, Northwest, South Central, Southeast, and Southwest
3 bids: North Central
Compared to the actual bids assigned, the Bradley-Terry system would result in the following bid “changes”: NE: -1; SW +1
|Team||Rating||Strength of Schedule|
|Ring of Fire||2.444||2.504|
|Galaxy Swag Universe||1.923||1.490|
1 bid: Great Lakes, Mid-Atlantic, and North Central
2 bids: South Central, Southeast, and Southwest
3 bids: Northwest
4 bids: Northeast
Compared to the actual bids assigned, the Bradley-Terry system results in no bid changes.
|Team||Rating||Strength of Schedule|
|Green Means Go||1.438||1.427|
|Colorado Small Batch||1.034||0.582|
1 bid: Great Lakes, Mid-Atlantic, and Southeast
2 bids: Northwest and South Central
3 bids: North Central, Northeast, and Southwest ((Excluding Union and Bessarion for the NE — including them would give the NE 4 bids and the NW 1 bid)).
Compared to the actual bids assigned by the USA Ultimate algorithm, the Bradley-Terry system would result in the following bid “changes”: MA: -1, NC +1, NE -1, SW +1
|Team||Rating||Strength of Schedule|
|The Chad Larson Experience||3.424||2.071|
|Mental Toss Flycoons||2.026||1.483|
|The Muffin Men||1.883||1.110|
Other Advantages to the Bradley-Terry System
Several related values can be calculated from the ratings, such as strength of schedule. Also, the probability of a winning any given game can be calculated from the probability of winning a single point7.
Expected point scores can also be calculated, along with expected point spreads. For example, Fury (rating 4.380) and Molly Brown (rating 3.414) would have the following probabilities of outcome:
|Fury Score||Molly Brown Score||Probability|
|Up to 11||15||10.98%|
|15||Up to 7||17.99%|
Under this model, Fury is expected to win 75.7% of its games with Molly Brown, with an expected margin of victory of 3.1.
Any model will have caveats. This one is no different. Each point is assumed to be played identically. There’s no consideration of offensive and defensive lines, team depth, or cap limits. If only considering effects on winning an overall game, trading points and offensive/defensive lines will serve to reduce variance and decrease the probability of a potential upset.
Capping a game and reducing its length will increase the probability of an upset.8 Of course, many of these limitations apply to the USAU algorithm as well.
The main advantage this system has over the USAU algorithm is a team rating conveys additional information about the expected outcome of a game. As shown above, win percentage and point spreads can be calculated. Also, the USAU algorithm has been shown to provide a disproportionate benefit to defeating a similarly rated team. Under the Bradley-Terry model, playing a close game against a similarly rated team will not greatly affect either team’s rating.9
If you are interested in an even more technical discussion, here is the full research paper.
Stephen Wang was immensely helpful in discussions I had with him. He suggested that I treat each point as a “mini-game”, and also steered me towards using numerical methods rather than trying to invert an enormous matrix.
There are several unofficial rankings algorithms in use for college hockey, but I took one (the Bradley-Terry system) that I find the most interesting and applied it to ultimate. ↩
In calculating the ratings adjustment for Newton’s method, I only considered the derivative of a team’s rating and win total with respect to its own rating, and ignored the differential effect of adjusting opposing teams’ ratings. ↩
less than 0.0000001 for any team in the ratings. ↩
Data was scraped off the USAU website, so some data may be included / excluded that was not used in the “official” USAU calculation. ↩
using a modified binomial distribution ↩
The details drawn from the model are likely skewed as well. The probability of a blowout win is likely overstated. Teams may have a more open rotation in such a scenario in order to rest starting players. Team depth would become more of a factor, and explicitly accounting for team depth is not done in this model. ↩
I haven’t done a full analysis on the effects of playing a higher/lower rated opponent on a team’s rating, but I don’t think it will affect it because, unlike the USAU system, the winner does not get an automatic +125 points. ↩