Forecasting Nationals and Ranking Teams: Computers v. Power Rankings

Want to know what to expect at Nationals this week? Who will do well, who will tank, and who might win? Is it enough to check out the Ultiworld power rankings at the top of the screen and read our tournament previews?

Maybe. Over the past several years, subjective rankings of teams, like the Ultiworld power rankings, have generally predicted Men’s Nationals performance better than the computer-calculated statistical algorithms hosted by USA Ultimate. However, that result is much less robust when Women’s teams are taken into account.

Methodology and Men’s Results

We conducted our analysis by first looking at men’s teams at the past three major championships.1 While three championships is a tiny sample size from which to draw firm conclusions, subjective ranking of elite teams is a relatively new development. It’s the best we can do until we can update our analysis in the next few years.

In two of the three championships examined, a subjective ranking performed better than all of the objective measures. Skyd Magazine’s final April rankings of the 2012 College teams were incredibly accurate: the teams that Skyd ranked 1 through 4 all finished in the top 4. For the 2013 College Championships, Skyd didn’t produce an April ranking, but Ultiworld did. While the Ultiworld 2013 college power rankings were “good” rather than amazing predictors, they still did better than the computer-based alternatives.

The 2012 Club Championships were much less friendly to the subjective Skyd rankings.2 Skyd ranked Johnny Bravo 3rd that year, Chain 4th, Rhino 5th, and Doublewide 9th. Doublewide went on to win the Club Championships, while Bravo and Rhino both missed the quarters.3

Our approach was a bit more advanced than simply cycling through a few teams at the past few championships.  Brainstorming the potential subjective rankings was an easy first step, as Skyd and Ultiworld are the only real options.

But the number of potential objective rankings is a bit larger. There are the final USAU rankings, a 1-20 ranking that determines bid allocation. Those rankings are generated from a PR score that is also posted on USAU’s website. The PR score is different from the older RRI ranking that is linked from teams’ Score Reporter pages. You can also draw from a team’s finish at the previous championships when formulating a prediction.

We then looked to see how well each ranking system performed in each year that it was available immediately prior to Nationals.4 We correlated the rankings with Nationals performance and also ran linear regression analysis. If a team wasn’t ranked in a pre-Nationals ranking, it was dropped from the analysis.
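
For readers who want to replicate the general approach, here is a minimal sketch (in Python, with made-up numbers rather than our actual data) of how a single ranking system can be scored against Nationals results: correlate pre-Nationals ranks with final finish, fit a simple regression, and drop any team missing from the pre-Nationals ranking.

```python
# A minimal sketch, not our actual analysis code; the data below are invented.
import numpy as np
from scipy import stats

# Hypothetical pre-Nationals ranks and actual Nationals finishes, one entry per
# team; None marks a team that did not appear in the pre-Nationals ranking.
pre_rank = [1, 2, 3, None, 5, 6, 7, 8]
finish = [3, 1, 2, 5, 9, 4, 7, 11]

# Drop teams that were unranked before Nationals.
pairs = [(r, f) for r, f in zip(pre_rank, finish) if r is not None]
x = np.array([p[0] for p in pairs], dtype=float)
y = np.array([p[1] for p in pairs], dtype=float)

# Rank correlation between the predicted order and the actual finish order.
rho, _ = stats.spearmanr(x, y)

# Simple linear regression: Nationals finish ~ pre-Nationals rank.
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print(f"Spearman rho = {rho:.2f}, regression R^2 = {r_value ** 2:.2f}")
```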

When it comes to the objective rankings, different objective measures performed better in different years. This suggests that even under the umbrella of objective or computer rankings, it’s hard to know what to trust.5

What about Predictions on the Women’s Side?

My sense is that predicting performance in the women’s game, at least at the club level, is slightly different. First, there has been Fury-Riot dominance at the top. Second, there is a school of thought that the women’s playing field “plays bigger” than the men’s field because the cutters are slower and the throws travel a bit less far. You can argue from this that the odds of upsetting a more talented women’s team are lower than what we might see in the men’s division.

We followed a similar methodology on data from the 2012 Women’s College Championship and the 2013 Club and College Championships.

We found that the objective computer-based rankings performed better in the women’s division than in the men’s. Ultiworld had a strong prediction at the 2013 Women’s College Championship, but the Skyd rankings had poor predictive power in all three events. The Ultiworld prediction was also best at predicting how the worst teams would finish, which is probably the opposite of what you want from a forecasting system.6

In general, I suspect that the website-driven power rankings better predict men’s results because the rankers spend more time following the men’s division. Another possibility is that the subjective rankings just got luckier on the men’s side and aren’t actually better predictors than the ranking systems. Notably, the computer ranking systems predicted about equally well in both divisions; it was the subjective power rankings that varied so widely.

When all of the men’s and women’s data is combined, the power-ranking approach is a slightly better predictor than the primary USAU ranking systems. Both systems make better predictions when you incorporate information from the previous year’s nationals. One surprisingly good prediction system is to combine a team’s RRI numerical ranking with its power ranking.7
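
As a rough illustration of that last point, the sketch below (again with invented numbers, not our fitted model) regresses Nationals finish on both an RRI rank and a power ranking; footnote 7’s observation corresponds to the two fitted coefficients coming out roughly equal.

```python
# A minimal sketch of a two-predictor combination model; the data are made up.
import numpy as np

# Hypothetical historical data, one row per team-season.
rri_rank = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
power_rank = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
finish = np.array([1, 3, 2, 5, 7, 4, 6, 9], dtype=float)

# Ordinary least squares: finish ~ intercept + RRI rank + power ranking.
X = np.column_stack([np.ones_like(rri_rank), rri_rank, power_rank])
beta, *_ = np.linalg.lstsq(X, finish, rcond=None)
intercept, b_rri, b_power = beta

# If b_rri and b_power are roughly equal, the combined prediction behaves like
# a simple average of the two ranks.
print(f"intercept={intercept:.2f}, RRI coef={b_rri:.2f}, power coef={b_power:.2f}")
```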

Why Do Websites Perform Better?  What Theories Underlie Each System?

Subjective systems, like Ultiworld’s power rankings, have a clear informational advantage over the objective ones. On a very simple level, a subjective ranker can look at the RRI ranking, the PR ranking, and last year’s Nationals results when formulating a forecast. They also know a bit about injuries, or whether mid-season personnel or strategic adjustments mean that a team’s early results will be poor predictors of postseason success.

The power rankers also get to see each team play and understand the teams beyond simple score lines; this could be especially valuable in Ultimate because teams don’t play a lot of games in the regular season.8

Objective rankings have psychological advantages that shouldn’t be overlooked. Humans are known to be biased by availability and popularity. The Ultiworld Power Rankings reflect greater familiarity with East Coast teams, while Skyd may be a bit more familiar with West Coast teams (see the Rhino #4 ranking in 2012).

The familiarity bias goes beyond geography, though: a team that impressed recently will likely weigh more heavily in a person’s mind than a similar performance early in the season. And people tend to weigh wins and losses more heavily than scoring differentials, even though the latter is generally more predictive of future success in sports. Bias is a psychological term of art in this context: it’s not as though recent performance isn’t important, or that wins don’t matter – they do. But humans sometimes can’t help but slightly overrate those factors when formulating their own rankings and predictions.

What Makes For An Even Better Forecast?

I see one common weakness in all of the ranking systems, both subjective and objective: all of them are insufficiently Bayesian. Bayesian statistics is a complex field, but the intuition here is very simple: every piece of information matters – but only to a certain extent.

You should start each season with the baseline expectation that last year’s top finishers (and previous powerhouses) will be the best.  Off-season roster moves and player development should update this expectation, but shouldn’t do so too much: The best teams often have great systems in place that absorb roster additions and player losses better than the average Nationals team.
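
To make that intuition concrete, here is a toy sketch of tempered updating; the blending weight is invented for illustration and is not calibrated to anything.

```python
# A toy sketch of Bayesian-style tempering: blend a preseason prior with new
# in-season evidence instead of replacing it. The weight is an assumption.
def update_rating(prior_rating: float,
                  observed_rating: float,
                  evidence_weight: float = 0.3) -> float:
    """Move the estimate toward new evidence, but only part of the way."""
    return (1 - evidence_weight) * prior_rating + evidence_weight * observed_rating

# Example: a team rated 90 on last year's form looks like a 70 at one early
# tournament. The updated estimate shifts, but only modestly.
print(update_rating(90.0, 70.0))  # 84.0
```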

For example, the number one problem with the Ultiworld club rankings is that they tend to over-respond to the most recent tournament and to value head-to-head games too much.9 Both habits overvalue certain pieces of information. Instead, you should draw almost no ranking conclusion from a close head-to-head game, for two simple reasons. First, head-to-head matchups are usually a sample size of one game – and anything can happen in a single game. Second, by halfway through the season, you’ll be completely unable to rank on this criterion: at the Chesapeake Invite, Ironside beat GOAT – but only by two points. That suggests to me that the teams are almost even, and one weekend later GOAT won the Pro Flight Finale while Ironside finished last.
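
A quick back-of-the-envelope calculation shows why a single head-to-head result is so noisy. Suppose, purely for illustration, that the better team wins any given game 60% of the time:

```python
# Probability that the better team still loses a majority of n games, assuming
# (as a stand-in number) it wins any single game with probability p_win.
from math import comb

def prob_better_team_loses(p_win: float, n_games: int) -> float:
    losses_needed = n_games // 2 + 1
    return sum(
        comb(n_games, k) * (1 - p_win) ** k * p_win ** (n_games - k)
        for k in range(losses_needed, n_games + 1)
    )

print(prob_better_team_loses(0.60, 1))  # 0.40 -- one game misleads you often
print(prob_better_team_loses(0.60, 7))  # ~0.29 -- more games, fewer false signals
```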

Supporting the theory that the subjective rankings were over-responding to recent events is the fact that all systems are better predictors when the regression includes prior Nationals finish. It would be unwise to read too much into this small sample, but I think the result supports my theory that power rankers read too much into recent results – even at the expense of ignoring valuable information from prior seasons.10

Another adjustment that would improve these rankings is to ditch the simple 1-through-20 format. Tiering or scoring the teams would give a more accurate representation of what the subjective ranker actually thinks. For example, Ultiworld could assign each elite team a score between 0 and 100 reflecting the odds that that club team would win the Championships.

The current rankings might have Doublewide, Revolver, GOAT, and Bravo each with a score between 15 and 20: it’s likely that one of those teams will win it, but we are also expressing a lot of uncertainty as to which of the four is best. The system could then give a number 5 team (like Machine) a score of 6, expressing that there is a bit of a gap between the top 4 and the next 4 or 5 teams.
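
Here is what that score-based format might look like in practice. The team names come from the example above; the numbers are illustrative guesses, not Ultiworld’s actual assessments.

```python
# Illustrative title odds (in %), expressing tiers rather than a strict 1-20 order.
title_odds = {
    "Doublewide": 19,
    "Revolver": 18,
    "GOAT": 16,
    "Bravo": 15,
    "Machine": 6,
    # ...the remaining teams would share what's left of the probability...
}

# Unlike a simple ranked list, the scores show both the order and the size of
# the gaps: the top four are nearly interchangeable, then there's a clear drop.
for team, odds in sorted(title_odds.items(), key=lambda kv: -kv[1]):
    print(f"{team:>10s}: {odds:>2d}%")
```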

It’s somewhat ironic that we’ve always had computer rankings in Ultimate, but Skyd and Ultiworld rankings are new.  For most other sports, it’s been the other way around. But expect to see more research into ranking systems in the future.

In the NBA, there are various power rankings and at least one widespread computer algorithm (by former ESPN writer John Hollinger). One Reddit thread ranked 21 NFL power rankings and found a numbers-driven analytical system to be the best at ranking NFL teams, yet found other analytical systems to be the worst of all. Oddly enough, the top-performing system there was also the most indecisive and moved teams around the most, which cuts against my critique that human rankers are too spur-of-the-moment.11 Altogether, this means that we’ll know more in both Ultimate and other sports within a few years – though I also wouldn’t be surprised if some readers add more insight than I have in the comments below.

What About This Year?

I pulled the current Ultiworld Power Rankings and the RRI rankings from USAU’s website. The tables below show each team’s projected finish based on the coefficients from two regression equations: one combines RRI rank with Ultiworld Power Ranking, while the other combines last year’s Nationals finish with Ultiworld Power Ranking. Each table is sorted by an additional column that averages those two combination predictors. There are a few issues with the tables: as you can see, the projections are compressed and no team is actually projected to finish 1st or 16th. But keep an eye on the numerical differences as much as the actual rankings; both projections see the two favorites in each division as evenly matched (Fury/Riot and Doublewide/Revolver), with a third team (Scandal and GOAT) a bit less likely to finish first.
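
For transparency, the sketch below shows the general shape of the calculation behind the tables; the coefficients are placeholders rather than the actual fitted values.

```python
# A structural sketch of the combined projection; intercept and coef are
# placeholder values, not the coefficients used to build the tables below.
def project(rank_a: float, rank_b: float,
            intercept: float = 2.0, coef: float = 0.45) -> float:
    """Projected finish from two roughly equally weighted rank inputs."""
    return intercept + coef * rank_a + coef * rank_b

def combined_projection(power_rank: float, rri_rank: float,
                        last_year_finish: float) -> float:
    """Average the two regression-style projections used in the tables."""
    p_rri = project(power_rank, rri_rank)
    p_last = project(power_rank, last_year_finish)
    return round((p_rri + p_last) / 2, 1)

# Example with arbitrary inputs: power rank 1, RRI rank 2, last year's finish 2.
print(combined_projection(1, 2, 2))
```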

Women’s Division

| Team Name | Ultiworld Power Ranking | RRI Rank | Last Year Finish at Nationals | Ultiworld + Last Year | Ultiworld + RRI | Both Projections (Averaged) |
| --- | --- | --- | --- | --- | --- | --- |
| Seattle Riot | 1 | 2 | 2 | 2.6 | 2.8 | 2.7 |
| San Francisco Fury | 2 | 1 | 1 | 2.7 | 2.9 | 2.8 |
| DC Scandal | 3 | 3 | 3.5 | 3.8 | 4.0 | 3.9 |
| Chicago Nemesis | 4 | 4 | 6 | 4.9 | 4.7 | 4.8 |
| Vancouver Traffic | 6 | 5 | 9 | 6.6 | 5.9 | 6.2 |
| Boston Brute Squad | 5 | 6 | 12 | 7.0 | 5.8 | 6.4 |
| Atlanta Ozone | 7 | 8 | 8 | 6.7 | 7.3 | 7.0 |
| Austin Showdown | 9 | 7 | 3.5 | 6.4 | 7.7 | 7.1 |
| Denver Molly Brown | 10 | 10 | 5 | 7.2 | 9.2 | 8.2 |
| Madison Heist | 8 | 11 | 11 | 8.0 | 8.7 | 8.3 |
| Toronto Capitals | 11 | 9 | 7 | 8.1 | 9.2 | 8.7 |
| New York Bent | 12 | 12 | NA | NA | 10.7 | 10.7 |
| San Francisco Nightlock | 14 | 14 | 10 | 10.2 | 12.2 | 11.2 |
| Raleigh Phoenix | 15 | 13 | 13 | 11.4 | 12.2 | 11.8 |
| Portland Schwa | 13 | 16 | 16 | 11.4 | 12.4 | 11.9 |
| Quebec Nova | 16 | 15 | NA | NA | 13.3 | 13.3 |

Men’s Division

| Team Name | Ultiworld Power Ranking | RRI Rank | Last Year Finish at Nationals | Ultiworld + Last Year | Ultiworld + RRI | Both Projections (Averaged) |
| --- | --- | --- | --- | --- | --- | --- |
| San Francisco Revolver | 2 | 1 | 2 | 3.0 | 2.9 | 2.9 |
| Austin Doublewide | 1 | 5 | 1 | 2.3 | 3.8 | 3.1 |
| Toronto GOAT | 3 | 3 | 6 | 4.5 | 4.0 | 4.2 |
| Chicago Machine | 5 | 2 | 5 | 5.1 | 4.4 | 4.7 |
| Boston Ironside | 7 | 4 | 3.5 | 5.5 | 5.9 | 5.7 |
| Denver Johnny Bravo | 4 | 7 | 12 | 6.5 | 5.7 | 6.1 |
| Seattle Sockeye | 6 | 6 | 7.5 | 6.2 | 6.2 | 6.2 |
| Raleigh Ring of Fire | 9 | 9 | 3.5 | 6.4 | 8.4 | 7.4 |
| Atlanta Chain Lightning | 11 | 8 | 7.5 | 8.3 | 8.9 | 8.6 |
| New York PoNY | 8 | 10 | 16 | 9.3 | 8.4 | 8.8 |
| Minneapolis Sub Zero | 10 | 12 | 13.5 | 9.5 | 9.9 | 9.7 |
| Washington DC Truck Stop | 13 | 13 | 13.5 | 10.7 | 11.4 | 11.1 |
| Vancouver Furious George | 14 | 15 | 11 | 10.5 | 12.5 | 11.5 |
| Santa Barbara Condors | 16 | 11 | NA | NA | 11.9 | 11.9 |
| Florida United | 15 | 14 | NA | NA | 12.6 | 12.6 |

  1. This includes the 2012 and 2013 College Championships and the 2012 Club Championships 

  2. Ultiworld did not have club power rankings in 2012 

  3. The only good predictor of teams’ performance at the 2012 Club Championships? Their 2011 championship finish – Ironside, Revolver, Doublewide, and Chain finished in the top 4 in 2011, and three of them finished in the top 4 again in 2012. 

  4. Note the RRI score includes Nationals results, which is problematic 

  5. Again, prior year finish did especially well at the 2012 club championship. Final USAU ranking was a strong predictor of 2012 college championship performance, but was much weaker at the 2013 college championships. 

  6. Ideally, you might prefer to predict the semifinalists and finalists perfectly rather than the teams that finish 12-20 

  7. The two coefficients are about equally weighted in the regression model 

  8. More games and a larger sample size may be more important for the computer algorithms than for humans 

  9. I do have some personal knowledge of how the rankings are formulated 

  10. Let me be a bit more clear about my methodology and thinking here.  To some extent, adding every ranking system increases a model’s overall fit (R-squared).  But this is a bit theoretically unsound for a variety of reasons.  First, there’s lots of covariation amongst the systems; teams that rank high in RRI are likely to rank high in the USAU rankings and also in the Skyd or Ultiworld power rankings.  You generally don’t want to model lots of independent variables together that correlate strongly with each other.  Second, with a sample size of only three championships, there is a strong danger of overfitting a model to past data in a way that makes it a worse predictor of future championships.  There was also a nifty heuristic for keeping a mental check on whether adding more variables to the model exceeded reason: at a certain point, the sign of one of the variables would flip, meaning that teams ranked worse by one variable were predicted to perform better – that’s a good indication there’s too much in the model. 

  11. Hat tip to Jimmy Leppert

  1. Sean Childers

    Sean Childers is Ultiworld's Editor Emeritus. He started playing ultimate in 2008 for UNC-Chapel Hill Darkside, where he studied Political Science and Computer Science before graduating from NYU School of Law. He has played for LOS, District 5, Empire, PoNY, Truck Stop, Polar Bears, and Mischief (current team). You can email him at [email protected].

  2. Max Cohen

    Max Cohen is an Ultiworld statistics contributor. He is a captain of NYU's open team, Purple Haze. He lives in New York City.
