Forecasting Nationals and Ranking Teams: Computers v. Power Rankings

Want to know what to expect at Nationals this week? Who will do well, who will tank, and who might win? Is it enough to check out the Ultiworld power rankings at the top of the screen and read our tournament previews?

Maybe. Over the past several years, subjective rankings of teams, like the Ultiworld power rankings, have generally predicted Men’s Nationals performance better than the computer-calculated statistical algorithms hosted by USA Ultimate. However, that result is much less robust when Women’s teams are taken into account.

Methodology and Men’s Results

We conducted our analysis by first looking at men’s teams at the past three major championships.[1] While three championships is a tiny sample from which to draw firm conclusions, subjective ranking of elite teams is a relatively new development. It’s the best we can do until we can update our analysis in the next few years.

In two of the three championships examined, a subjective ranking performed better than all of the objective measures. Skyd Magazine’s final April rankings of the 2012 College teams were incredibly accurate: the teams that Skyd ranked 1 through 4 all finished in the top 4. For the 2013 College Championships, Skyd didn’t produce an April ranking, but Ultiworld did. While the Ultiworld 2013 college power rankings were “good” rather than amazing predictors, they still did better than the computer-based alternatives.

The 2012 Club Championships were much less friendly to the subjective Skyd rankings.[2] Skyd ranked Johnny Bravo 3rd that year, Chain 4th, Rhino 5th, and Doublewide 9th. Doublewide went on to win the Club Championships, while Bravo and Rhino both missed the quarters.[3]

Our approach was a bit more advanced than simply cycling through a few teams at the past few championships.  Brainstorming the potential subjective rankings was an easy first step, as Skyd and Ultiworld are the only real options.

But the number of potential objective rankings is a bit larger. There are the final USAU rankings, a 1-20 ranking that determines bid allocation. Those rankings are generated from a PR score that is also posted on USAU’s website. The PR score is different from the older RRI ranking that is linked from teams’ score reporter pages. You can also draw from a team’s finish at the previous championships when formulating a prediction.

We then looked at how well each ranking system performed in each year that it was available immediately prior to Nationals.[4] We correlated the rankings with Nationals performance and also ran a linear regression analysis. If a team wasn’t ranked in a pre-Nationals ranking, it was dropped from the analysis.
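To make that method concrete, here is a minimal sketch in Python. It assumes a Pearson correlation and a one-variable least-squares line; the article does not say which correlation coefficient or software was actually used, and the team names and numbers below are invented purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def fit_line(xs, ys):
    """Ordinary least squares for: finish = intercept + slope * rank."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return mean_y - slope * mean_x, slope

# Hypothetical data, NOT real results: (pre-Nationals rank or None, Nationals finish).
teams = {
    "Team A": (1, 2),
    "Team B": (2, 1),
    "Team C": (3, 5),
    "Team D": (4, 3),
    "Team E": (None, 4),  # unranked before Nationals
    "Team F": (5, 6),
}

# Drop teams that the ranking system did not rank, as described above.
ranked = [(rank, finish) for rank, finish in teams.values() if rank is not None]
ranks = [rank for rank, _ in ranked]
finishes = [finish for _, finish in ranked]

r = pearson(ranks, finishes)
intercept, slope = fit_line(ranks, finishes)
print(f"correlation={r:.2f}, finish = {intercept:.2f} + {slope:.2f} * rank")
```

A correlation near 1 and a slope near 1 would mean the ranking tracks actual finish closely. With only a handful of teams per event, either number can swing a lot from one championship to the next.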

When it comes to the objective rankings, different objective measures performed better in different years. This suggests that even under the umbrella of objective or computer rankings, it’s hard to know what to trust.[5]

What about Predictions on the Women’s Side?

My sense is that predicting performance in the women’s game, at least at the club level, is slightly different. First, there has been Fury-Riot dominance at the top. Second, there is a school of thought that the women’s playing field “plays bigger” than the men’s field because the cutters are slower and the throws a bit shorter. You can argue from this that the odds of upsetting a more talented women’s team are lower than what we might see in the men’s division.

We followed a similar methodology on data from the 2012 Women’s College Championship and the 2013 Club and College Championships.

We found that the objective computer-based rankings performed better in the women’s division than in the men’s. Ultiworld had a strong prediction at the 2013 Women’s College Championship, but Skyd rankings had poor predictive power in all three events. The Ultiworld prediction was also best at predicting how the worst teams would finish, which is probably the opposite of what you want from a forecasting system.[6]

In general, I suspect that the website-driven power rankings better predict men’s results because their authors spend more time following the men’s division. Another possibility is that the subjective rankings just got luckier on the men’s side and aren’t actually better predictors than the computer ranking systems. Notably, the computer ranking systems predicted about equally well in both divisions; it was the subjective power rankings that varied so widely.

When all of the men’s and women’s data is combined, the power-ranking approach is a slightly better predictor than the primary USAU ranking systems. Both systems make better predictions when you incorporate information from the previous year’s nationals. One surprisingly good prediction system is to combine a team’s RRI numerical ranking with its power ranking.[7]

Why Do Websites Perform Better?  What Theories Underlie Each System?

Subjective systems, like Ultiworld’s power rankings, have a clear informational advantage over the objective ones. On a very simple level, a subjective ranker can look at the RRI ranking, the PR ranking, and last year’s nationals results when formulating a forecast. Rankers also know a bit about the injury report, and whether any mid-season personnel or strategic adjustments mean that a team’s early results will be poor predictors of postseason success.

The power rankers also get to see each team play and understand the teams beyond simple score lines; this could be especially valuable in Ultimate because teams don’t play a lot of games in the regular season.[8]

Objective rankings have psychological advantages that shouldn’t be understated. Humans are known to be biased based on availability and popularity. The Ultiworld Power Rankings are based on more familiarity with East Coast teams, while Skyd may be a bit more familiar with West Coast teams (see Skyd’s high 2012 ranking of Rhino, which went on to miss the quarters).

The familiarity bias goes beyond geography, though: a recent impressive performance will likely weigh more heavily in a person’s mind than a similar performance early in the season. And people tend to weight wins and losses more heavily than scoring differentials, even though the latter is generally more predictive of future success in sports. Bias is a psychological term of art in this context: it’s not as though recent performance isn’t important, or that wins don’t matter; they do. But humans sometimes can’t help but slightly overrate those factors when formulating their own rankings and predictions.

What Makes For An Even Better Forecast?

I see one common weakness in all of the ranking systems, both subjective and objective.  All of the ranking systems are insufficiently Bayesian.  Bayesian statistics is a complex term but the intuition is very simple: Every piece of information matters – but only to a certain extent.

You should start each season with the baseline expectation that last year’s top finishers (and previous powerhouses) will be the best.  Off-season roster moves and player development should update this expectation, but shouldn’t do so too much: The best teams often have great systems in place that absorb roster additions and player losses better than the average Nationals team.
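One way to picture “every piece of information matters, but only to a certain extent” is a precision-weighted update, sketched below in Python. This is a standard normal-normal Bayesian update chosen for illustration; it is not how any of the rankings discussed here are actually computed, and the strength scale and all of the numbers are hypothetical.

```python
def update(prior_mean, prior_var, observed, obs_var):
    """Precision-weighted (normal-normal) update of a team strength estimate.

    Each input is weighted by 1/variance, so a noisy observation moves the
    estimate only part of the way from the prior toward the new result.
    """
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    post_var = 1.0 / (prior_precision + obs_precision)
    post_mean = post_var * (prior_mean * prior_precision + observed * obs_precision)
    return post_mean, post_var

# Hypothetical scale: points better than an average Nationals team.
mean, var = 3.0, 4.0  # preseason belief, anchored on last year's finish

# One noisy tournament where the team looked merely average (0.0).
mean, var = update(mean, var, observed=0.0, obs_var=12.0)
print(f"updated estimate: {mean:.2f} (variance {var:.2f})")
```

In this sketch the estimate slides from 3.0 to about 2.25 instead of collapsing to 0. The bad weekend counts, but last year’s evidence still carries most of the weight until more results come in.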

For example, the number one problem with the Ultiworld club rankings is that they tend to overreact to the most recent tournament and value head-to-head games too much.[9] This overvalues certain pieces of information. Instead, you should draw almost no ranking conclusion from a close head-to-head game, for two simple reasons. First, head-to-head results usually rest on a sample of one game, and anything can happen in a single game. Second, by halfway through the season, you’ll be completely unable to rank on this criterion. At the Chesapeake Invite, Ironside beat GOAT, but only by two points. That suggests to me that the teams are almost even, and one weekend later GOAT won the Pro Flight Finale, where Ironside finished last.

Supporting the theory that the subjective rankings overreact to recent events is the fact that all systems are better predictors when the regression includes prior Nationals finish. It would be unwise to read too much into this small sample, but I think the result supports my theory that power rankers read too much into recent results, even at the expense of ignoring valuable information from prior seasons.[10]
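For readers who want to try this kind of model themselves, the following Python sketch fits a regression with two predictors (a power ranking and prior Nationals finish) and applies the sign-flip check described in the footnotes. The function names and the data are hypothetical; this is an assumption about what such a model could look like, not the code behind the results reported here.

```python
def fit_two(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2 using centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def flipped(coefs):
    """Predictors with a negative coefficient, i.e. 'worse rank, better finish'."""
    return [name for name, value in coefs.items() if value < 0]

# Hypothetical data, NOT real results.
power_rank = [1, 2, 3, 4, 5, 6]
prior_finish = [2, 1, 5, 3, 6, 4]
finish = [1, 2, 4, 3, 6, 5]

a, b_power, b_prior = fit_two(power_rank, prior_finish, finish)
suspect = flipped({"power_rank": b_power, "prior_finish": b_prior})
print(f"finish = {a:.2f} + {b_power:.2f}*power + {b_prior:.2f}*prior; flipped: {suspect}")
```

If `suspect` comes back non-empty, the model is claiming that a worse ranking predicts a better finish, which is a hint that the predictors overlap too much or that the model has more variables than the sample can support.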

Another adjustment that would improve these rankings is to ditch the simple 1 through 20 team format. Tiering or scoring the teams would give a more accurate representation of what the subjective ranker actually thinks. For example, Ultiworld could assign each elite team a score from 0 to 100 that reflects the odds that the team will win the Championships.

The current rankings might have Doublewide, Revolver, GOAT and Bravo each with a score between 15 and 20: it’s likely that one of those teams would win it, but we are also expressing a lot of uncertainty as to which of those four is best. The system could have a number 5 team (like Machine) with a score of 6. This would express that there is a bit of a gap between the top 4 and the next 4 or 5 teams.
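Here is a small Python sketch of how such scores could be turned into tiers. The exact scores are illustrative picks inside the ranges mentioned above (15 to 20 for the top four, 6 for Machine), and the 5-point gap threshold is an arbitrary assumption.

```python
# Illustrative scores only: the text gives ranges, not exact values.
scores = {"Doublewide": 20, "Revolver": 19, "GOAT": 17, "Johnny Bravo": 15, "Machine": 6}

def tiers(scores, gap=5):
    """Group teams into tiers, starting a new tier when the score drops by more than `gap`."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ordered:
        return []
    result = [[ordered[0][0]]]
    for (_, prev_score), (name, score) in zip(ordered, ordered[1:]):
        if prev_score - score > gap:
            result.append([name])      # big drop: open a new tier
        else:
            result[-1].append(name)    # small drop: same tier
    return result

# Win odds left over for every team not listed.
remaining = 100 - sum(scores.values())

print(tiers(scores))
print(f"rest of the field: {remaining}")
```

With these made-up numbers the top four land in one tier and Machine starts the next, which is the gap a flat 1 through 5 list hides.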

It’s somewhat ironic that we’ve always had computer rankings in Ultimate, but Skyd and Ultiworld rankings are new.  For most other sports, it’s been the other way around. But expect to see more research into ranking systems in the future.

In the NBA, there are various power rankings and at least one widespread computer algorithm (by former ESPN writer John Hollinger). One Reddit thread ranked 21 NFL power rankings and found a numbers-driven analytical system to be the best at ranking NFL teams, yet found other analytical systems to be the worst of all. Oddly enough, the top performing system there was also the most indecisive and moved teams around the most, which goes against my critique that human rankers are too spur of the moment.[11] Altogether, this means that we’ll know more in both Ultimate and other sports within a few years, though I also wouldn’t be surprised if some people add more insight than I have in the comments below.

What About This Year?

I pulled the current Ultiworld Power Rankings and RRI rankings from USAU’s website. The table below shows each team’s projected finish based on the coefficients from two regression equations: one equation averages RRI ranks and Ultiworld Power Rankings, while the other averages last year’s nationals finish with Ultiworld Power Rankings. The list itself is sorted by an additional column that is an average of those last two combination predictors. There are a few issues with the table: as you can see, the functions are tight and no team is actually projected to finish 1st or 16th. But keep an eye on the numerical differences as much as the actual rankings; both series see the two favorites as evenly matched (Doublewide/Revolver and Fury/Riot), with a third team (GOAT and Scandal) a bit less likely to finish first.

Women’s Division

Team Name | Ultiworld Power Ranking | RRI Rank | Last Year Finish at Nationals | Ultiworld + Last Year | Ultiworld + RRI | Both Projections (Averaged)
Seattle Riot | 1 | 2 | 2 | 2.6 | 2.8 | 2.7
San Francisco Fury | 2 | 1 | 1 | 2.7 | 2.9 | 2.8
DC Scandal | 3 | 3 | 3.5 | 3.8 | 4.0 | 3.9
Chicago Nemesis | 4 | 4 | 6 | 4.9 | 4.7 | 4.8
Vancouver Traffic | 6 | 5 | 9 | 6.6 | 5.9 | 6.2
Boston Brute Squad | 5 | 6 | 12 | 7.0 | 5.8 | 6.4
Atlanta Ozone | 7 | 8 | 8 | 6.7 | 7.3 | 7.0
Austin Showdown | 9 | 7 | 3.5 | 6.4 | 7.7 | 7.1
Denver Molly Brown | 10 | 10 | 5 | 7.2 | 9.2 | 8.2
Madison Heist | 8 | 11 | 11 | 8.0 | 8.7 | 8.3
Toronto Capitals | 11 | 9 | 7 | 8.1 | 9.2 | 8.7
New York Bent | 12 | 12 | NA | NA | 10.7 | 10.7
San Francisco Nightlock | 14 | 14 | 10 | 10.2 | 12.2 | 11.2
Raleigh Phoenix | 15 | 13 | 13 | 11.4 | 12.2 | 11.8
Portland Schwa | 13 | 16 | 16 | 11.4 | 12.4 | 11.9
Quebec Nova | 16 | 15 | NA | NA | 13.3 | 13.3

Men’s Division

Team Name | Ultiworld Power Ranking | RRI Rank | Last Year Finish at Nationals | Ultiworld + Last Year | Ultiworld + RRI | Both Projections (Averaged)
San Francisco Revolver | 2 | 1 | 2 | 3.0 | 2.9 | 2.9
Austin Doublewide | 1 | 5 | 1 | 2.3 | 3.8 | 3.1
Toronto GOAT | 3 | 3 | 6 | 4.5 | 4.0 | 4.2
Chicago Machine | 5 | 2 | 5 | 5.1 | 4.4 | 4.7
Boston Ironside | 7 | 4 | 3.5 | 5.5 | 5.9 | 5.7
Denver Johnny Bravo | 4 | 7 | 12 | 6.5 | 5.7 | 6.1
Seattle Sockeye | 6 | 6 | 7.5 | 6.2 | 6.2 | 6.2
Raleigh Ring of Fire | 9 | 9 | 3.5 | 6.4 | 8.4 | 7.4
Atlanta Chain Lightning | 11 | 8 | 7.5 | 8.3 | 8.9 | 8.6
New York PoNY | 8 | 10 | 16 | 9.3 | 8.4 | 8.8
Minneapolis Sub Zero | 10 | 12 | 13.5 | 9.5 | 9.9 | 9.7
Washington DC Truck Stop | 13 | 13 | 13.5 | 10.7 | 11.4 | 11.1
Vancouver Furious George | 14 | 15 | 11 | 10.5 | 12.5 | 11.5
Santa Barbara Condors | 16 | 11 | NA | NA | 11.9 | 11.9
Florida United | 15 | 14 | NA | NA | 12.6 | 12.6
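
The compressed range in the projections (nobody projected 1st or 16th) is what you would expect from a fitted linear model. The Python sketch below shows the mechanics under stated assumptions: the intercept and coefficients are hypothetical values chosen only to land on a similar scale, not the fitted values behind the table, and the helper names are made up for this example.

```python
def project(intercept, coef_a, coef_b, rank_a, rank_b):
    """Projected finish from a two-input linear model; None if either input is missing."""
    if rank_a is None or rank_b is None:
        return None
    return intercept + coef_a * rank_a + coef_b * rank_b

def average_available(values):
    """Average the projections that exist, skipping missing (None) ones."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

# Hypothetical coefficients, NOT the fitted values used for the table.
INTERCEPT, COEF_POWER, COEF_OTHER = 2.0, 0.4, 0.3

# A team ranked 1st by both inputs, and one ranked 16th by both.
best_case = project(INTERCEPT, COEF_POWER, COEF_OTHER, 1, 1)
worst_case = project(INTERCEPT, COEF_POWER, COEF_OTHER, 16, 16)

# A team with no finish last year gets only one projection, so the average is that value.
no_history = project(INTERCEPT, COEF_POWER, COEF_OTHER, 12, None)
combined = average_available([no_history, 10.7])

print(best_case, worst_case, combined)
```

Because the coefficients sum to less than 1 and the intercept is positive, even a team ranked first by every input projects to somewhere around 3rd, and the bottom team projects to better than last. The model is hedging toward the middle, which is why the gaps between teams are more informative than the raw projected places.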

  1. This includes the 2012 and 2013 College Championships and the 2012 Club Championships 

  2. Ultiworld did not have club power rankings in 2012 

  3. The only good predictor of a team’s performance at the 2012 club championships? Its 2011 championship finish: Ironside, Revolver, Doublewide and Chain finished in the top 4 in 2011, and three of them finished in the top 4 again in 2012. 

  4. Note the RRI score includes Nationals results, which is problematic 

  5. Again, prior year finish did especially well at the 2012 club championship. Final USAU ranking was a strong predictor of 2012 college championship performance, but was much weaker at the 2013 college championships. 

  6. Ideally, you might prefer to predict the semifinalists and finalists perfectly rather than the teams that finish 12-20 

  7. The two coefficients are about equally weighted in the regression model 

  8. More games and a larger sample size may be more important for the computer algorithms than for humans 

  9. I do have some personal knowledge of how the rankings are formulated 

  10. Let me be a bit more clear about my methodology and thinking here. To some extent, adding every ranking system increases a model’s overall fit to the line (R-squared). But this is a bit theoretically unsound for a variety of reasons. First, there’s lots of covariation amongst the systems; teams that rank high in RRI are likely to rank high in USAU rankings and also rank high in Skyd or Ultiworld power rankings. You generally don’t want to model lots of independent variables together that correlate strongly with each other. Second, with a sample size of only three championships, there is a strong danger of overfitting a model to past data in a way that makes it a worse predictor of future championships. There was also a nifty heuristic for checking whether adding more variables to the model exceeded reason: at a certain point, the sign of one of the variables would flip, meaning that teams ranked worse by one variable were predicted to perform better, which is a good indication there’s too much in the model. 

  11. Hat tip to Jimmy Leppert

  1. Sean Childers

    Sean Childers is Ultiworld's Editor Emeritus. He started playing ultimate in 2008 for UNC-Chapel Hill Darkside, where he studied Political Science and Computer Science before graduating from NYU School of Law. He has played for LOS, District 5, Empire, PoNY, Truck Stop, Polar Bears, and Mischief (current team). You can email him at stats@ultiworld.com.

  2. Max Cohen

    Max Cohen is an Ultiworld statistics contributor. He is a captain of NYU's open team, Purple Haze. He lives in New York City.
