Continued from Part I:
What is to be done?
So, why not allow margins of victory to go into ratings, to ensure they are as "accurate" as possible? The primary argument is that if larger margins of victory helped teams in the ratings, then some teams might be tempted to pile on meaningless points in lopsided wins in an attempt to inflate their ratings, thus improving their odds of landing in a BCS bowl. Ignoring scores, then, is an effort to avoid encouraging unsportsmanlike conduct.
However, there are real-world controls that inhibit running up the score. Running up the score generally involves leaving first-string players in the game long after it is out of reach. Any coach who does that to excess is not wise, as the risk of injury to their best players should outweigh any desire to run up a score. Even one needless injury could jeopardize a future victory against a competitive rival.
Another natural incentive to not run up scores is that any coach who does so is likely to be paid back in kind in a future season when their team is not so good. Certainly scores have been run up intentionally, but I would argue that it is a rare event, and few coaches really want to do it when the opportunity arises.
On the theoretical side, people who have designed power ratings that use scores have long known it is not wise to use a raw score when there is an extreme blow-out. Taking an example from history, Florida beat Central Michigan 82-6 in 1997. A margin of victory greater than 70 points is exceedingly rare in games between Division-1A teams. Because of this rarity, no serious prognosticator ever predicts a 76-point margin, even when the best team is playing the worst team. The best teams probably could beat the worst teams by 76 points nearly every time, but since coaches have more important priorities than huge margins, it simply does not happen often. When teams get ahead by 35 or 42 points, their senior players will be rested, and thereafter no one can predict whether the winning team's reserves will be energetic or unmotivated.
Since most designers of ranking systems want their ratings to be accurate, a score like 82-6 simply will not be treated as 82-6. There are any number of ways to adjust blow-out scores to something more realistic. My goal is not to cover statistical mathematics, but just as an example, a system designer might look at the fact that Las Vegas rarely sets point spreads greater than 50 points, and thus might adjust an 82-6 score to 56-6 in order to produce more accurate ratings. If such a step is not taken, then for subsequent games the power ratings will produce unrealistic predictions for the two teams involved in the blow-out.
As a matter of fact, if Florida had played Central Michigan a second time in 1997, we can be certain that the point spread would not have been 76. It would have probably been something more like 50 (the exact number would have depended on where the game was to be played). It only takes a little experience monitoring predictions and point spreads to confirm this statement. Thus, for any point-based rating system to be taken seriously, the designer must take precaution against overvaluing extreme results.
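To make that precaution concrete, here is a minimal sketch (in Python) of the kind of dampening rule a designer might apply before a score enters the ratings, using the 50-point cap discussed above. It is one illustration of the idea, not the rule used by any particular published system:

    def dampen(winner_pts, loser_pts, cap=50):
        """Return an adjusted score whose winning margin is at most `cap` points."""
        margin = winner_pts - loser_pts
        if margin > cap:
            winner_pts = loser_pts + cap   # e.g., 82-6 becomes 56-6
        return winner_pts, loser_pts

    print(dampen(82, 6))    # (56, 6)  -- the 1997 Florida blow-out capped at a 50-point margin
    print(dampen(31, 17))   # (31, 17) -- ordinary results are left alone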
To illustrate the need for dampening blow-out results, we can look at a simple rating system that treats all results as equally important, and makes no adjustments to outcomes other than for home-field advantage. I call the method "networked transitive comparison" (NTC) because it amounts to linking all teams mathematically with the theory that if Team A beats Team B by x points, and Team B beats Team C by y points, then Team A is x+y points better than Team C. Transitive comparisons are famous and fun because extremely absurd chains of logic can often be found where some obviously bad team is shown to be superior to several of the elite teams of the season. For example, a transitive chain of 31 links can be found for the 2007 season that shows Division III Kenyon was 377 points better than Kansas (the Division-1A team with the best won-lost record of the year). Regardless of such amusements, when all such transitive comparisons are made and averaged out, the result is numerical power ratings that will place most teams roughly where they belong relative to others (so long as teams have played more than a few games each). Kenyon does not come out ranked above Kansas when a full analysis of all scores is performed.
An NTC-style rating can be calculated in several ways, and the procedure has probably been given many different names. More detail on it would be appropriate for its own article. However, the details may be ignored if we simply look at some NTC ratings from the past and note that, among the dozens of power ratings published on the Web in recent years, many similar sets of rankings could be found.
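For the curious reader, here is a minimal sketch of one way an NTC-style rating could be computed: treat each game as a statement that the home team's rating minus the road team's rating should roughly equal the game margin less a three-point home-field adjustment, then solve all of those statements at once by least squares. The game data below are invented, and this least-squares formulation is only one of the several calculation methods alluded to above; it is not necessarily the program that produced the ratings shown next.

    import numpy as np

    # Hypothetical games: (home_team, away_team, home_score, away_score).
    games = [
        ("A", "B", 35, 14),
        ("B", "C", 24, 21),
        ("C", "A", 10, 28),
        ("A", "C", 17, 20),
    ]

    teams = sorted({t for g in games for t in g[:2]})
    idx = {t: i for i, t in enumerate(teams)}

    rows, margins = [], []
    for home, away, home_pts, away_pts in games:
        row = np.zeros(len(teams))
        row[idx[home]], row[idx[away]] = 1.0, -1.0
        rows.append(row)
        # Home margin, less the three-point home-field adjustment noted below.
        margins.append((home_pts - away_pts) - 3.0)

    # Pin the average rating at zero so the system has a unique solution.
    rows.append(np.ones(len(teams)))
    margins.append(0.0)

    ratings, *_ = np.linalg.lstsq(np.array(rows), np.array(margins), rcond=None)
    for team, rating in sorted(zip(teams, ratings), key=lambda x: -x[1]):
        print(f"{team}  {rating:+6.2f}")

Under this formulation, a predicted margin for any future matchup is simply the difference in ratings, plus three points for the home team.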
An interesting case is the 2002 season. Below are the power ratings as calculated by my NTC program. This program adds three points to the scores of all road teams.
     rank  team              W   L   power
     ----------------------------------------
       1   Kansas State     11   2   33.92
       2   Southern Cal     11   2   31.12
       3   Oklahoma         12   2   27.41
       4   Miami (Florida)  12   1   24.75
       5   Texas            11   2   22.90
       6   Georgia          13   1   21.16
       7   Penn State        9   4   19.98
       8   Iowa             11   2   19.80
       9   Ohio State       14   0   19.25
      10   Alabama          10   3   18.64
Ohio State, the only undefeated Division-1A team, was the consensus National Champion of 2002. However, Ohio State failed to win by convincing margins in half of their games. Kansas State, on the other hand, lost two close games and won several games by extreme margins (68-0, 64-0, and 58-7, to name a few). Southern California and Oklahoma followed a similar pattern, though with somewhat less extreme blow-outs.
It is not easy to document now, but among the rating systems being published on the Web that year, a large fraction did not place Ohio State at #1. Jeff Sagarin's "Predictor" method had Kansas State #1 and Ohio State #8, nearly matching the NTC rankings. Sagarin is a BCS selector, but the Predictor is not the system he supplies to the BCS; his Elo chess system is, and it ranked Ohio State #1. (Note that if one looks up the 1997 Sagarin ratings on the Internet, the rankings found will not be those of the Elo chess system, as he only began publishing those rankings after the BCS prohibited the use of scores in its computer rankings.)
Clearly, the inclusion or exclusion of scores in the process of calculating ratings makes a drastic difference. What is interesting is that the impact is not always what the BCS would hope to see. The 2002 Ohio State team benefits from the exclusion of scores, but the 1997 Nebraska and Michigan teams suffer from it. The best methods actually lie somewhere in the middle: scores help improve rating accuracy, but they should not be overly relied upon or taken too literally.
What's more, while using scores to make computer ratings does generally make for more appealing rankings (in my opinion), score-based rankings could still leave undefeated powers out in the cold in the BCS. The system I have put the most time into (and that produces the ratings I post on my web site) ranks one-loss Florida State above Michigan for 1997. Some fans would find that outrageous. To respond to those fans, we have to go into another long-standing debate about ranking methodology: do undefeated teams always deserve to be ranked above teams with losses?
It depends on your definition of "deserve." With my ratings I am more interested in a basis for prediction. My ranking of Florida State above Michigan in 1997 is only a guess that Florida State would have been slightly favored if the teams had met. Being undefeated does not mean a team is guaranteed to be favored in upcoming games! Undefeated teams are quite commonly underdogs, even late in the season.
This is the heart of the matter. What is the purpose of computer rankings? The BCS has a purpose of setting up a true national championship game (or so we assume). With such an important purpose, should they not make certain they use the best ranking models? I, for one, am not convinced they made much of an effort to investigate the quality of their models.
No one would take seriously a bowl-selection system that produced output like we have seen for 1997 or 2002. However, the NTC method is also capable of producing perfectly reasonable ratings that would not be controversial. The same could happen with any numerical ranking method. Ratings might appear reasonable for several seasons, but eventually there will probably be a year where the particular jumble of scores just does not allow for ratings that most fans would find realistic.
Once we appreciate this, an obvious conclusion is that evaluating a ranking system has to involve looking at rankings for a large number of seasons. Systems that come up with the smallest number of plainly objectionable rankings (such as Ohio State far removed from #1 in 2002) are the types of systems that should be employed by the BCS. I do not see how anyone could object to this common-sense notion, and yet the BCS organizers have never supplied historical ratings from their chosen systems, nor have they said whether or not they examined many seasons of past ratings in choosing their systems, nor have they come out with any statements whatsoever on how they decided which systems were of high enough quality to merit inclusion.
I have no desire to impugn the work of the people behind the various rating systems of the BCS. I am sure they do as well as can be done without consideration of scores. Rather, I am simply pointing out that those managing the BCS system have not demonstrated a knowledge of computer ratings, and they have not supplied evidence that they are using the best systems available. Good evidence could be provided by simply publishing the ratings their systems would have produced going back a few decades (or even all the way back to 1869) for all interested parties to compare and discuss.
Until basic steps like that are taken, the BCS should expect nothing but sarcasm and skepticism from a fan community that rightfully views their ratings as a proverbial black box.
Any number of systems can be designed that realistically handle extreme results, and in general, the math involved is so complex that there is no way any coach could anticipate what game outcomes would most enhance their power ratings. On top of that, if several such systems were used by the BCS, there would be absolutely no way to predict the benefit to be had from any particular score, whether it be a blow-out or not. Coaches would not be given any new incentive for unsportsmanlike conduct if BCS computer systems considered scores.
However, rather than advocating the adoption of systems that incorporate scores, my recipe for getting rid of the annual BCS hype and consternation would be to just do away with their ratings scheme. The formula that combines the computer ratings and opinion polls produces an impression that science has an answer. If only enough numbers are thrown around then the system must be right! That's the illusion promulgated as they tweak their formula over time.
The truth is that ratings and predictions are a very tricky business. The average prediction error for computer systems and betting lines is around 12 points. Predictions that are off by 20 or 30 points are just about as common as predictions that are right on the nose. Given that, who can really believe that there is some absolute truth as to which two teams deserve to be in a National Championship game? Just as no prediction is a "lock," no ranking method is perfect, nor is any consortium of methods perfect.
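For anyone wanting to check claims like this, measuring a system's accuracy is straightforward bookkeeping: compare its predicted margins to the actual margins over many games and average the error. The numbers below are invented purely to show the arithmetic:

    # Hypothetical (predicted margin, actual margin) pairs for a set of games.
    results = [(7, 21), (-3, -6), (14, -7), (21, 38), (3, 17), (-10, 3)]

    errors = [abs(pred - actual) for pred, actual in results]
    mae = sum(errors) / len(errors)
    print(f"mean absolute error: {mae:.1f} points")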
The NCAA/BCS should study the historical rankings of several systems to find out which are the most reliable over the long term. Then, rather than dogmatically believing in a formula that combines different rankings, the best systems (whether or not they utilize scores) should be used as baseline references, not as commandments set in stone.
Unfortunately, using computer ratings as only a tool to aid human judgement leaves humans to make the final decisions. Then things are opened up for charges of bias, politics, and so forth to muddy the waters. In that sense, I do admire the spirit of the BCS rankings. For the appearance of fairness, what better to resort to than pure mathematics? However, as has been clearly demonstrated in this article and elsewhere, no mathematical system is perfectly reliable, and no two systems agree.
My suggestion for maximizing happiness would be that the BCS adopt a new philosophy. They should simply state that if two and only two "major conference" Division-1A teams go undefeated in a season, those two teams should play in the championship game. If the number of undefeated teams is anything other than two, then the top two teams in the BCS rankings should be chosen. (Note, such a scheme would require more qualifications and possibly new rules on non-conference scheduling practices, but this suggestion is not the focus of the article, so I will leave it at that.) And to lend more credibility to their rankings, the BCS should do a historical review of rankings to prove to the public that they have studied the issue.
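Written out as a short Python sketch (glossing over the qualifications just mentioned about what counts as a "major conference" team), the proposed rule is simply:

    def pick_title_game(undefeated_majors, bcs_top_two):
        """Return the two teams for the championship game under the proposed rule."""
        if len(undefeated_majors) == 2:
            return tuple(undefeated_majors)
        return tuple(bcs_top_two)

    # 2002: exactly two undefeated major-conference teams, so they play.
    print(pick_title_game(["Ohio State", "Miami (Florida)"],
                          ["Miami (Florida)", "Ohio State"]))

    # 2004: three undefeated majors, so fall back to the BCS top two.
    print(pick_title_game(["Southern Cal", "Oklahoma", "Auburn"],
                          ["Southern Cal", "Oklahoma"]))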
Of course, another option to change things would be a play-off. However, the same problems would still exist in that some method for choosing and seeding the play-off field would be needed, and that process would likely be under an even brighter spotlight. The NCAA and its BCS group have been warned: If they are going to pretend to be scientists, they had better back up their theories with evidence.
Posted October 22, 2008
Copyright 2008. All rights reserved.
Jon Dokter