
Why Haven’t We Improved At Making NFL Predictions?

Yesterday, we looked at the biggest “covers” in NFL history: those games where the final score was farthest from the projected margin of victory. In a 2010 game in Denver, the Raiders were 7-point underdogs, but beat the Broncos by 45 points. That means the point spread was off by 52 points, the most in any single game.

The first year for which we have historical point spread data is 1978. That year, the average point spread was off by 9.9 points; in other words, the final margin differed from the projected margin by an average of 9.9 points. That number probably doesn’t mean much in the abstract, so let’s give it some context. From 1978 to 1982, the average point spread was off by 10.4 points. Over the last five years, the average point spread has been off by… 10.3 points.

Now I’m not quite sure what you expected, but isn’t that weird? In 1978, Vegas bookmakers were using the most rudimentary of models. Think of how much further along football analytics is now than it was four decades ago. All of that work, of course, has to have made us *better* at predicting football games, right?

But don’t these results suggest that we are not any better at predicting games? If Vegas was missing games by about 10 points forty years ago, why is it still missing games by about 10 points? One explanation is that the NFL is harder to predict now, which… well, I’m not so sure about that. After all, even if you think free agency and the salary cap bring about parity (a debatable position in its own right), the lines are no more accurate later in the season, once more information is known. Games are also slightly higher scoring now, so you could make the argument that we should measure how far games are off as a percentage of the projected over/under.

Let’s look at the data. The graph below shows in blue the average “cover” in each game for each year since 1978.  As it turns out, 2016 was a really good year for Vegas — the average cover was just 9.0 points, which ranks as the most accurate season ever.  However, there’s no evidence that this was anything more than a one season blip: 2013 and 2015 were average years, and 2014 was the least accurate season ever.  It’s not like our prediction models just started getting sophisticated last season.

For reference, in the orange line, I have also shown the average point spread for each game.  That line has also been pretty consistent over time, with the average spread usually being just above 5 points.

On average, the “winning” team in Vegas has covered the spread by 10.4 points over the last 39 years. Should we expect that number to ever decrease? There is a certain level of unpredictability inherent in sports in general, and it’s probably even higher in football. That said, perhaps the better question is why that number wasn’t even higher. Even limiting our look back, consider 1994 and 1995: the average cover was still less than 10 points in both seasons, and think about how antiquated our prediction models were 23 years ago.
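
To make the metric concrete, the “average cover” discussed here is just the mean absolute difference between the final margin and the closing spread. A minimal sketch in Python, using made-up spreads and margins rather than the article’s actual dataset:

```python
# Mean absolute "cover": how far the final margin lands from the spread.
# Both numbers are from the favorite's perspective; the data is invented.
def average_cover(spreads, margins):
    return sum(abs(m - s) for s, m in zip(spreads, margins)) / len(spreads)

spreads = [7.0, 3.0, 10.0]      # favorites laid 7, 3, and 10 points
margins = [14.0, -4.0, 10.0]    # won by 14, lost by 4, won by 10
print(average_cover(spreads, margins))  # (7 + 7 + 0) / 3 ≈ 4.67
```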

What conclusions can you draw?

  • Doug

    This is pretty interesting.

    According to my quick calculations, the average margin of victory since 1978 is 11.67 points. So if you used a pick ’em line for every game, you’d be off by an average of 11.67. If you used a line of Home Team -3 for every game, your average miss since 1978 would have been 11.31. The actual average absolute difference from the spread has been 10.49 during the same period. So here’s what we’ve got.

    Completely Naive line: 11.67
    Naive line with HFA: 11.31
    Actual line: 10.49.

    Now the big question is: where does some sort of SRS-based line prediction fit in there? IF it looks something like this

    Completely Naive line: 11.67
    Naive line with HFA: 11.31
    SRS-based line: 10.6 ??????
    Actual line: 10.49.

    Then we have our answer, and that answer is that we’re essentially at the theoretical limit and have always been. Home field does ~30% of the work, rudimentary power rankings do ~60% of the work. Of the remaining 10%, a big chunk of it is probably the injury report. Whatever is left is just not big enough to show itself through the noise.

    But that’s just a theory.
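
Doug’s hypothetical SRS-based line can be sketched. An SRS-style (Simple Rating System) rating solves each team’s rating as its average margin plus the average rating of its opponents, and a line is then the rating gap plus home-field advantage. The teams, results, iteration scheme, and the 3-point HFA below are illustrative assumptions, not Doug’s actual calculation:

```python
from collections import defaultdict

# SRS-style ratings: each team's rating is its average margin adjusted for
# opponent strength, found by simple fixed-point iteration.
def srs_ratings(games, iters=200):
    """games: list of (home, away, home_margin). Returns {team: rating}."""
    margins, opponents = defaultdict(list), defaultdict(list)
    for h, a, m in games:
        margins[h].append(m); opponents[h].append(a)
        margins[a].append(-m); opponents[a].append(h)
    teams = list(margins)
    avg_margin = {t: sum(margins[t]) / len(margins[t]) for t in teams}
    ratings = {t: 0.0 for t in teams}
    for _ in range(iters):
        ratings = {t: avg_margin[t] +
                      sum(ratings[o] for o in opponents[t]) / len(opponents[t])
                   for t in teams}
        mean_r = sum(ratings.values()) / len(ratings)
        ratings = {t: r - mean_r for t, r in ratings.items()}  # re-center at zero
    return ratings

def srs_line(ratings, home, away, hfa=3.0):
    # Predicted home-team margin: rating gap plus an assumed 3-point HFA.
    return ratings[home] - ratings[away] + hfa

# Invented mini-season: A beat B by 10, B beat C by 10, A beat C by 20.
games = [("A", "B", 10), ("B", "C", 10), ("A", "C", 20)]
r = srs_ratings(games)
print(round(srs_line(r, "A", "C"), 2))  # rating gap 20 plus HFA → 23.0
```

On this toy schedule the ratings converge to A = +10, B = 0, C = −10; checking where such a line falls between the naive baselines and the actual closing line is exactly Doug’s proposed test.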

    • Doug! Interesting thoughts – I should build that idea into a follow-up post.

      • Richie

        Awesome! We’re one tiny step closer to the return of the podcast.

      • sunrise089

        Best day ever!

  • unitaryexecutive

    Love your analysis, but I think you are looking at this the wrong way.

    The spread is meant to act as a median outcome, so the ~10 point number you reference is more like a measure of dispersion than of error. I think the more interesting look would be at what percentile of outcomes the spread has sat over the course of the last 40 years (e.g., of all the games where the line was -3, the favorite winning by 3 was the 51st percentile outcome).

    Then you will see whether we’ve gotten better at predicting results.
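
unitaryexecutive’s check can be sketched in a few lines: for all games that closed at a given number, ask what fraction of actual margins fell below that number (counting pushes as half). A well-set line should sit near the 50th percentile. The margins below are invented for illustration:

```python
# What percentile of outcomes did the spread sit at? Pushes count as half.
def spread_percentile(spread, margins):
    below = sum(m < spread for m in margins)
    pushes = sum(m == spread for m in margins)
    return (below + 0.5 * pushes) / len(margins)

# Hypothetical final margins (favorite's perspective) in games closed at -3:
margins_at_minus_3 = [7, -3, 3, 14, 0, 3, -10, 21, 6, 1]
print(spread_percentile(3, margins_at_minus_3))  # 0.5, i.e. a well-set line
```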

    • I’m not sure I agree both that the spread is supposed to act as the median rather than the mean *and* that such a distinction is material.

      That said, let’s say you’re right. Any predictions on how the results would look if I ran the numbers as you suggest?

      • unitaryexecutive

        This is probably one of those times that I wish I wasn’t typing on a phone as I probably did a crappy job of making my point.

        For what it’s worth (and I’ll admit that my expertise on these matters is more basketball, where I *know* this is the case, than football), I would say that in the 70s/80s favorites covered in the mid to upper 40s percent of the time, and now it’s almost an even 50/50 split. That would represent your prediction improvement.

        With respect to materiality, I would argue that such a change is material because say a 3 percentile change is pretty big if you’re betting at -105.

        • unitaryexecutive

          And of course football presents a special set of challenges as -3 -105 and -3 -125 are predicting two different ranges of outcomes, so you can’t just say that both are predicting a 3 point game (if that makes sense)

  • evo34

    One issue is that there is so much public money in the NFL that the closing spread does not necessarily reflect the projection of an advanced model. I also agree with unitaryexecutive below. The variance of 10 pts. around the median expected outcome may just be a part of the game. No model can change this. To measure changes in accuracy, measure how close the spread (adjusted for skewed juice if needed — e.g., -3 -125 is not 3.0 pts.) comes to the 50th percentile outcome.

    • evo34

      Alternatively, look at the no-vig closing moneyline (and its implied win pct.) from Pinnacle since 2000, and see if it has gotten more accurate over time.
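
Stripping the vig from a two-way moneyline, as evo34 suggests, is a standard normalization: convert each side’s American odds to an implied probability, then rescale the pair to sum to 1. The odds below are made up, not actual Pinnacle lines:

```python
# American moneyline -> implied win probability (vig still included).
def implied_prob(ml):
    return 100 / (ml + 100) if ml > 0 else -ml / (-ml + 100)

# Remove the bookmaker's margin by normalizing the two sides to sum to 1.
def no_vig_probs(ml_favorite, ml_underdog):
    p_fav = implied_prob(ml_favorite)
    p_dog = implied_prob(ml_underdog)
    total = p_fav + p_dog          # > 1 because of the vig
    return p_fav / total, p_dog / total

fav, dog = no_vig_probs(-150, +130)
print(round(fav, 3), round(dog, 3))  # favorite ≈ 0.58, underdog ≈ 0.42
```

Comparing these no-vig probabilities against actual win rates over time would show whether the market’s accuracy has improved.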

    • Richie

      I often wonder what would happen if Las Vegas put out some crazy lines one week: perhaps a randomly selected spread between +7 and -7 for every game. Would the action naturally push the lines back to their natural positions, or would the information of a crazy line “trick” people into thinking it was realistic?

  • Four Touchdowns

    Not really based on much besides my own reactions off the top of my head, but this kinda falls into the same logic I use when I say I can’t really pick one player to be *the* best QB of all time — there’s just way too much random variability in each game, more than analytics can really track.

  • Richie

    Would using median instead of mean be relevant here? Last year there were only two games with point spreads of 14 or more points. But there were 79 games that ended with margins of 14 or more points. There are always going to be a couple of games that are 20+ point blowouts every week, but almost never any point spreads that high. So those blowout games are going to skew the averages.
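
Richie’s point about blowouts skewing the average is easy to see numerically: a single lopsided game pulls the mean miss up while leaving the median untouched. A sketch with invented weekly misses:

```python
# Hypothetical per-game misses, |margin - spread|, including one blowout.
misses = [3, 4, 6, 7, 8, 10, 12, 14, 17, 38]

mean_miss = sum(misses) / len(misses)
sorted_m = sorted(misses)
mid = len(sorted_m) // 2
median_miss = (sorted_m[mid - 1] + sorted_m[mid]) / 2  # even-length list

print(mean_miss, median_miss)  # mean 11.9 vs median 9.0
```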

  • Phil

    The blog Slate Star Codex talks about the concept of things being ‘anti-inductive’:

    II.

    Douglas Adams once said there was a theory that if anyone ever understood the Universe, it would disappear and be replaced by something even more incomprehensible. He added that there was another theory that this had already happened.

    These sorts of things – things such that if you understand them, they get more complicated until you don’t – are called “anti-inductive”.

    The classic anti-inductive institution is the stock market. Suppose you found a pattern in the stock market. For example, it always went down on Tuesdays, then up on Wednesdays. Then you could buy lots of stock Tuesday evening, when it was low, and sell it Wednesday, when it was high, and be assured of making free money.

    But lots of people want free money, so lots of people will try this plan. There will be so much demand for stock on Tuesday evening that there won’t be enough stocks to fill it all. Desperate buyers will bid up the prices. Meanwhile, on Wednesday, everyone will sell their stocks at once, causing a huge glut and making prices go down. This will continue until the trend of low prices Tuesday, high prices Wednesday disappears.

    So in general, it should be impossible to exploit your pattern-finding ability to profit off the stock market unless you are the smartest and most resourceful person in the world. That is, maybe stocks go up every time the Fed cuts interest rates, but Goldman Sachs knows that too, so they probably have computers programmed to buy so much stock milliseconds after the interest rate announcement is made that the prices will stabilize on that alone. That means that unless you can predict better than, or respond faster than, Goldman Sachs, you can’t exploit your knowledge of this pattern and shouldn’t even try.

    Here’s something I haven’t heard described as anti-inductive before: job-seeking.

    When I was applying for medical residencies, I asked some people in the field to help me out with my interviewing skills.

    “Why did you want to become a doctor?” they asked.

    “I want to help people,” I said.

    “Oh God,” they answered. “No, anything but that. Nothing says ‘person exactly like every other bright-eyed naive new doctor’ than wanting to help people. You’re trying to distinguish yourself from the pack!”

    “Then…uh…I want to hurt people?”

    “Okay, tell you what. You have any experience treating people in disaster-prone Third World countries?”

    “I worked at a hospital in Haiti after the earthquake there.”

    “Perfect. That’s inspirational as hell. Talk about how you want to become a doctor because the people of Haiti taught you so much.”

    Wanting to help people is a great reason to become a doctor. When Hippocrates was taking his first students, he was probably really impressed by the one guy who said he wanted to help people. But since that time it’s become cliche, overused. Now it signals people who can’t come up with an original answer. So you need something better.

    During my interviews, I talked about my time working in Haiti. I got to talk to some of the other applicants, and they talked about their time working in Ethiopia, or Bangladesh, or Nicaragua, or wherever. Apparently the “stand out by working in a disaster-prone Third World country” plan was sufficiently successful that everyone started using it, and now the people who do it don’t stand out at all. My interviewer was probably thinking “Oh God, what Third World country is this guy going to start blabbering about how much he learned from?” and moving my application to the REJECT pile as soon as I opened my mouth.

    I am getting the same vibe from the critiques of OKCupid profiles in the last open thread. OKCupid seems very susceptible to everybody posting identical quirky pictures of themselves rock-climbing, then talking about how fun-loving and down-to-earth they are. On the other hand, every deviation from that medium has also been explored.

    “I’m going for ‘quirky yet kind’”.

    “Done.”

    “Sarcastic, yet nerdy?”

    “Done.”

    “Outdoorsy, yet intellectual.”

    “Done.”

    “Introverted, yet a zombie.”

    “I thought we went over this. Zombies. Are. Super. Done.”

    http://slatestarcodex.com/2015/01/11/the-phatic-and-the-anti-inductive/

    ————————-
    Anyway, there might be something similar going on with football: the game is constantly changing and evolving, and it’s possible that it changes faster than we can amass instances to study with any sort of scientific confidence.
    To take a glossed-over example: a few years ago some teams started running read-option concepts and had a lot of success with them.
    As predictors, how many instances of the read option did we need to see before we could confidently project how successful those teams would be going forward?
    While we were amassing that sample, defenses were experimenting with different ways to defend it.
    At some point the effectiveness of the read option stabilizes as a concept, but by that point we’re on to some new conceptual evolution.
    That’s what makes it tough to predict over the long term: the game is constantly in a state of evolutionary flux.
