Data Snooping

Reggie Wayne dominates when seeing blue.

Over the last few years, the football analytics movement has made tremendous progress.  There are many really smart people with access to a large amount of useful information who have helped pioneer the use of statistics and analytics in football.  Major news organizations and NFL teams seem to be embracing this movement, too.  Unfortunately, there are some less-than-desirable side effects as the reward for presenting “statistical information” seems larger than ever.

Data snooping is the catch-all term used to describe a misuse of data mining techniques. There are perfectly legitimate uses of data mining, but data snooping is a big ‘no-no’ for the legitimate statistician. If the researcher does not formulate a hypothesis before looking at the data, but instead uses the data to suggest what the hypothesis should be, then he or she is data snooping.

I’m guilty of data snooping, but (hopefully) only in a tongue-in-cheek fashion. When I said Reggie Wayne was much better against blue teams than other opponents, that was data snooping. We’ve all been taught that history repeats itself; when it comes to statistical analysis, that translates to “if the evidence indicates a strong relationship in the past, then it is likely to continue in the future.” For example, history tells us that first-round picks will perform better, on average, than sixth-round picks. That’s both what the data suggest and an accurate statement.

But what happens when the data suggest that being born on February 14th or February 15th means a player is more likely to be a great quarterback?  After all, the numbers tell us that 14% of all the NFL’s 31,000-yard passers were born on one of those two days, which only account for 0.6% of the days of the year.  Just because history tells us that those dates are highly correlated with success — and the p-value would surely be very impressive — doesn’t mean that there is any predictive value in that piece of information.
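To see why an impressive p-value means little here, the sketch below compares two probabilities: the chance that a two-day window named in advance holds a cluster of birthdays, and the chance that some two-day window, found only after looking, holds one. The group size of 21 players and the threshold of 3 are assumptions chosen only so that 3 of 21 is roughly 14%; the actual counts are not given above. The function names are hypothetical, and birthdays are assumed to be uniform over a 365-day year.

```python
import math
import random

def max_two_day_cluster(birthdays, days_in_year=365):
    """Largest number of birthdays falling in any two consecutive days."""
    counts = [0] * days_in_year
    for day in birthdays:
        counts[day] += 1
    # The window wraps around, so Dec 31 and Jan 1 count as consecutive.
    return max(counts[i] + counts[(i + 1) % days_in_year]
               for i in range(days_in_year))

def chance_of_cluster(n_players, threshold, trials=20000, seed=0):
    """Simulated chance that SOME two-day window holds >= threshold birthdays."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(n_players)]
        if max_two_day_cluster(birthdays) >= threshold:
            hits += 1
    return hits / trials

def fixed_window_tail(n_players, threshold, p=2 / 365):
    """Exact chance that ONE pre-specified two-day window holds >= threshold."""
    return sum(math.comb(n_players, j) * p**j * (1 - p)**(n_players - j)
               for j in range(threshold, n_players + 1))
```

Calling `fixed_window_tail(21, 3)` gives the naive figure a snooper would report, while `chance_of_cluster(21, 3)` estimates how often pure chance produces a cluster that looks just as good somewhere on the calendar. The second number comes out far larger, and it still ignores every other trait (jersey color, home state, and so on) that could have been searched.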

In rebuking a different football hypothesis, Maurile Tremblay once wrote:

Such hypotheses are typically formed using all the data currently available – which means that there are no fresh data left to test them on. It is a fundamental rule of hypothesis-testing that, whenever possible, you should not use the same data to both formulate and test your hypothesis. A short example will illustrate why this is so. Suppose I roll a six-sided die 100 times and analyze the results. I will be able to find many patterns in the results of those 100 rolls. I may find, for example, that a three was followed by a six 40% of the time, or that a one was never followed by a six. Would you trust any such patterns to hold true over the next hundred rolls? You shouldn’t. If they do, it would just be coincidence. It is easy to find patterns by looking for them in a given set of data; but the test of whether those patterns are meaningful is whether they hold true in data that have not yet been examined.

I bolded that last sentence because it’s the most important one in this post. This concept is obvious when we talk about dice or birthdays or jersey color. We know there’s randomness there. But when we talk about football, rationality seems to go out the window.  If you are going to use all of your data when running a study, you must formulate your hypothesis beforehand. If you simply want to look for what variables correlate with success after mining through your data, then you’re data snooping.  If you want to present yourself as data snooping, then that’s fine. But once you’ve mined through all your data, you have no data left on which you can properly test your theory.
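The die example from the quote can be tried directly. The sketch below is only an illustration under stated assumptions: it rolls a fair simulated die 100 times, picks the strongest “a was followed by b” pattern in those rolls, and then checks the same pattern on 100 fresh rolls. The function names and the seed are hypothetical choices, not anything from the quoted piece.

```python
import random
from collections import Counter

def follow_rates(rolls):
    """Map (a, b) to the share of times face a was immediately followed by b."""
    pair_counts = Counter(zip(rolls, rolls[1:]))
    first_counts = Counter(rolls[:-1])  # the last roll has no successor
    return {(a, b): pair_counts[(a, b)] / first_counts[a]
            for a in first_counts for b in range(1, 7)}

def snoop_then_test(n_rolls=100, seed=1):
    """Find the strongest pattern in one batch, then score it on a fresh batch."""
    rng = random.Random(seed)
    explore = [rng.randint(1, 6) for _ in range(n_rolls)]
    holdout = [rng.randint(1, 6) for _ in range(n_rolls)]

    explore_rates = follow_rates(explore)
    best_pair = max(explore_rates, key=explore_rates.get)

    # The pattern was chosen without looking at the holdout rolls.
    holdout_rate = follow_rates(holdout).get(best_pair, 0.0)
    return best_pair, explore_rates[best_pair], holdout_rate
```

For a fair die every follow rate is 1/6 in the long run, so the snooped rate is inflated by the search itself, and the holdout rate is expected to sit near 1/6 whatever the first batch suggested. Running `snoop_then_test` with several seeds shows how the “pattern” changes from batch to batch.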

So what can you do if you want to data snoop?  One option is to split the data into two groups: this works well if you want to data snoop first and ask questions later.  Another option is to data snoop and then present the article as one of trivia. A third option is to critically analyze any relationship you discover without regard to p-values. Any football theory should be able to hold water without regard to significance tests. You should be able to explain your theory to a casual fan, and what the possible drawbacks and criticisms of your theory are, too.
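The first option, splitting the data, takes only a few lines. This is a minimal sketch, assuming the data set is a list of records (one per player, game, or season); `split_sample`, its parameters, and the 50/50 default are hypothetical choices rather than a prescribed method.

```python
import random

def split_sample(records, holdout_share=0.5, seed=42):
    """Shuffle a copy of the records and split it into (explore, holdout)."""
    shuffled = list(records)
    # A fixed seed makes the split reproducible, so the holdout group
    # cannot quietly change after the snooping has started.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_share))
    return shuffled[:cut], shuffled[cut:]
```

The discipline matters more than the code: snoop freely in the first group, write the hypothesis down, and only then open the second group, once. For football data a split by season (older seasons to explore, recent ones to test) may be a more natural choice than a random shuffle.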

What you can’t do is fall back on significance tests — that’s the Wyatt Earp problem. Or say “well even if you find this hard to believe, that’s what the numbers say!” Finding patterns is easy; finding meaningful relationships is hard.

  • Honestly, this may be my favorite post, Chase. Very important topic. Couldn’t agree more with needing to split the sample.

    • Chase Stuart

      Thanks Danny.

  • There seem to be a couple of national bloggers out there now who are making conclusions based on data “facts” when in truth their “facts” fit their personal biases perfectly. You wouldn’t care to share what prompted you to write this valuable article, would you?

    • Chase Stuart

      An e-mail from awhile back.

    • Perhaps someone tricked or forcibly coerced Chase into reading a Kerry Byrne piece…

      • I was actually thinking Scott Kacsmar & Kerry Byrne when I wrote that. ; )

  • Wade Iuele

    This makes perfect sense to me.

    But should the principle of snooping half the data, forming a hypothesis, and then testing it on the other half also be used in player evaluation?

    I’ve heard analysts like Ron Jaworski say “I watched every throw Player X made in college and here’s what I found.”. Should they watch half the throws, form a hypothesis, then watch the other half? I think what those analysts are doing is comparing the new data to all the historical data already in their heads. Is that a valid method?

    I fear the Jaworski method is what I use a lot for fantasy football. But I am open to change.

    • Neil

      Player evaluation is different. There’s no pretension of science. Statistics are at best misleading and at worst fraudulent for most purposes. Even if you follow Chase’s formula, the statistics you have right now have nothing to do with the statistics that will come in the future. There’s no predictability, no immutable rules in chaotic systems like the weather or football; thus no hypothesis to be made about either is scientifically falsifiable. The best book on this is The Future of Everything by David Orrell.

      Film evaluation’s just describing what you see. Again, no pretension of science or, more specifically, falsifiability. Evaluations may not have any bearing on the future of that player, but any reasonable person will realise that if your scouting report doesn’t hold up in the future, there are a million possible reasons why besides you don’t know what you’re looking at. Big contracts, coaching changes, the wrong woman, etc.

      • bengt

        I propose to be more careful with the usage of terms here. You should say ‘noisy’ or something similar here instead of ‘chaotic’ (i.e. the generic behaviour of some low dimensional nonlinear systems). First, of course there are immutable rules for chaotic systems, that’s why they are being studied. Second, football for sure is not a chaotic system.
        OTOH, finding a strange attractor in a nonlinear system of football-derived differential equations could win one the Nobel Prize for Sports Sciences. 😉
        And the article is great. Precise, stringent, concise.

        • Neil

          Chaotic’s a better word for what I mean because I just don’t see how you can assume there’s anything immutable with football. But let me add another facet to that description. A phenomenon like football is too complex to be rendered mathematically. Surely, on some level it’s comprehensible in that one very, very complex thing is always happening to the exclusion of all other possibilities (this does not mean there are immutable rules), but a system of such complexity is well outside of the grasp of the human mind and would require modern computing power with many zeros added on to the end assuming humans could write the program that perfectly encapsulated football. And we haven’t even begun to quantify what thing(s) drive the evolution of the game and how. Where does the noise come from if not this? The actual rules of the game aren’t even immutable because the NFL tweaks a couple every season.

          Feel free to disagree with me. I just have no idea how the entirety of football could be reduced to something like the laws of physics. Besides the tautological like more points is always better…wait, football could find a way to be scored like golf. Unlikely, but not impossible.

          Not to be mean, your assumption that there are immutable rules for chaotic systems because people study them is circular reasoning. If I’m correct, this wouldn’t be the first thing humans studied using an incorrect set of assumptions to inform their methods.

          The one question I have about Chase’s data-splitting solution is how does this increase the predictive value of an opinion based on those data? That’s what he means by patterns being meaningful, right? The future will always come along, and besides the obvious like more points is better, I don’t see how any conclusion can be assumed to have predictive value based solely on the rigour of statistical analysis. Even if the prediction is right, it might have just been a lucky guess.

          The real missing piece to me is being able to explain why something is right. You can’t just say I looked at the data and used the proper methodology to divine the future from what happened in the past. For one thing, you’re not even using what happened in the past for your divination: you’re using a numerical abstraction of what happened in the past that captures barely a fraction of the totality of the event. On the other hand, your attempt at reasoning might be wrong; you might fail with your attempt at divining the future, but unless you try to know something completely and can explain it completely you have no chance of being right aside from dumb luck.

          • bengt

            The point I was trying to make is that ‘chaos’, short for ‘deterministic chaos’, is a term with a defined meaning(*). Nothing in Chase’s article is connected even a tiny bit to this discipline AFAICS. I have no knowledge of any work on the topic of ‘Chaotic behaviour in football data’, but it is certainly possible to try it.
            To get back on and contribute to the topic, one could take as a time series the number of carries per time period by an RB, generate a phase portrait by taking every n-th value and embed it into m dimensions, and calculate the Hausdorff dimension. Maybe one would then find chaotic signatures, meaning that some immutable rules of chaotic systems are applicable, but I consider it extremely unlikely.
            (*) E.g. ‘the weather’ is a noisy system because it has so many degrees of freedom – infinitely many as far as computing is concerned. But the system of three coupled differential equations that meteorologist Edward Lorenz developed when studying atmospheric convection http://en.wikipedia.org/wiki/Lorenz_equations is chaotic.

            • Neil

              From Lorenz himself:

              “Chaos: When the present determines the future, but the approximate present does not approximately determine the future.”

              Isn’t this what I’m saying? Again, feel free to disagree with my characterisation, but I think my diction is fine. Even with an inanimate system like the weather this is the case. Football is played by humans who sometimes like to get drunk on work nights. That quote doesn’t even capture my opinion that the past determines the future, but it’s impossible to tell how through the human mind; IE, we can’t comprehend why, how or how much the system evolves over time.

              Chase’s article is connected to this because he says he has found a formula for making statistic inferences have predictive value. Unless I’ve misunderstood his use of the word “meaningful”, but I’ve always thought that was code for predicting the future, and predicting the future was the holy grail of statistical analysis. I’m saying no such thing is possible because of the nature of the system he is applying his formula to.

              Why do we know that February 14th births can’t possibly result in star QBs? Because we can’t explain the causality deductively. Well, this planet lines up this way on that day of the year, exerting the tiniest bit of gravity on the uteruses of American mothers to be, changing the brain chemistry of the child to be adept at reading defenses…it’s preposterous. It has nothing to do with how you looked at the stats to find that conclusion. All stats do is describe the past (very poorly in most instances, in my opinion). Chase’s formula doesn’t change the fact that he’s trying to find “meaningful patterns” in past activity in a system that’s nothing like physics.

              • bengt

                No, this is what you were saying: “There’s no predictability, no immutable rules in chaotic systems like the weather or football;”
                There is predictability in chaos. “When the present determines the future” says exactly that. There are dynamical equations – or, put another way, immutable rules – that govern the system in question. The second part of the quote is a reformulation of the principle of exponential sensitivity to initial conditions. If you don’t know the present exactly, but only approximately well, you cannot expect to determine the future even approximately well, rather not at all. If you knew the present with infinite accuracy and had infinite computing power, you could in theory determine the future. Still, what appears to be random has some order even in the long run – the system stays on a strange attractor, which provides some sort of predictability by itself. But this has nothing to do with football, let alone with what Chase wrote in the article, which is about the importance of having a test group.
                Football and weather are stochastic, or noisy, or whatever, systems. But not chaotic. Yes, I know that it says ‘Chaotic behavior can be observed in many natural systems, such as weather’ in wikipedia right below the Lorenz quote. The important part is the ‘in’. That’s why I brought up the example of the Lorenz equations in my previous comment. They are a crude model of a tiny subpart of ‘the weather’, and they show chaotic behaviour.
                To summarize: I’m not arguing that football and weather are predictable. I’m arguing that they are not chaotic systems.

                • Neil

                  “The important part is the ‘in’”. So what percentage of the weather system is chaotic? 50%? Would not 1% be adequate for the classification? That 1% will confound the greatest minds for possibly aeons…

                  What he means by “when the present determines the future” is what I said here:

                  “Surely, on some level it’s comprehensible in that one very, very complex thing is always happening to the exclusion of all other possibilities”

                  That phrase applies equally to what the system does and how it changes what it does (the change might be zero; I don’t just assume this by asserting the presence of a “strange attractor”). Yes, something is happening, but can humans understand it? I don’t think this is a given even given infinite computing power. To use computing power you have to be able to mathematically abstract the system, which we can’t just assume is possible no matter how many massive brains divert themselves to the task. Given the quality of climate models, weather predictions, SackSEER (I had to look up the capitalisation, what dicks) and football video games, I think we have to strongly consider the chaotic possibility.

                  It seems to me your assumption of a “strange attractor” that “provides some sort of predictability by itself” “in the long run” is entirely taken on faith. Know any immortals who can shed light on these systems from an infinite timescale? I sure wish I did. Something like a strange attractor might exist, but we don’t know it does. And in the meantime football completely defies even the smartest people’s predictions about who wins a game; the weather…watch the weather channel try to predict this week. Sure, the weather channel can get pretty close, and NFL 2K5 looks like football, but what did Lorenz say again? Maybe our massive brains will solve these riddles and find our strange holy grail, but I wouldn’t take that outcome for granted.

                  Anyway, if you still don’t see how weather in particular at least could be a chaotic system, I don’t think we have anything else to talk about. You may as well believe in Jesus while I believe in Pan. Pan tells me Jesus was just a hippy, man.

                  I thank Chase for tolerating this. Somewhere in here I did have a rebuttal of your post. My ultimate verdict is your data-splitting principle accomplishes nothing because it doesn’t solve the problem of trying to infer the future from approximations of the past. I think this means any sort of statistical analysis that masquerades as anything more than a description of the past is at best a guess without a strong line of deductive reasoning to back it up. Sorry!

                  • bengt

                    Oooops, forgot to properly reply… Please take a look at the bottom of the thread.

  • I do find the timing rather funny that you come out with this article a day after my favorite podcast spends an entire segment explaining the problems of fishing expeditions in medical research. 🙂

    Excellent, excellent work, as usual.

  • Kibbles

    So basically… correlation does not imply causation?

    • Bowl Game Amonaly

      It would be more accurate to say that past correlation does not imply future correlation, if you found the correlation through data mining rather than through hypothesis testing.

  • Tim Truemper

    Going out on a limb here: 1) Nate Silver makes this point repeatedly in The Signal and the Noise: you have to have a theoretical explanation for your findings. There are many statistically found relationships but they may not have any meaning. 2) Correlation can imply causation. It just doesn’t prove causation. 3) Not sure why statistics are meaningless and/or fraudulent. I’m no statistics expert but I took 2 graduate level stat courses and some test and measurement theory coursework as well. The thrust of quality statistics work is to know your limitations. Quantification is a guide, not the end all be all. But numbers provide a level of analysis that the trained observer may not be able to meet. Plus, trained observers (let’s say scouts examining a player) can come up with widely different opinions. The popular account of these notions, Moneyball, supports the value of quantification as a helpful, nay vital, tool. And yes, I read “How to Lie with Statistics”; I’ve got a 1971 print edition sitting on my shelf. The title of that book (and Mark Twain’s pithy quote) are overcited as a reason to dismiss statistical analysis. Ok Chase, you can cut me off now.

  • arayeph

    Speaking loosely in terms of advanced football analytics, I have to thank the online community for creating such sophisticated predictive models and useful stats during the season and offseason. Not in terms of fantasy football, but in terms of which teams are the best and have the best chances of winning the Super Bowl. You don’t get this type of coverage in the regular media like ESPN, NFL Network or Yahoo Sports sites. It’s very telling and sometimes may not be entirely accurate, but it gives a much better picture of what to look for when teams play each other and who really is good compared to what “X” talking head on TV thinks or whatever is posted on any NBC Universal, Disney or Gawker Media supported web sites/entity.

  • bengt

    “Surely, on some level it’s comprehensible in that one very, very complex thing is always happening to the exclusion of all other possibilities”
    I have no idea what that means, but I’m sure it doesn’t mean that chaotic systems are deterministic, which is what Edward Lorenz said (and with good reason).
    Are you trying to play devil’s advocate with me? Or are you mad at me because I corrected you?
    I have politely pointed out that your usage of the term ‘chaotic system’ is wrong and inadequate, because you appointed an incorrect attribute (‘no immutable rules’) and gave incorrect examples. Correcting it was a hint for you to give you a chance to avoid embarrassing yourself when discussing this topic in the future. By using nomenclature like ‘scientifically falsifiable’ and ‘chaotic systems’ you gave the impression that you had the intention to make a solid, rigorous statement. In the following discussion you appeared to be genuinely concerned about a proper description of the topic.
    What you did was the equivalent of calling a muffed punt a fumble. Not a big deal when you are watching at home with your family, but if you’re Phil Simms on national TV people will go apeshit. I’m sure you’ll admit that it largely enhances one’s credibility if he/she calls a muffed punt a muff.
    So, again, I propose to be more careful with the usage of terms here. You should say ‘noisy’ or something similar here instead of ‘chaotic’.
    I rarely comment on this site since I am in a different timezone and usually cannot provide anything new or original, but in this case I could, so I took the liberty to attempt to provide some new, useful, and educative information. I’m sorry that it hasn’t worked out.

    • Neil

      You’re just repeating the same points in the same way going on three or so times now. I’m not sure if you’re not reading what I wrote or if our backgrounds are just too different for my perspective to make sense to you. Quibbling with someone’s diction instead of earnestly trying to understand what they’re saying isn’t the move of someone who wants to have a debate in good faith though. It’s the move of someone who just needs to be right and will get really upset if that doesn’t happen. I’m sorry it hasn’t worked out.

      • bengt

        “You’re just repeating the same points in the same way going on three or so times now.”
        That’s because they’re the only points I want to make, and you still don’t get them. Instead you now pull out the strawman of calling me dogmatic and close minded.
        “Quibbling with someone’s diction instead of earnestly trying to understand what they’re saying isn’t the move of someone who wants to have a debate in good faith though.”
        Believe me, I made every effort on my part to avoid having to quibble. Like, explaining in my first comment why you were wrong. It would have suited you well to think to yourself something along the lines of ‘This guy seems to have some professional knowledge in this field, maybe he is on to something and I should question myself’.
        Trying to understand what you’re saying leads one to believe that first and foremost you don’t understand what others are saying. To wit:
        “Of course there are immutable rules for chaotic systems, that’s why they are being studied” (my quote) means that chaotic systems are being studied because there are immutable rules governing them. You understood it the other way around and taunted me with an accusation of circular reasoning.
        “Chase’s article is connected to this because he says he has found a formula for making statistic inferences have predictive value.” and your continuing usage of the February 14th ‘formula’ is off-topic. Chase says that the practice of data snooping is a bad one. He is obviously using the ‘formula’ tongue-in-cheek, he’s ridiculing bad researchers who are in danger of coming up with such nonsense because they are data snooping. He even writes “Just because history tells us that those dates are highly correlated with success … doesn’t mean that there is any predictive value in that piece of information”, which is the opposite of what you understood.
        “It seems to me your assumption of a “strange attractor” … is entirely taken on faith.” I do not assume them. Strange attractors do exist. They are precisely defined mathematical objects. They can be found in experimental data and in equations (I did both). They are a very good sign of a chaotic system. I described a method to look for them in a football-related data set. If you want to prove your claim that football is a chaotic system, the burden is on you to successfully perform that task.
        “A phenomenon like football is too complex to be rendered mathematically”. Here you are arguing that football is *not* a chaotic system.
        And frankly, I don’t want to debate someone who writes
        “Statistics are at best misleading and at worst fraudulent for most purposes”.
        on a website doing nothing but statistic evaluation of football. That is scathing, impudent and derogatory. Similarly, your ‘verdict’ is way off-base, too.
        “It’s the move of someone who just needs to be right”
        Yes, that must be it.
        I need a holiday now. I’ll leave the last word to you.