Data snooping is the catch-all term for misusing data mining techniques. There are perfectly legitimate uses of data mining, but data snooping is a big ‘no-no’ for the legitimate statistician. If a researcher does not formulate a hypothesis before looking at the data, but instead uses the data to suggest what the hypothesis should be, then he or she is data snooping.
I’m guilty of data snooping, but (hopefully) only in a tongue-in-cheek fashion. When I said Reggie Wayne was much better against blue teams than other opponents, that was data snooping. We’ve all been taught that history repeats itself; in statistical analysis, that translates to “if the evidence indicates a strong relationship in the past, then it is likely to continue in the future.” For example, history tells us that first round picks will perform better, on average, than sixth round picks. That’s both what the data suggest and an accurate statement.
But what happens when the data suggest that being born on February 14th or February 15th means a player is more likely to be a great quarterback? After all, the numbers tell us that 14% of all the NFL’s 31,000-yard passers were born on one of those two days, which only account for 0.6% of the days of the year. Just because history tells us that those dates are highly correlated with success — and the p-value would surely be very impressive — doesn’t mean that there is any predictive value in that piece of information.
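The birthday fluke is a multiple-comparisons effect: if you scan every pair of calendar days after the fact, some pair will always look loaded. Here is a minimal simulation of my own (the player count and all names are made up for illustration; nothing here is from the original post):

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical: 60 great passers, each with a uniformly random
# birthday expressed as a day-of-year from 1 to 365.
n_players = 60
birthdays = [random.randint(1, 365) for _ in range(n_players)]

# After the fact, scan every pair of consecutive days for the most
# "loaded" pair -- this is the data snooping step.
counts = Counter(birthdays)
pair_share = {d: (counts[d] + counts[d % 365 + 1]) / n_players
              for d in range(1, 366)}
best_day = max(pair_share, key=pair_share.get)

# Any two specific days cover only 2/365 (about 0.5%) of the calendar,
# yet the best-looking pair, found by searching, usually holds far more
# than 0.5% of the players -- purely by chance.
print(best_day, pair_share[best_day])
```

Because the pair was chosen *after* seeing the data, its impressive share (and any p-value computed from it) says nothing about future quarterbacks.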
In rebuking a different football hypothesis, Maurile Tremblay once wrote:
Such hypotheses are typically formed using all the data currently available – which means that there are no fresh data left to test them on. It is a fundamental rule of hypothesis-testing that, whenever possible, you should not use the same data to both formulate and test your hypothesis. A short example will illustrate why this is so. Suppose I roll a six-sided die 100 times and analyze the results. I will be able to find many patterns in the results of those 100 rolls. I may find, for example, that a three was followed by a six 40% of the time, or that a one was never followed by a six. Would you trust any such patterns to hold true over the next hundred rolls? You shouldn’t. If they do, it would just be coincidence. It is easy to find patterns by looking for them in a given set of data; but the test of whether those patterns are meaningful is whether they hold true in data that have not yet been examined.
I bolded that last sentence because it’s the most important one in this post. This concept is obvious when we talk about dice or birthdays or jersey color. We know there’s randomness there. But when we talk about football, rationality seems to go out the window. If you are going to use all of your data when running a study, you must formulate your hypothesis beforehand. If you simply mine through your data looking for whatever variables correlate with success, then you’re data snooping. If you want to present your work as data snooping, that’s fine. But once you’ve mined through all your data, you have no data left on which you can properly test your theory.
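Tremblay’s dice example is easy to reproduce. This sketch of mine mines 100 rolls for the strongest “a was followed by b” pattern, then checks that pattern against 100 fresh rolls (the function name and setup are my own, not from the quote):

```python
import random

random.seed(1)

def follow_rate(rolls, a, b):
    """Fraction of times a roll of `a` was immediately followed by `b`."""
    after_a = [y for x, y in zip(rolls, rolls[1:]) if x == a]
    return sum(1 for y in after_a if y == b) / len(after_a) if after_a else 0.0

# "Mine" 100 rolls: search all 36 (a, b) pairs for the strongest pattern.
mined = [random.randint(1, 6) for _ in range(100)]
best = max(((a, b) for a in range(1, 7) for b in range(1, 7)),
           key=lambda ab: follow_rate(mined, *ab))

# Now test the mined "pattern" on 100 fresh rolls it has never seen.
fresh = [random.randint(1, 6) for _ in range(100)]
print("mined data:", follow_rate(mined, *best))  # looks impressive
print("fresh data:", follow_rate(fresh, *best))  # typically back near 1/6
```

The pattern that looked strongest in the mined rolls is just the luckiest of 36 candidates; on data that played no part in selecting it, its edge tends to evaporate.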
So what can you do if you want to data snoop? One option is to split the data into two groups: mine the first group for patterns, then test whatever hypothesis emerges on the untouched second group. This works well if you want to data snoop first and ask questions later. Another option is to data snoop and then present the article as one of trivia. A third option is to critically analyze any relationship you discover without regard to p-values. Any football theory should be able to hold water without regard to significance tests. You should be able to explain your theory to a casual fan, along with its possible drawbacks and criticisms.
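The first option above, snoop on one half, test on the other, amounts to a simple holdout split. A minimal sketch, with a hypothetical helper name of my own choosing:

```python
import random

def split_for_snooping(records, holdout_frac=0.5, seed=0):
    """Shuffle and split records into an exploration set (snoop freely)
    and a holdout set (untouched until you have a specific hypothesis)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

# e.g. split season years: mine `explore`, then test only on `holdout`.
explore, holdout = split_for_snooping(range(2000, 2024))
```

The discipline is entirely in your hands: peek at the holdout even once while forming the hypothesis and you are back to testing on the same data you mined.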
What you can’t do is fall back on significance tests (that’s the Wyatt Earp problem), or say, “well, even if you find this hard to believe, that’s what the numbers say!” Finding patterns is easy; finding meaningful relationships is hard.