Re-Post: Data Snooping

In lieu of weekend trivia, I am going to begin reposting old articles that I think people would still find relevant. If you’re a new reader, I hope you enjoy; if you’re an old-timer, my hunch is you’ll still get something new out of reading these again, just like I did. Today’s post is on Data Snooping, originally posted by me in June 2013.

Reggie Wayne dominates when seeing blue.

Over the last few years, the football analytics movement has made tremendous progress. There are many really smart people with access to a large amount of useful information who have helped pioneer the use of statistics and analytics in football. Major news organizations and NFL teams seem to be embracing this movement, too. Unfortunately, there are some less-than-desirable side effects as the reward for presenting “statistical information” seems larger than ever.

Data snooping is the catch-all term used to describe a misuse of data mining techniques. There are perfectly legitimate uses of data mining, but data snooping is a big ‘no-no’ for the legitimate statistician. If the researcher does not formulate a hypothesis before looking at the data, but instead uses the data to suggest what the hypothesis should be, then he or she is data snooping.

I’m guilty of data snooping, but (hopefully) only in a tongue-in-cheek fashion. When I said a couple of years ago that Reggie Wayne was much better against blue teams than other opponents, that was data snooping. It was a factually accurate statement given the sample studied, but there was no reason to think it would repeat itself over a larger sample.

The reason data snooping can be tricky is that we like to think that, on a general level, history repeats itself. When performing statistical analysis, that translates to “if the evidence indicates a strong relationship in the past, then it is likely to continue in the future.” For example, history tells us that first round picks will perform better, on average, than sixth round picks. That’s both what the data suggest and an accurate statement.

But what happens when the data suggest that being born on February 14th or February 15th means a player is more likely to be a great quarterback? After all, the numbers tell us that 14% of all the NFL’s 30,000-yard passers were born on one of those two days, which account for only about 0.5% of the days in the year. Just because history tells us that those dates are highly correlated with success — and the p-value would surely be very impressive — doesn’t mean that there is any predictive value in that piece of information.
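The birthday coincidence is a classic multiple-comparisons trap, and it is easy to reproduce with purely random data. Here is a minimal sketch — the players, traits, and rates below are all invented for illustration, with no real football data involved. Scan enough meaningless traits and one of them will always look like a marker of success:

```python
import random

random.seed(1)

# Hypothetical setup: 300 "players", each randomly a success or not,
# and 200 meaningless binary traits (like birth dates or jersey colors).
n_players, n_traits = 300, 200
success = [random.random() < 0.1 for _ in range(n_players)]

best_rate, best_trait = 0.0, None
for trait in range(n_traits):
    has_trait = [random.random() < 0.05 for _ in range(n_players)]
    carriers = sum(has_trait)
    hits = sum(1 for h, s in zip(has_trait, success) if h and s)
    if carriers >= 5:  # ignore traits almost nobody has
        rate = hits / carriers
        if rate > best_rate:
            best_rate, best_trait = rate, trait

base_rate = sum(success) / n_players
print(f"base success rate: {base_rate:.2f}")
print(f"best-looking trait #{best_trait} success rate: {best_rate:.2f}")
```

Every trait here is generated independently of success, yet the best of the 200 will beat the base rate comfortably — exactly the kind of "pattern" a snooping researcher would stumble onto.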

In rebuking a different football hypothesis, Maurile Tremblay once wrote:

Such hypotheses are typically formed using all the data currently available – which means that there are no fresh data left to test them on. It is a fundamental rule of hypothesis-testing that, whenever possible, you should not use the same data to both formulate and test your hypothesis. A short example will illustrate why this is so. Suppose I roll a six-sided die 100 times and analyze the results. I will be able to find many patterns in the results of those 100 rolls. I may find, for example, that a three was followed by a six 40% of the time, or that a one was never followed by a six. Would you trust any such patterns to hold true over the next hundred rolls? You shouldn’t. If they do, it would just be coincidence. **It is easy to find patterns by looking for them in a given set of data; but the test of whether those patterns are meaningful is whether they hold true in data that have not yet been examined.**
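Tremblay’s die example is easy to run for yourself. A minimal sketch (the seed and the pattern search are my own illustration, not from his post): mine 100 rolls for the most striking “X was followed by Y” pattern, then check whether it holds up on 100 fresh rolls.

```python
import random

random.seed(0)

def follow_rate(rolls, a, b):
    """Fraction of the time value `a` was immediately followed by `b`."""
    pairs = list(zip(rolls, rolls[1:]))
    after_a = [y for x, y in pairs if x == a]
    return after_a.count(b) / len(after_a) if after_a else 0.0

first = [random.randint(1, 6) for _ in range(100)]
second = [random.randint(1, 6) for _ in range(100)]

# Mine the first 100 rolls for the most striking "a followed by b" pattern...
a, b, rate = max(
    ((a, b, follow_rate(first, a, b)) for a in range(1, 7) for b in range(1, 7)),
    key=lambda t: t[2],
)
print(f"mined pattern: {a} followed by {b} {rate:.0%} of the time")
# ...then test it on fresh rolls, where there is no reason to expect it to hold.
print(f"same pattern in the next 100 rolls: {follow_rate(second, a, b):.0%}")
```

The mined rate is guaranteed to look better than the fair-die baseline of 1/6 — that is simply what picking the best of 36 candidate patterns does — while the fresh rolls owe that pattern nothing.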

I bolded that last sentence because it’s the most important one in this post. This concept is obvious when we talk about dice or birthdays or jersey colors. We know there’s randomness there. But when we talk about football, and a less absurd pattern, it’s much easier to be confused. If you, the researcher, plan on using all of your data when running a study, you must formulate your hypothesis beforehand. If you simply want to look for what variables correlate with success after mining through your data, then you’re data snooping. If you want to present yourself as data snooping, then that’s fine. But once you’ve mined through all your data, you have no data left on which you can properly test the repeatability of your theory.

So what can you do if you want to data snoop? One option is to split the data into two groups: this works well if you want to data snoop first and ask questions later. Another option is to data snoop and then present the article as one of trivia. A third option is to critically analyze any relationship you discover without regard to p-values. Any football theory should be able to hold water without regard to significance tests. You should be able to explain your theory to a casual fan, and what the possible drawbacks and criticisms of your theory are, too.

What you can’t do is fall back on significance tests — that’s the Wyatt Earp problem. Or say “well even if you find this hard to believe, that’s what the numbers say!” Finding patterns is easy; finding meaningful relationships is hard.