Nathaniel Ward

When data analysis goes wrong

Writing in Wired, Gary Smith explains the dangers of mindless data mining:

The Feynman trap—ransacking data for patterns without any preconceived idea of what one is looking for—is the Achilles heel of studies based on data mining. Finding something unusual or surprising after it has already occurred is neither unusual nor surprising. Patterns are sure to be found, and are likely to be misleading, absurd, or worse.

This approach can generate spurious correlations: patterns that really do appear in the data but mean nothing. More from Smith:

In 2011, Google created an artificial intelligence program called Google Flu that used search queries to predict flu outbreaks. Google’s data-mining program looked at 50 million search queries and identified the 45 that were the most closely correlated with the incidence of flu. It’s yet another example of the data-mining trap: A valid study would specify the keywords in advance. After issuing its report, Google Flu overestimated the number of flu cases for 100 of the next 108 weeks, by an average of nearly 100 percent. Google Flu no longer makes flu predictions.
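The selection effect Smith describes is easy to reproduce with pure noise. The sketch below is a toy simulation, not Google's method: it generates a random "target" series, mines a few thousand equally random candidate series for the one that correlates best with the target over a training window, and then checks that winner on held-out data. The function names, sample sizes, and seed are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mine_noise(n_candidates=2000, n_train=30, n_test=30, seed=0):
    """Pick the noise series that best 'predicts' a noise target.

    Returns (in-sample correlation, out-of-sample correlation) for the
    winning candidate. Every series is independent Gaussian noise, so any
    correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(total)]

    best_r, best_series = 0.0, None
    for _ in range(n_candidates):
        series = [rng.gauss(0, 1) for _ in range(total)]
        # Score each candidate on the training window only.
        r = pearson(series[:n_train], target[:n_train])
        if abs(r) > abs(best_r):
            best_r, best_series = r, series

    # Evaluate the mined "predictor" on data it was not selected on.
    test_r = pearson(best_series[n_train:], target[n_train:])
    return best_r, test_r


if __name__ == "__main__":
    train_r, test_r = mine_noise()
    print(f"in-sample r of best candidate: {train_r:+.2f}")
    print(f"out-of-sample r of same candidate: {test_r:+.2f}")
```

With enough candidates, the best in-sample correlation tends to look impressive even though nothing real is there, and it typically shrinks toward zero on the held-out window. Exact values depend on the seed and sample sizes.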

A better approach to data analysis, one more likely to return useful insights, starts with questions to answer and hypotheses to test.