"All models are wrong, but some are useful." -- George Box, Statistician

Chris Anderson, writing in Wired Magazine on the rise of the "Petabyte Era" and the end of theory (read the whole thing):
Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn't just more. More is different.
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science -- hypothesize, model, test -- is becoming obsolete... Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
While there is some merit and much to ponder in this article, I find myself skeptical of some of the conclusions. In the article, Chris Anderson is himself advancing a theory about what trends in data storage and analysis mean for science. He is relying on unarticulated mental models to arrive at his own interpretation of these trends. It is certainly possible to find many correlations in data that confirm our own biases -- particularly in the social sciences.
To quote the late Milton Friedman:
"The cost of regressions has been falling while the cost of thinking has remained constant. Think about the substitution effect!"

In the physical sciences, our data are only as good as our instrumentation. More broadly speaking, our data are only as good as our measurements. We face both technological and methodological challenges in data collection that will create limitations on what we can and can't measure and how accurately we can do so. This may skew the data (and non-theoretical conclusions reached using it) tremendously. This is particularly true in fields such as medicine and economics where it is difficult or impossible (and often morally reprehensible) to conduct experiments that will yield conclusive results about cause-and-effect relationships. Confirmation bias may still rule the day in the way programs are written and data analysis is done.
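The worry about cheap regressions confirming our biases is easy to demonstrate. As a toy illustration (a minimal sketch of my own, using only the Python standard library; the number of series, sample size, and "strong" threshold are arbitrary choices), generate a few hundred completely unrelated random series and count how many pairs nonetheless show a sizable sample correlation:

```python
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 200 unrelated "variables", each with 30 noisy observations --
# by construction there is no real relationship between any of them.
series = [[random.gauss(0, 1) for _ in range(30)] for _ in range(200)]

# Count pairs whose sample correlation looks "strong" (|r| > 0.5),
# purely by chance, across all 200*199/2 = 19,900 pairs.
strong = [
    (i, j)
    for i in range(len(series))
    for j in range(i + 1, len(series))
    if abs(corr(series[i], series[j])) > 0.5
]
print(f"{len(strong)} 'strong' correlations among {200 * 199 // 2} random pairs")
```

Pure noise reliably yields dozens of "strong" correlations once enough pairs are screened; a researcher hunting for a preferred pattern in a big enough pile of data will usually find it.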
Additionally, theory often points towards what data to monitor. Given the constraints of not being able to place sensors everywhere for everything, theories will give researchers ideas for what to look for in the data. They may often find other relationships, but they wouldn't stumble onto those if they weren't already looking in that direction for something else.
While I think the theory that theory is dead is an overstatement, it is based on some fascinating trends and changes in data analysis. Theory may not yet be dead, but Anderson's article is well worth reading in its entirety, and it gives one pause to consider how computing may further revolutionize the manner in which much research is done.
Here are some areas of research which are changing due to the availability of massive amounts of data:
- Feeding the Masses
- Chasing the Quark
- Winning the Lawsuit
- Tracking the News
- Spotting the Hot Zones
- Sorting the World
- Watching the Skies
- Scanning Our Skeletons
- Tracking Air Fares
- Predicting the Vote
- Pricing Terrorism
- Visualizing Big Data