Tuesday, July 08, 2008

Is The Scientific Method Obsolete?


"All models are wrong, but some are useful." -- George Box, Statistician
Chris Anderson, writing in Wired Magazine, on the rise of the "Petabyte Era" and the end of theory:

Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn't just more. More is different.

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete... Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Read the whole thing.

While there is some merit and much to ponder in this article, I find myself skeptical of some of its conclusions. In the article, Chris Anderson is himself advancing a theory about what trends in data storage and analysis mean for science. He is relying on unarticulated mental models to arrive at his own interpretation of these trends. It is certainly possible to find many correlations in data, and they often confirm our own biases, particularly in the social sciences.
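To make that last point concrete, here is a minimal sketch in Python of what happens when you screen a large pile of variables for correlations with no hypothesis in hand. Everything in it is an illustrative assumption of mine, not something from Anderson's article: the function names, the 50 observations, the 1,000 candidate variables, and the 0.28 cutoff (roughly the conventional 5% significance threshold for a sample of that size). All of the data are pure noise, so any "pattern" the screen turns up is a coincidence.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def screen_noise(n_samples=50, n_candidates=1000, threshold=0.28, seed=0):
    """Correlate one random target against many unrelated random variables.

    Returns the number of candidates whose |r| exceeds the threshold and
    the strongest |r| seen. All parameters are hypothetical choices.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    strongest = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise: there is no real relationship.
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson(target, candidate)
        if abs(r) > threshold:
            hits += 1
        strongest = max(strongest, abs(r))
    return hits, strongest


hits, strongest = screen_noise()
print(f"{hits} of 1000 noise variables passed the screen; strongest |r| = {strongest:.2f}")
```

With a cutoff like this, about 5% of the candidates would be expected to pass by chance alone, and the count only grows as more variables are added. Without some model of which relationships are plausible, there is no principled way to tell those coincidences apart from real effects.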

To quote the late Milton Friedman:
"The cost of regressions has been falling while the cost of thinking has remained constant. Think about the substitution effect!"
In the physical sciences, our data are only as good as our instrumentation. More broadly, our data are only as good as our measurements. We face both technological and methodological challenges in data collection that limit what we can measure and how accurately we can measure it. These limits may skew the data (and any theory-free conclusions drawn from them) tremendously. This is particularly true in fields such as medicine and economics, where it is difficult or impossible (and often morally reprehensible) to conduct experiments that would yield conclusive results about cause-and-effect relationships. Confirmation bias may still rule the day in the way programs are written and data analysis is done.

Additionally, theory often points towards what data to monitor. Since we cannot place sensors everywhere for everything, theories give researchers ideas for what to look for in the data. Researchers may often find other relationships along the way, but they wouldn't stumble onto those if they weren't already looking in that direction for something else.

While I think the theory that theory is dead is an overstatement, it is based on some fascinating trends and changes in data analysis. Theory may not yet be dead, but Anderson's article is well worth reading in its entirety and gives one pause to think about how computing may further revolutionize the manner in which much research is done.

Here are some areas of research which are changing due to the availability of massive amounts of data:
More thoughts from Andrew Gelman.

2 comments:

thinking said...

I agree with you, Dr Bri... large-scale data analysis will be another tool to be used in the sciences, not necessarily a wholesale replacement of "theory" or the scientific method.
No doubt this will bring many major breakthroughs and insights.

I find the problem with articles like these in popular magazines like "Wired" is that they tend to oversimplify and exaggerate to create sales. I think of all of the very hyperbolic articles in Wired during the height of the tech bubble and laugh.

I remember all the articles about the new rules for the new economy, and of course, when the tech bubble came crashing down, it showed that some of the rules had not changed at all.

Wired is a great magazine, fun to read, but like all popular magazines, it tends to hype things up more than is justified.

Aaron Blaisdell said...

Anything that claims to be the greatest thing since sliced bread probably ain't. Although the article makes some excellent points about the value of the new data-driven approaches allowed by the collection, storage, and sifting through massive amounts of data, this type of tool is not useful to every branch of science, nor does it displace theory.
My area of research is animal cognition. I find it difficult to draw conclusions about how rats form mental maps or engage in causal reasoning without carefully crafted experiments designed to test hypotheses. In fact, the fields of computer science, statistics, philosophy of science, and psychology have been going through a revolution of sorts centered around the idea that interventions (e.g., experimental manipulations) give us strong insight into the causal models in our world (solving Hume's dilemma of causality being real but only indirectly observable). This new (but old) framework only reaffirms Popperian science.