Torture the data long enough and it will talk.

Data science involves a lot more than numbers and math. There’s a very human side.

For instance, human brains like to force interpretations on data. It’s very useful and even fun, but it often leads us astray. This is especially true in data science, and we have to be very careful in assessing our work to make sure we’re not seeing things that aren’t there, or missing things that are. Sometimes a cigar is just a cigar.

How much data is enough?

Of course we need data – without enough of it our analyses have insufficient power.

But there can be too much data. For one, it’s expensive. And given enough of it, especially enough variables to compare against each other, we will turn up spurious correlations, often through practices like data dredging. More under “apophenia”.
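To see how easily dredging produces "findings", here is a minimal Python sketch, not a rigorous analysis. Everything in it is an assumption made for illustration: the function names `pearson` and `dredge`, the 30 rows, the 200 made-up features, and the approximate 5% significance cutoff based on the Fisher z-transform. Every column is pure random noise, so any "significant" correlation it reports is spurious by construction.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dredge(n_rows=30, n_features=200, seed=0):
    """Correlate many pure-noise features with a pure-noise target.

    Returns (hits, cutoff): how many features cleared the cutoff,
    and the cutoff itself.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    # Approximate two-sided 5% cutoff for |r| via the Fisher z-transform.
    cutoff = math.tanh(1.96 / math.sqrt(n_rows - 3))
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(feature, target)) > cutoff:
            hits += 1
    return hits, cutoff


hits, cutoff = dredge()
print(f"{hits} of 200 noise features cleared |r| > {cutoff:.2f}")
```

At a 5% threshold we should expect roughly 1 in 20 noise features to look "significant" on average, so testing more features mostly means collecting more false alarms. If you must test many things at once, a multiple-comparison correction such as Bonferroni (dividing the threshold by the number of tests) is one common guard.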

Are these things really related?

Did you hear the one about the diapers and the beer? Put them close together in the store, sell more of both! There were theories about this, and some were plausible, but is this real or just a coincidence?

There are questions like “is there a causal relationship between IQ and shoe size?” Maybe there is, but what would we do with this information? Will people start walking around in clown shoes, start binding their feet, or have surgical modifications? Whatever value such research might have, surely some questions have higher priority than others.

What does it mean?

Interpretation of results can be highly subjective. Who remembers Droodles? This image has been around for a while – what do you see? You’ve probably heard of Rorschach tests. When you look at this, do you see two stick figures dancing or something else?

How about caption contests, or creating memes? We do need to be able to generate hypotheses to test, and sometimes we need to be creative. But in data science it’s no joke – lots of money can be riding on interpretations of data.

Lots of money leads to lots of commercial pressure, with questions like “are you sure this product isn’t safe?” Think of the pressures on engineers, executives and regulators involving the Boeing 737 Max about now.

There are also political pressures that can warp interpretations. Government funding should be as suspect as any other because it is always subject to political pressures. Research that could be interpreted as favoring a particular subpopulation can get you in trouble too, no matter how convincing the evidence. Read Carl Zimmer’s recent book for some examples of how early eugenicists, who contributed much to the development of today’s statistics, also influenced governments to apply policies we don’t accept today.

Under those and other pressures, you can be tempted to “torture” your data, making it say whatever you want to hear.

In short, data scientists are human beings ultimately working for other human beings, and have to be careful in how they collect, analyze and interpret data with that in mind. Failures can be amusing, costly, or even deadly.

Author: dsnovice

Engineer, Toastmaster, healthcare analyst, data science novice, web development novice
