As you know, I love fact-based marketing. One of the main ways marketers gather facts is by looking at the statistics of customers’ behaviors. Oftentimes that works great. A/B split testing is a perfect example of statistics-based fact gathering that works really well. You can determine which price point makes you more money (to use my earlier example) and count that knowledge as highly reliable, even if you don’t understand why it’s the case.
Another fact-gathering method is data mining. Essentially that means taking an established business, community, or crowd of any sort and looking for patterns and trends in its behavior. Marketers often data mine by looking at the customer base or transaction records for a certain period of time and trying to discern markers or behaviors that correlate directly with the behaviors they ultimately want. For example, customers with one set of characteristics are more likely to buy an upgrade than customers with a different set of characteristics. Marketing and sales managers often view these correlative discoveries as gold strikes that will yield great dividends for the business.
Certainly data mining is a key tool in fact-based marketing, but it’s laced with danger. For this series I’m going to focus on one of these dangers: confusing correlation with causation. I watch lots of business people engage in data mining exercises, and yet I often find that I’m the only one in the room who is alert to the difference between the two, even though the difference is of earth-shattering importance. Let’s start with definitions.
- Causation: The relationship between a pair of events or circumstances in which one of them is the direct creator (or cause) of the other.
- Correlation: The relationship between a pair of events or circumstances in which the presence of one affects the likelihood of the presence of the other. When two circumstances are less likely to occur together than they would be if they were unrelated, we tend to refer to it as negative correlation.
The difference is subtle but extremely important. In a causal relationship, event A brings about event B (or makes it more likely). In a purely correlative relationship, event A does not necessarily bring about event B. Perhaps event B brings about event A. Or perhaps both are brought about by a separate event C.
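To make that last case concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the probabilities are made-up numbers and the names (`P_C`, `P_A_GIVEN_C`, `P_B_GIVEN_C`, `joint`, `prob`) are mine, chosen only for illustration. It assumes a hidden event C that makes both A and B more likely, while A and B have no effect on each other at all.

```python
from fractions import Fraction
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_C = Fraction(1, 2)  # chance that the hidden cause C is present
P_A_GIVEN_C = {True: Fraction(4, 5), False: Fraction(1, 5)}  # C makes A likely
P_B_GIVEN_C = {True: Fraction(4, 5), False: Fraction(1, 5)}  # C makes B likely


def joint():
    """Yield (c, a, b, probability) for every combination of outcomes.

    A and B each depend only on C, never on each other.
    """
    for c, a, b in product([True, False], repeat=3):
        p_c = P_C if c else 1 - P_C
        p_a = P_A_GIVEN_C[c] if a else 1 - P_A_GIVEN_C[c]
        p_b = P_B_GIVEN_C[c] if b else 1 - P_B_GIVEN_C[c]
        yield c, a, b, p_c * p_a * p_b


def prob(event):
    """Exact probability that event(c, a, b) is true."""
    return sum(p for c, a, b, p in joint() if event(c, a, b))


# Chance of B overall, and chance of B among cases where A happened.
p_b = prob(lambda c, a, b: b)
p_b_given_a = prob(lambda c, a, b: a and b) / prob(lambda c, a, b: a)

# The same comparison once we already know C is present.
p_b_given_c = prob(lambda c, a, b: c and b) / prob(lambda c, a, b: c)
p_b_given_a_and_c = (
    prob(lambda c, a, b: c and a and b) / prob(lambda c, a, b: c and a)
)

print("P(B)        =", p_b)
print("P(B | A)    =", p_b_given_a)
print("P(B | C)    =", p_b_given_c)
print("P(B | A, C) =", p_b_given_a_and_c)
```

Under these assumptions, seeing A raises the chance of B from 1/2 to 17/25, so A and B are correlated. But once you know C is present, seeing A changes nothing: the chance of B is 4/5 either way. The correlation is real, yet pushing on A would do nothing to B.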
As it happens, causal relations are always correlative relationships. Causation is a subset of correlation. It also happens that, unless you believe in magic, correlative relationships imply some kind of causal relationship somewhere. It just might not be the one we’re looking at now.
Example: In our house we have two dogs, Bruce and Max. Bruce and Max tend to run around on all fours, while my wife and I tend to walk around on two feet. Likewise, my wife and I have well-developed frontal lobes, while the dogs’ foreheads are pretty much full of skull. If we looked at large numbers of dogs and humans, we would see these two patterns occur consistently. Therefore we can conclude that, in a population of canine and human adults, there is a correlation between possessing a well-developed frontal lobe and walking around on two feet. However, anyone with even a small knowledge of biology will know that the frontal lobe does not cause walking on two feet, nor does walking on four legs somehow eliminate the frontal lobe. Instead we know there are other causes of both these traits, ultimately going back to a root cause, which in this case is the difference in DNA between the two species.
In the above example, the answer seems pretty obvious. But I routinely watch business people mine their data and then jump to conclusions without ever asking whether the correlation they found is actually causal. Next time, in part 2, a (conjectured) real-world example.