Dollars and Jens

Monday, May 07, 2007

causation, correlation, and statistics

Until 2004, the Washington Redskins appeared to have a tremendous ability to predict the future. In every election from 1936 up to that year, the Redskins' performance in their last home game before the election accurately predicted the party that would gain control of the White House. That's a pretty interesting piece of data mining, but as the 2004 election showed, it had no real influence on the outcome.
...

People who believed that the game could predict the outcome of the election were committing a common logical error. They believed that correlation was causation.

Correlation isn't causation; as this incident shows, it's not even correlation.
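
To put a rough number on how easy it is to dig up a "perfect" predictor by data mining, here is a minimal Python sketch. It assumes, purely for illustration, that each candidate indicator behaves like an independent fair coin flip with respect to each election, and it counts the 17 elections from 1936 through 2000. The function names are hypothetical, and real indicators would not be this tidy.

```python
import random


def chance_of_some_perfect_predictor(n_events, n_candidates):
    """Probability that at least one of n_candidates coin-flip indicators
    matches all n_events outcomes (assumes fair, independent flips)."""
    p_single = 0.5 ** n_events  # one indicator gets every event right
    return 1.0 - (1.0 - p_single) ** n_candidates


def count_perfect_predictors(n_events, n_candidates, rng):
    """Monte Carlo version: count random indicators that match every outcome."""
    outcomes = [rng.random() < 0.5 for _ in range(n_events)]
    perfect = 0
    for _ in range(n_candidates):
        # all() stops at the first miss, so most candidates are cheap to reject
        if all((rng.random() < 0.5) == outcome for outcome in outcomes):
            perfect += 1
    return perfect


if __name__ == "__main__":
    n_elections = 17  # 1936 through 2000, one every four years
    for n_candidates in (1, 1_000, 100_000):
        p = chance_of_some_perfect_predictor(n_elections, n_candidates)
        print(f"{n_candidates:>7} candidates: P(at least one perfect) = {p:.4f}")
    found = count_perfect_predictors(n_elections, 100_000, random.Random(2007))
    print("perfect predictors found in one simulated search:", found)
```

Under those assumptions a single indicator gets all 17 right with probability 1 in 131,072, but someone sifting through 100,000 candidate indicators has better than even odds of finding at least one that does.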

More often, the confusion between correlation and causation is a direct one: A is correlated with B, so A causes B. Sometimes someone will realize that, if A is correlated with B, then B is equally correlated with A, so B may well cause A. In fact, in light of Le Chatelier's principle, I've developed a tendency, when I see a correlation with a sign opposite to what I'm expecting, to look at whether the causal relationship might primarily run in the direction opposite to what I was first thinking. More commonly, though, it's noted that both could be effects of some third cause. What often gets dropped is the notion that perhaps it's coincidence.

Warren Buffett, some nearly integer number of years back, noted that in insurance you don't care about causal relationships, only about the ensuing correlations. If X and Y are positively correlated, and someone with a lot of X wants to insure against Y, you'd better charge more than you would someone without a lot of X; it doesn't matter whether some risk factor for Y causes X or whether X causes Y. He was, however, referring to situations in which there is reason to believe in a future correlation between X and Y, which may be different from whether a small sample of past events has itself exhibited such a correlation. Having an understanding of the causal relationship underlying a correlation can help one to believe that it is likely to be sustained, and is not simply coincidence. Of course, if you have oodles of statistics for a comparatively uncontrived relationship, it can make sense to accept that the correlation is real even if you can't figure out why it should be.

Indeed, with insufficient statistics you may fail to see a pattern where there is one. For example, if you follow that link, you can witness an attempt to refute a broadly plausible model of modern finance with seven data points that the model itself would expect to have far more noise than signal. There may be a (nonzero) correlation, or there may not be; with this kind of information set, it's probably safest simply to acknowledge ignorance.
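
A small simulation can illustrate how little seven noisy points say about a weak relationship. The sketch below is not a reconstruction of the data behind that link. It assumes a hypothetical true correlation of 0.3 between two normally distributed variables, draws seven points at a time, and asks how often the sample correlation comes out with the wrong sign. All names and parameter values are illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_correlations(true_r, n_points, n_trials, seed=0):
    """Draw n_trials samples of n_points each from a bivariate normal
    with correlation true_r; return the sample correlation of each."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - true_r ** 2)
    results = []
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        # y = true_r * x + independent noise gives corr(x, y) = true_r
        ys = [true_r * x + noise_scale * rng.gauss(0.0, 1.0) for x in xs]
        results.append(pearson(xs, ys))
    return results


def fraction_wrong_sign(true_r, n_points, n_trials=10_000, seed=0):
    """Fraction of samples whose correlation is zero or negative
    (meaningful as 'wrong sign' when true_r is positive)."""
    rs = sample_correlations(true_r, n_points, n_trials, seed)
    return sum(r <= 0.0 for r in rs) / n_trials


if __name__ == "__main__":
    for n_points in (7, 50, 500):
        frac = fraction_wrong_sign(0.3, n_points, n_trials=2_000)
        print(f"n = {n_points:>3}: wrong-sign fraction = {frac:.3f}")
```

With only seven points, a sizable share of samples show a zero or negative correlation even though the underlying relationship is positive; with a few hundred points, that almost never happens. Seven points that fail to show a pattern are therefore weak evidence that no pattern exists.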

- posted by dWj @ 10:55 PM