Regression and data modeling: a chess game with nature

We can never know the whole truth of inscrutable Nature; we can only test our postulates against data and improve them over time. But there are pitfalls. Learn strategic thinking!

The Dialogue


I postulate that $w = w_0$, a constant, everywhere and at all times. Data can be used to test this postulate.


Ha ha, foolish mortal, your Error is very large! It is the difference between my mysteries and your puny imagination.

$\epsilon = w - w_0$

You must swallow that error, with each bite of it weighted by its greatness of size, whether negative or positive. How humiliating for you.

$\epsilon^2 = (w - w_0)^2$


Yes, for my folly I will swallow the square of my admittedly great errors. But I will minimize the pain by choosing $w_0$ to minimize that squared error. The condition of that minimization is:

$\frac{d}{d w_0}\overline{\epsilon^2} = 0$

By substituting $w = \overline{w} + w'$, where $\overline{w}$ is the "mean" I earlier defined for my convenience (along with median and mode and numerous other possibly-meaningless statistics), I can make clear the answer:

$\frac{d}{d w_0} \overline{\epsilon^2} = \frac{d}{d w_0} \overline{(\overline{w} + w' - w_0)^2} =0 $

Now keeping only the terms that have $w_0$ in them (so their derivative will be nonzero) in expanding the square,

$\frac{d}{d w_0}\left( -2\overline{w}w_0 - 2\overline{w'w_0} + {w_0}^2 \right) =0$

Since the middle term is zero by definition of mean and anomaly, the solution is

$ -2\overline{w} + 2w_0 = 0$

or obviously $w_0 = \overline{w}$. This guess minimizes squared error among all possible constants I could choose as a (yes, too simplistic) model for w.
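The same result can be checked numerically. What follows is only a minimal sketch on made-up data: the synthetic samples, the seed, and the helper name `mse_constant` are assumptions for illustration, not part of the game above. It scans a grid of candidate constants and confirms that none beats the sample mean.

```python
import random

random.seed(0)

# Hypothetical synthetic samples standing in for w (not real observations)
w = [random.gauss(0.5, 2.0) for _ in range(1000)]

def mse_constant(values, w0):
    """Mean squared error of the constant model w = w0."""
    return sum((v - w0) ** 2 for v in values) / len(values)

w_bar = sum(w) / len(w)

# Candidate constants on a grid centered on the mean (step 0.01)
candidates = [w_bar + 0.01 * k for k in range(-200, 201)]
best = min(candidates, key=lambda c: mse_constant(w, c))

print("sample mean   :", w_bar)
print("best candidate:", best)
print("MSE at mean   :", mse_constant(w, w_bar))
```

Any other constant, the median included, can only match or exceed the MSE at the mean, since the MSE equals the variance plus $(w_0 - \overline{w})^2$.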


Clever, for a puny mortal with so little imagination about the true nature of w.

Next idea: Hey, hot air rises, so maybe w can be "predicted by" its association with T.

The Dialogue


I now postulate that $w = w_0 + \alpha T'$, a line in w vs. T' space. I removed the mean of T, since $\alpha \overline{T}$ would just be another constant, perhaps an interpretation (or more likely a confusion) of $w_0$.


Foolish mortal, your humiliation is still the SQUARE of the difference between my mysteries and your puny imagination.

$\epsilon^2 = (w - (w_0 + \alpha T'))^2$


I will again minimize the pain by choosing $w_0$ and $\alpha$ to minimize mean squared error (MSE). Your simple parabolic error surface still has only one minimum, so by partial derivatives I have TWO conditions to find it.

$\frac{\partial}{\partial w_0}\overline{\epsilon^2} =0 $

$\frac{\partial}{\partial \alpha}\overline{\epsilon^2} =0 $

The first condition is unchanged from our first game, because $\overline{w_0 \alpha T'} = 0$ for ANY constants $w_0,\alpha$. The mean term $w_0$ and the linear term $\alpha T'$ are orthogonal: the sum of their product vanishes. So again $w_0 = \overline{w}$, just as in our first game.
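Spelled out, as a sketch using only the definitions above, the first condition reads:

$\frac{\partial}{\partial w_0}\overline{(w - w_0 - \alpha T')^2} = -2\,\overline{(w - w_0 - \alpha T')} = -2(\overline{w} - w_0 - \alpha \overline{T'}) = 0$

and since $\overline{T'} = 0$ by construction, $\alpha$ drops out of it entirely.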


Clever move, mortal. But what about your $\alpha$?


Hmm, I shall have to think. Or just do algebra and calculus, no thinking required! My slave math neurons will do the work.

$\frac{\partial}{\partial \alpha} \overline {(\overline{w} + w' - w_0 - \alpha T')^2} =0$

We can immediately simplify since $w_0 = \overline{w}$. Ha ha, this is too easy!

$\frac{\partial}{\partial \alpha}{ \overline {(w'- \alpha T')^2} } =0 $

$\frac{\partial}{\partial \alpha}{ \overline {-2 \alpha w'T' + \alpha^2 T'T'} } =0 $


$ -2 \overline{w'T'} +2 \alpha \overline{T'T'} =0 $

so that

$\alpha = \overline{w'T'}/\overline{T'T'} = cov(w,T)/cov(T,T) = cov(T,w)/var(T) = cov(T,w)/{\sigma_T}^2 = corr(w,T){\sigma_T}{\sigma_w}/{\sigma_T}^2 = corr(w,T) \frac{\sigma_w}{\sigma_T}$
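A quick numerical check of this chain of equalities, as a minimal sketch only: the synthetic T and w series, the "true" slope of 0.2, and the helper names `mean` and `mse` are all assumptions made up for illustration. It computes $\alpha$ both as cov/var and as correlation times the ratio of standard deviations, and evaluates the MSE so that nearby parameter values can be compared.

```python
import math
import random

random.seed(1)
n = 2000

# Hypothetical synthetic data: w depends weakly on T, plus noise
T = [random.gauss(290.0, 3.0) for _ in range(n)]
w = [0.2 * (Ti - 290.0) + random.gauss(0.0, 1.0) for Ti in T]

def mean(x):
    return sum(x) / len(x)

T_bar, w_bar = mean(T), mean(w)
Tp = [Ti - T_bar for Ti in T]   # T' (anomalies)
wp = [wi - w_bar for wi in w]   # w' (anomalies)

cov_wT = mean([a * b for a, b in zip(wp, Tp)])
var_T = mean([a * a for a in Tp])
var_w = mean([a * a for a in wp])

# Slope from the minimization condition
alpha = cov_wT / var_T

# Same slope written as corr * sigma_w / sigma_T
corr = cov_wT / math.sqrt(var_T * var_w)
alpha_from_corr = corr * math.sqrt(var_w) / math.sqrt(var_T)

def mse(w0, a):
    """Mean squared error of the model w = w0 + a * T'."""
    return mean([(wi - (w0 + a * tp)) ** 2 for wi, tp in zip(w, Tp)])

print("alpha (cov/var)       :", alpha)
print("alpha (corr * sw / sT):", alpha_from_corr)
print("MSE at the minimum    :", mse(w_bar, alpha))
```

Nudging either `w_bar` or `alpha` away from these values in a call to `mse` should only raise the error, which is the single-minimum property of the parabolic error surface.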


Your units check out, mortal, and of course correlation is the measure of the relationship and must appear. It seems you can do algebra, whoopee for you.

But your imagination is still puny.

The Counterattack


Silly mortal. w and T are related because rising air cools, or warms from getting nearer the sun, not because hot air rises!

I counter-postulate that $T = \overline T + \beta w'$, a line in T vs. w space. I removed the means, since as we have learned, averaging is orthogonal to line fitting. I follow your steps, and find that $ \beta = corr(w,T) \frac{\sigma_T}{\sigma_w}$. Do you now notice that this is not the inverse of your $\alpha$? What say you to this, fool?


Your spell of confusion is strong, o wily one.

On the one hand, my postulate of T dependence, $w'=\alpha T'$, explained a fraction of var(w):

$fraction\_explained = \alpha^2 \overline{T'T'} / \overline{w'w'} =corr^2(T,w)$

with the remainder being in your inscrutable Error variance.

On the other hand, your confusion-inducing counter-postulate $T'=\beta w'$ explained the same fraction of var(T)!

$fraction\_explained = \beta^2 \overline{w'w'} / \overline{T'T'} =corr^2(w,T)$

Yet your slope is not the inverse of mine, $\alpha \neq \beta^{-1}$, as I would get from solving my postulated relationship $w'=\alpha T'$ for $T'$. I feel suddenly dizzy.

If skill in explanation is our only measure of the truth of our postulates of what "explains" what, this is a draw, I must admit.
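The draw can be seen in numbers. This is a minimal sketch on invented data: the synthetic series, the seed, and the variable names are assumptions for illustration. Here $\beta$ is computed by the same steps with the roles of w and T swapped, that is, as cov(w,T)/var(w). Both fits explain the same fraction of variance, and the product of the two slopes is $corr^2$ rather than 1.

```python
import random

random.seed(2)
n = 2000

# Hypothetical synthetic anomalies: related, but far from perfectly correlated
T = [random.gauss(0.0, 3.0) for _ in range(n)]
w = [0.2 * Ti + random.gauss(0.0, 1.0) for Ti in T]

def mean(x):
    return sum(x) / len(x)

T_bar, w_bar = mean(T), mean(w)
Tp = [Ti - T_bar for Ti in T]
wp = [wi - w_bar for wi in w]

cov_wT = mean([a * b for a, b in zip(wp, Tp)])
var_T = mean([a * a for a in Tp])
var_w = mean([a * a for a in wp])

alpha = cov_wT / var_T   # slope of w regressed on T
beta = cov_wT / var_w    # slope of T regressed on w
r2 = cov_wT ** 2 / (var_T * var_w)   # squared correlation

frac_w = alpha ** 2 * var_T / var_w  # fraction of var(w) explained by T
frac_T = beta ** 2 * var_w / var_T   # fraction of var(T) explained by w

print("alpha, 1/beta :", alpha, 1.0 / beta)
print("alpha * beta  :", alpha * beta)
print("corr^2        :", r2)
print("explained     :", frac_w, frac_T)
```

Only when the correlation is exactly $\pm 1$ does $\alpha\beta = 1$, so that each slope is the inverse of the other.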


Indeed, mortal. Mind your puny "postulates" and your storytelling notions! Correlation is not causation, for the thousandth time! Forget ye not this humor:

My workings are not to be unlocked with correlation, only glimpsed. Unless you use lagged correlation...


Lag, hmmm. Perhaps sequence is causation... Hmm...


You take hints well, mortal. Our next game will be more amusing. With multiple non-orthogonal variables I shall weave another spell of confusion for you, and laugh well.

Let's fit some lines and see why reg(T,w) and reg(w,T) differ