Adventures in econometricland

I took Oxford’s advanced undergraduate econometrics course. My experience of the course, and really of the entire field, was the following: the concepts are simple, the real challenge is making sense of notation so obfuscatory that you wonder if it’s done on purpose.

In order to arrive at this view, I went through a long and confusing journey, one I wish upon no friend. This document’s structure takes my journey in reverse order.1 I start with what I eventually pinned down as the clear mathematical facts. Once armed with this toolkit, I do my best to explain why standard notation is confusing, and attempt to guess, from context, what econometricians actually mean.

The examples I give really are of standard practice. I give quotes from a few textbooks and our lecture slides, but I promise that you will find the same thing almost everywhere. And the confusing usages are not just a convenience of notation that is readily acknowledged in conversation. When I asked people about this in person, all I got were long, confusing back-and-forths.

Contents

  1. The facts
    1. Preliminaries
    2. The CEF minimises $\sum w_i^2$
      1. Some algebraic facts
      2. Minimisation problem
    3. The LRM minimises $\sum (e_i + w_i)^2$
      1. Some algebraic facts
      2. Minimisation problem
  2. Comments
  3. The hermeneutics (I)
    1. Inconsistent hats
      1. Sample analogues?
      2. Loss function minimisers?
    2. Confusing hats
  4. Inconsistent causal language
    1. The causal claim
    2. Hermeneutics (II)
      1. Algebra or causes?
      2. Knowns or unknowns?
  5. Appendix A
  6. Appendix B
  7. Appendix C

The facts

Preliminaries

We start with a set of ordered pairs $\{ \langle X_1 , Y_1 \rangle, \langle X_2 , Y_2 \rangle, \langle X_3 , Y_3 \rangle, \dots, \langle X_n , Y_n \rangle \}$.

You can think of $X_i$ and $Y_i$ as

  • real numbers (facts about each of the $n$ individuals in the population)
  • or as random variables (probability distributions over facts about $n$ individuals in a sample);

either way, all the maths will apply equally. (I will return to this fact and comment on it.)

The CEF minimises $\sum w_i^2$

Some algebraic facts

We write the equality:

$$Y_i = f(X_i) + w_i$$

where $Y_i$ and $X_i$ are known, but $w_i$ depends on our choice of $f$.

Minimisation problem

Suppose we want to solve

$$\min_{f(X_i)} \sum w_i^2 \leftrightarrow \min_{f(X_i)} \sum(Y_i-f(X_i))^2$$

The solution is $f(X_i) = E[Y_i \mid X_i]$. The proof of this is in appendix A. Suppose we specify $f(X_i)$ as such; we then get:

$$Y_i = E[Y_i \mid X_i] + w_i$$

Now $f$ is known and $w_i$ is known (by the subtraction $w_i = Y_i - E[Y_i \mid X_i]$).
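To make this concrete, here is a minimal numerical sketch (toy data; numpy assumed; the group-wise mean of $Y$ plays the role of $E[Y_i \mid X_i]$) checking that the CEF achieves a smaller $\sum w_i^2$ than another choice of $f$, such as the unconditional mean:

```python
import numpy as np

X = np.array([1, 1, 2, 2, 2, 3])
Y = np.array([2.0, 4.0, 3.0, 5.0, 7.0, 6.0])

# The CEF evaluated at each observation: the mean of Y within that X-group.
cef = np.array([Y[X == x].mean() for x in X])
w_cef = Y - cef

# Any other candidate f, e.g. the unconditional mean of Y for every value of X.
w_other = Y - Y.mean()

print(np.sum(w_cef**2), np.sum(w_other**2))  # 10.0 17.5 -- the CEF wins
```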

The LRM minimises $\sum (e_i + w_i)^2$

Some algebraic facts

Now we write the following equality:

$$E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i + e_i$$

This says that $E[Y_i \mid X_i]$ is equal to a linear function of $X_i$ plus some number $e_i$.

We then have

$$\begin{aligned} Y_i &= E[Y_i \mid X_i] + w_i \\ &= \beta_0 + \beta_1 X_i + e_i + w_i \end{aligned}$$

As before, $w_i$ is known, whereas $e_i$ is a function of $\beta_0$ and $\beta_1$.

Here $e_i$ is the distance, for observation $i$, between the LRM and the CEF, while $w_i$ is the distance between the CEF and the actual value of $Y_i$. We can then call $u_i = e_i + w_i$ the distance between the LRM and the actual value.2

We can also see that $E[u_i \mid X_i] = 0$ is equivalent to $e_i = 0$, i.e. the CEF and the LRM occupy the same coordinates.
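A small sketch of this bookkeeping, on toy data with numpy (the CEF computed as a group-wise mean, the LRM coefficients via the closed-form solution given in the next subsection): it simply confirms that $u_i = e_i + w_i$ holds by construction.

```python
import numpy as np

X = np.array([1, 1, 2, 2, 2, 3])
Y = np.array([2.0, 4.0, 3.0, 5.0, 7.0, 6.0])

cef = np.array([Y[X == x].mean() for x in X])        # E[Y_i | X_i]

# OLS line via the closed-form solution stated below.
b1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
line = b0 + b1 * X

w = Y - cef      # actual value minus CEF
e = cef - line   # CEF minus regression line
u = Y - line     # actual value minus regression line

print(np.allclose(u, e + w))  # True: the decomposition is pure bookkeeping
```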

Minimisation problem

Suppose we want to solve

$$\min_{\beta_0, \beta_1} \sum (e_i + w_i)^2 \leftrightarrow \min_{\beta_0, \beta_1} \sum (Y_i - \beta_0 -\beta_1 X_i)^2$$

The solution is

$$\begin{aligned} \beta_0 &= \bar{y}- \beta_1\bar{x} \\ \beta_1 &= \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i- \bar{x})}{\sum_{i=1}^n (x_i- \bar{x})^2} \end{aligned}$$

I prove this in appendix B. (It’s possible to prove an analogous result in general using matrix algebra, see appendix C.)

Suppose we specify that $\beta_0$ and $\beta_1$ are equal to these solution values. Now that $\beta_0$ and $\beta_1$ are known, $e_i$ is known too (by the subtraction $e_i = E[Y_i \mid X_i] - \beta_0 - \beta_1 X_i$). As before, $w_i$ is known.

Thus, in our regression equation,

$$\begin{aligned} Y_i &= \beta_0 + \beta_1 X_i + e_i + w_i \\ &= \beta_0 + \beta_1 X_i + u_i \end{aligned}$$

all of $Y_i$, $X_i$, $\beta_0$, $\beta_1$, $e_i$ and $w_i$ (and thus $u_i$) are known.
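As a sanity check, here is a sketch (toy data; numpy assumed) comparing the closed-form $\beta_0$ and $\beta_1$ above with the coefficients returned by numpy's own least-squares fit of the same sum of squares.

```python
import numpy as np

X = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 3.0, 5.0, 7.0, 6.0])

# Closed-form solution from above.
beta1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean())**2)
beta0 = Y.mean() - beta1 * X.mean()

# numpy's degree-1 polynomial fit minimises the same sum of squares.
slope, intercept = np.polyfit(X, Y, deg=1)
print(np.allclose([beta0, beta1], [intercept, slope]))  # True
```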

Comments

Two things to note about the facts above.

  • Whether we are using real numbers or random variables does not matter for anything we’ve said so far. All we have used are the expectation and summation operators and their properties. Textbooks often warn about the important distinction between the sample and the population, but as far as these algebraic facts are concerned the difference is immaterial!
  • I have not used “hat” notation (as in $\hat{\beta}$). Instead I have described the results of optimisation procedures carefully using words, like “the solution to this minimisation problem is …”. The way standard econometrics uses the hat is a good example of obfuscatory notation.

The hermeneutics (I)

Inconsistent hats

Econometrics textbooks, within the same sentence or paragraph, routinely use the hat in two ways which seem to me to be incompatible.

Sample analogues?

In Stock and Watson, p. 158, we have Claim A:

The linear regression model is:

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

where $\beta_0 + \beta_1 X$ is the population regression line or population regression function, $\beta_0$ is the intercept of the population regression line, and $\beta_1$ is the slope of the population regression line.

And Stock and Watson, p. 163 gives Claim B:

The OLS estimators, $\hat{\beta_0}$ and $\hat{\beta_1}$, are sample counterparts of the population coefficients $\beta_0$ and $\beta_1$. Similarly, the OLS regression line $\hat{\beta_0} + \hat{\beta_1}X$ is the sample counterpart of the population regression line $\beta_0 + \beta_1 X$, and the OLS residuals $\hat{u_i}$ are sample counterparts of the population errors $u_i$.

So far so good.

Loss function minimisers?

Stock and Watson, p. 187 (Claim C):

The OLS estimators, $\hat{\beta_0}$ and $\hat{\beta_1}$, are the values of $b_0$ and $b_1$ that minimise $\sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2$.

This quote is the biggest culprit. After many conversations, I finally understood that we’re supposed to take the quote to mean:

The OLS estimators, $\hat{\beta_0}$ and $\hat{\beta_1}$, are the values of $b_0$ and $b_1$ that minimise $\sum_{i=1}^j (Y^{sample}_i - b_0 - b_1 X^{sample}_i)^2$, where $j$ is the number of observations in the sample ($j < n$ if $n$ is the population size) and $Y^{sample}_i$ and $X^{sample}_i$ are the $i$th values in the sample.

I swear, I’m not taking this quote out of context! Nowhere, in the entire textbook, would you find a clue that the $X_i$ and $Y_i$ in claim C are completely different quantities than the $X_i$ and $Y_i$ in claim A. This is criminal negligence. (I’m also not cherry-picking. My lecture notes cheerfully call $\hat{\beta_0}$ and $\hat{\beta_1}$ the ‘OLS’ solutions, and this usage is standard.)

Of course, I’m the kind of person to take claim C at face value, and combine it with claim A, to arrive at $\beta_0 = \hat{\beta_0}$ and $\beta_1 = \hat{\beta_1}$, which, I gathered from context, was not a desirable conclusion.
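To see why conflating the two readings matters, here is a sketch (a made-up “population” and a subsample of it; numpy assumed): the coefficients that minimise the sum of squares over the whole population generally differ from those that minimise it over a sample, so claims A and C cannot be about the same $X_i$ and $Y_i$ without collapsing that distinction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up "population" of 1000 pairs.
X_pop = rng.normal(size=1000)
Y_pop = 1.0 + 2.0 * X_pop + rng.normal(size=1000)

def ols(x, y):
    b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean())**2)
    return y.mean() - b1 * x.mean(), b1

# A sample of 20 observations drawn from that population.
idx = rng.choice(1000, size=20, replace=False)

print(ols(X_pop, Y_pop))            # minimises the population sum of squares
print(ols(X_pop[idx], Y_pop[idx]))  # minimises the sample sum of squares: close, not equal
```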

Confusing hats

The following is not as bad as the above, since it avoids explicit contradiction, but still sows confusion by using the hat to mean different things when put on top of different values.

Claim D, from Stock and Watson p. 163:

The predicted value of $Y_i$, given $X_i$, based on the OLS regression line, is $\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i$.

This is compatible with the loss function minimiser usage of the hat: claim C tells us that $\hat\beta_0$ and $\hat\beta_1$ are loss function minimisers; claim D then tells us that $\hat{Y_i}$ is the value obtained when you compute the values of $b_0$ and $b_1$ which minimise a loss function, and plug them into the regression function.

But, of course, this “predicted value” verbiage is incompatible with the sample analogue usage. $\hat Y_i$ can’t be both the predicted value (whether in a sample or not) and the actual value in a sample. That would imply that predictions are always perfect!

So even if we amend claim C as I’ve done above, we still can’t say that the hat is consistently used to mean sample analogue, since in the case of $\hat Y_i$ it’s apparently used to mean predicted value. (More specifically, predicted value in a sample, one guesses from context.)
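For completeness, a tiny sketch (toy data; numpy assumed) of the point: the fitted values $\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i$ generally differ from the observed $Y_i$, so “predicted value” and “sample value” cannot be the same reading of the hat.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 3.0, 5.0, 4.0])

b1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()

Y_hat = b0 + b1 * X
print(Y_hat)      # [2.3 3.1 3.9 4.7] -- the "predicted" values
print(Y - Y_hat)  # nonzero residuals, so Y_hat is not the sample value of Y
```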

Inconsistent causal language

Here is an entirely separate category of wrongdoing. In all of the above we have taken the statement

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

to be an innocuous equality: $Y_i$ is equal to the regression intercept, plus the regression slope times $X_i$, plus some remaining difference. Call this the algebraic claim.

But it turns out that the statement is sometimes used to make a completely different, and incredibly strong, causal claim. Econometricians switch between the two usages.

In keeping with the above structure, I’ll start by clearly stating the causal claim, then I’ll analyse quotes which trade on the ambiguity between the causal and algebraic claims.

The causal claim

We think of

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

not as a regression equation, but as a complete causal account of everything causally affecting $Y$. (Sometimes the equation is said to describe the data generating process, another case of dressing a big implausible claim in sheep’s clothing.) For example, if there are $\phi$ things causally affecting $Y$, we have:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 A_i + \beta_3 B_i + ... + \beta_\phi \phi_i$$

We can think of this claim as equivalent to an infinite list of counterfactuals, giving the potential values of $Y$ for every combination of values of the causal factors $X, A, B, ..., \phi$. It also makes the claim that nothing else has a causal effect on $Y$.

(If we think the world is non-deterministic, the claim becomes $Y_i = \beta_0 + \beta_1 X_i + \beta_2 A_i + \beta_3 B_i + ... + \beta_\phi \phi_i + \varepsilon_i$, where the $\varepsilon_i$ are random variables, and we have a list of counterfactuals giving the potential distributions of $Y$ for every combination of values of the causal factors.)

That’s a rather huge claim. In any realistic case, causal chains are incredibly long and entangled, so that basically everything affects everything else in some small way. So the claim often amounts to an entire causal model of the world.

Hermeneutics (II)

In the first part, I have restricted my attention to the confusions that arise when taking the algebraic interpretation as given. It’s clearly the interpretation they want you to use. Regression is a mathematical operation, “$=$” is an algebraic symbol, and so on. Phrases like “slope of the population regression line” are routinely used while no hint is ever made at any causal meaning of the claim $Y_i = \beta_0 + \beta_1 X_i + u_i$. But you’ll see below many claims which only make sense under the causal interpretation.

Algebra or causes?

Stock and Watson p. 158, claim E:

The term $u$ is the error term […]. This term contains all the other factors besides $X$ that determine the value of the dependent variable, $Y$, for a specific observation $i$.

This is a favourite trick: use a word like “determines”, which heavily implies a causal claim, but stay just shy of being unambiguously causal. That way you can always retreat to the algebraic claim. (Other favourites which I see all the time in published papers are “contributes to”, “is associated with”, “explains”, “influences”…).

Indeed, under the algebraic interpretation, claim E is puzzling. What on earth does it mean for a number to “contain” “factors” that “determine” the value of another number? As far as the mathematics is concerned, we have no concept of “determine”, much less of a number “containing” another number.

A causal variant of claim E would be:

The term $u$ is the further-causes term […]. This term contains all the other factors besides $X$ that cause the value of the dependent variable, $Y$, for a specific observation $i$.

Wooldridge, p. 92f, claim F:

Assumption MLR.4:

$$E[u \mid x_1, x_2, x_3, \dots, x_k] = 0$$

When assumption MLR.4 holds, we often say that we have exogenous explanatory variables. If $x_j$ is correlated with $u$ for any reason, then $x_j$ is said to be an endogenous explanatory variable […] Unfortunately, we will never know for sure whether the average value of the unobservables is unrelated to the explanatory variables.

Under the algebraic interpretation, MLR.4 is the claim that the conditional expectation function is exactly the regression line. (In the notation I use above, $e_i = 0$.) This is a pretty strong claim, but it has nothing to do with exogeneity. The exogeneity part of claim F only makes sense under the causal interpretation, and I suspect that in the end we are to take claim F causally. In that case, claim F uses the language of correlation (“if $x_j$ is correlated with $u$ for any reason”) to make an extremely strong causal claim. “Correlation does not imply causation” is a very good slogan which it would be beneficial to actually apply.

While claim F seems to require the causal interpretation, the phrase “error term” in claim E calls for the algebraic one. And most of the quotes from part one, such as claims A and B, which call $Y_i = \beta_0 + \beta_1 X_i + u_i$ the “population regression function”, rely on the algebraic claim.

Stock and Watson, p. 131, claim G:

The causal effect of a treatment is the expected effect on the outcome of interest of the treatment as measured in an ideal randomized controlled experiment. This effect can be expressed as the difference of two conditional expectations. Specifically, the causal effect on $Y$ of treatment level $x$ is the difference in the conditional expectations, $E[Y \mid X=x] - E[Y \mid X=0]$, where $E[Y \mid X=x]$ is the expected value of $Y$ for the treatment group (which received treatment level $X=x$) in an ideal randomized controlled experiment and $E[Y \mid X=0]$ is the expected value of $Y$ for the control group (which receives treatment level $X=0$).

Stock and Watson, p. 170, claim H:

The first of the three least squares assumptions is that the conditional distribution of $u_i$ given $X_i$ has a mean of zero. This assumption is a formal mathematical statement about the “other factors” contained in $u_i$ and asserts that these other factors are unrelated to $X_i$ in the sense that, given a value of $X_i$, the mean of the distribution of these other factors is zero.

Claim G is good because there is appropriate hedging: causal effects are the difference between conditional expectations only in an idealised RCT. An idealised RCT is the only case where the causal claim and the algebraic claim have the same meaning.

In claim H, however, the sentence “This assumption is a formal mathematical statement about the ‘other factors’ contained in $u_i$” trades on the ambiguity between the algebraic and causal claims. Mathematical statements are about sums and products, not about causality in the world. This kind of writing promotes a kind of magical thinking in which, say, the expectation operator (really just a sum) can tell us about what we would causally “expect” to see if we intervened on the world.

Knowns or unknowns?

I want to go back to a part of claim F, which I did not discuss above:

Unfortunately, we will never know for sure whether the average value of the unobservables is unrelated to the explanatory variables.

We see the same talk of unobservables in the University of Oxford Econometrics lecture slides, Michaelmas Term 2017 (claim J):

The simple regression model

$$y_i = \beta_1 x_i + \beta_2 + u_i$$

  • $y_i$ and $x_i$ are observable random scalars
  • $u_i$ is the unobservable random disturbance or error
  • $\beta_1$ and $\beta_2$ are the parameters (constants) we would like to estimate

On the causal usage, the $u_i$ are indeed practically impossible to observe. But then so are $\beta_1$ and $\beta_2$, yet these are simply called parameters, not unobservables.

But on the algebraic usage, we are presumably to take $\beta_1$ and $\beta_2$ to be loss function minimising coefficients. Then, if $y_i$ and $x_i$ are known, so are $\beta_1$ and $\beta_2$, and by a simple subtraction, $u_i$ is known too.

The same thing happens with $Y_i = E[Y_i \mid X_i] + w_i$. When that equality is first introduced, it is presented as a mere piece of algebra. If we know $Y_i$ and $X_i$ we can obviously get $w_i$ by a subtraction. Yet econometricians insist on calling $w_i$ unknown; they are laying the groundwork to hoodwink you later by switching to the causal usage.
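Here is the point as a sketch (toy data; numpy assumed), in the lecture slides’ notation: once $\beta_1$ and $\beta_2$ are taken to be the loss-minimising coefficients, $u_i$ is pinned down by subtraction, so under the algebraic reading there is nothing unobservable about it.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Loss-minimising coefficients (slide notation: beta_1 slope, beta_2 intercept).
beta1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean())**2)
beta2 = y.mean() - beta1 * x.mean()

u = y - beta1 * x - beta2  # fully determined by the data: nothing left to "observe"
print(u)
```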

Appendix A

Proof that the solution to

$$\min_{f(X_i)} E[w_i^2] \leftrightarrow \min_{f(X_i)} E[(Y_i-f(X_i))^2]$$

is $f(X_i) = E[Y_i \mid X_i]$.

$\square$ Taking the first-order condition:

$$\begin{aligned} \frac{d E[(Y_i-f(X_i))^2]}{df(X_i)} &=0 \\ E\left[\frac{d (Y_i-f(X_i))^2}{df(X_i)}\right] &=0 \\ E\left[\frac{d(Y_i^2 -2Y_if(X_i)+f(X_i)^2)}{df(X_i)}\right] &= 0 \\ E[-2Y_i+2f(X_i)] &=0 \\ E[-2Y_i+2f(X_i) \mid X_i] &=0 \\ -2E[Y_i \mid X_i] + 2f(X_i) &=0 \\ f(X_i) &= E[Y_i \mid X_i] \qquad \blacksquare \end{aligned}$$

Appendix B

Proof that the solution to

$$\min_{\beta_0, \beta_1} \sum (e_i + w_i)^2 \leftrightarrow \min_{\beta_0, \beta_1} \sum (Y_i - \beta_0 -\beta_1 X_i)^2$$

is

$$\begin{aligned} \beta_0 &= \bar{y}- \beta_1\bar{x} \\ \beta_1 &= \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i- \bar{x})}{\sum_{i=1}^n (x_i- \bar{x})^2} \end{aligned}$$

$\square$ Let

$$L = \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2$$

We solve:

$$\begin{aligned}
&\min_{\beta_1, \beta_2} L \\
&\leftrightarrow \begin{cases} \frac{dL}{d \beta_1} =0 \qquad \text{Call this }(a) \\ \frac{dL}{d \beta_2} =0 \qquad \text{Call this }(b) \end{cases} \\
&\leftrightarrow \begin{cases} \sum_{i=1}^n(\frac{d(y_i-\beta_1-\beta_2x_i)^2}{d \beta_1}) =0 \qquad \qquad \text{derivative of a sum}\\ (b)\end{cases} \\
&\leftrightarrow \begin{cases}\sum_{i=1}^n-2(y_i-\beta_1-\beta_2x_i) =0 \\ (b)\end{cases} \\
&\leftrightarrow \begin{cases}\beta_1 = \bar{y} - \beta_2\bar{x} \\ (b)\end{cases} \\[2ex]
&\leftrightarrow \begin{cases}(a) \\ \sum_{i=1}^n(\frac{d(y_i-\beta_1-\beta_2x_i)^2}{d \beta_2}) =0 \qquad \qquad \text{derivative of a sum}\end{cases} \\
&\leftrightarrow \begin{cases}(a)\\\sum_{i=1}^n-2x_i(y_i-\beta_1-\beta_2x_i) =0\end{cases} \\
&\leftrightarrow \begin{cases}(a)\\\sum_{i=1}^n-2x_i(y_i-\bar{y}+\beta_2\bar{x}-\beta_2x_i) =0 \qquad \qquad \text{substitute }\beta_1 \end{cases} \\
&\leftrightarrow \begin{cases}(a)\\\sum_{i=1}^n x_i(y_i-\bar{y}+\beta_2\bar{x}-\beta_2x_i) =0\qquad \qquad \text{divide by -2}\end{cases} \\
&\leftrightarrow \begin{cases}(a) \\ \sum_{i=1}^n x_i(y_i-\bar{y}) +\sum_{i=1}^n \beta_2 x_i(\bar{x} - x_i) =0 \qquad \qquad \text{rearrange}\end{cases} \\
&\leftrightarrow \begin{cases}(a) \\ \sum_{i=1}^n x_i(y_i-\bar{y}) =\sum_{i=1}^n \beta_2 x_i(x_i - \bar{x}) \qquad \qquad \text{rearrange}\end{cases} \\
&\leftrightarrow \begin{cases}(a) \\ \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) =\sum_{i=1}^n \beta_2 (x_i - \bar{x})^2 \qquad \qquad \text{by fact A} \end{cases} \\
&\leftrightarrow \begin{cases}(a) \\ \beta_2 = \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i- \bar{x})}{\sum_{i=1}^n (x_i- \bar{x})^2} \qquad \qquad \text{rearrange}\end{cases} \\[2ex]
&\leftrightarrow \begin{cases}\beta_1 = \bar{y} - \beta_2\bar{x} \\ \beta_2 = \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i- \bar{x})}{\sum_{i=1}^n (x_i- \bar{x})^2} \end{cases}
\end{aligned}$$

Thus we write:

$$\begin{aligned} \hat{\beta_1} &= \bar{y} - \hat{\beta_2}\bar{x} \\ \hat{\beta_2} &= \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i- \bar{x})}{\sum_{i=1}^n (x_i- \bar{x})^2} \ \ \blacksquare \end{aligned}$$

Fact A is proven here:

$$\begin{aligned} \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) &= \sum_{i=1}^n(x_iy_i - x_i\bar{y} -\bar{x}y_i +\bar{x}\bar{y}) \\ &= \sum_{i=1}^n(x_iy_i) -\bar{y}\sum_{i=1}^n(x_i) - \bar{x}\sum_{i=1}^n(y_i) + n\bar{x}\bar{y} \\ &= \sum_{i=1}^n(x_iy_i) -n\bar{y}\bar{x} - n\bar{y}\bar{x} + n\bar{x}\bar{y} \\ &= \sum_{i=1}^n(x_iy_i) -n\bar{y}\bar{x} \\ &= \sum_{i=1}^n(x_iy_i) -\sum_{i=1}^ny_i\bar{x} & &= \sum_{i=1}^n(x_iy_i) -\sum_{i=1}^nx_i\bar{y} \\ &= \sum_{i=1}^n y_i(x_i-\bar{x}) \qquad\text{factorise } y_i & &= \sum_{i=1}^n x_i(y_i-\bar{y}) \qquad\text{factorise } x_i \end{aligned}$$

A special case of fact A is:

$$\sum_{i=1}^n (x_i-\bar{x})^2 = \sum_{i=1}^n x_i(x_i-\bar{x})$$
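A quick numerical check of fact A and its special case (random toy data; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.normal(size=50)

lhs = np.sum((x - x.mean()) * (y - y.mean()))
print(np.isclose(lhs, np.sum(y * (x - x.mean()))))   # factorise y_i
print(np.isclose(lhs, np.sum(x * (y - y.mean()))))   # factorise x_i
print(np.isclose(np.sum((x - x.mean())**2),
                 np.sum(x * (x - x.mean()))))        # the special case
```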

Appendix C

Proof that

$$\min_{\beta} U'U \leftrightarrow \beta = (X'X)^{-1} X'Y$$

where $Y = X\beta + U$

and where

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad X = \begin{bmatrix} 1& x_{1,1} & x_{2,1} & \cdots & x_{K,1}\\ 1& x_{1,2} & x_{2,2} & \cdots & x_{K,2}\\ \vdots &\vdots & \vdots & \ddots & \vdots \\ 1& x_{1,n} & x_{2,n} & \cdots & x_{K,n} \end{bmatrix}$$

$$\beta = \begin{bmatrix} \beta_0 & \beta_1 & \cdots & \beta_K \end{bmatrix}^T = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix} \qquad U = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$

$\square$ Then $U'U = \begin{bmatrix} \sum_{i=1}^n u_i u_i \end{bmatrix} = SSR$

We can also write:

$$\begin{aligned} U'U &= (Y-X\beta)'(Y-X\beta) \\ &= (Y' -(X\beta)')(Y-X\beta) \\ &= Y'Y - Y'X\beta -(X\beta)'Y +(X\beta)'(X\beta) \\ &= Y'Y - ((X\beta)'Y)' -(X\beta)'Y + \beta'X'X\beta \end{aligned}$$

Since $(X\beta)'Y$ is a scalar, it is equal to its transpose $Y'X\beta$. Thus:

$$U'U = Y'Y - 2Y'X\beta + \beta'X'X\beta$$

We then solve:

$$\begin{aligned} \min_{\beta} U'U &\leftrightarrow \frac{dU'U}{d\beta}=0\\ & \leftrightarrow \frac{dY'Y}{d\beta} -2\frac{d\, Y'X\beta}{d\beta}+ \frac{d\beta'X'X\beta}{d\beta} =0 \\ & \leftrightarrow 0 -2(Y'X)' + (X'X+(X'X)')\beta =0 \\ & \leftrightarrow -2X'Y + 2X'X\beta =0 \\ & \leftrightarrow X'X\beta = X'Y \end{aligned}$$

Assuming that $X'X$ is invertible (since it’s a square matrix, this is equivalent to $\det(X'X) \neq 0$), we have:

$$\min_{\beta} U'U \leftrightarrow \beta = (X'X)^{-1} X'Y \ \ \blacksquare$$
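A sketch of the matrix formula in code (toy data; numpy assumed): build a design matrix with a column of ones, solve the normal equations $X'X\beta = X'Y$, and check the result against numpy’s least-squares solver and the scalar formulas from appendix B.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(size=30)

X = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix: intercept column, then x
beta = np.linalg.solve(X.T @ X, X.T @ y)    # solves X'X beta = X'Y, i.e. beta = (X'X)^{-1} X'Y

# Scalar formulas from appendix B.
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

lstsq_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, [b0, b1]), np.allclose(beta, lstsq_beta))  # True True
```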

  1. For the curious, or those who have too much time on their hands, I include a full version history, showing how this document evolved over the past few weeks. It’s an interesting window into my thought process. 

  2. As a separate gripe from the main one in this post, I note that what I call $e_i + w_i$ is often just written as $w_i$; by this I mean that, in the same document, people will write $Y_i = \beta_0 + \beta_1 X_i + w_i$ and $Y_i = E[Y_i \mid X_i] + w_i$. This is either a terrible choice of notation (the same name for two different objects) or an implicit and unnecessary (in this case) assumption that $e_i = 0$ and $u_i = w_i$.
