The special case of the normal likelihood function

Summary[1]: The likelihood function implied by an estimate $b$ with standard deviation $\sigma$ is the probability density function (PDF) of a $\mathcal{N}(b,\sigma^2)$ distribution. Though this might sound intuitive, it’s actually a special case. If we don’t firmly grasp that it’s an exception, it can be confusing.

Suppose that a study has the point estimator $B$ for the parameter $\Theta$. The study results are an estimate $B=b$ (typically a regression coefficient), and an estimated standard deviation[2] $\hat{sd}(B)=s$.

In order to know how to combine this information with a prior over $\Theta$ and update our beliefs, we need to know the likelihood function implied by the study. The likelihood function is the probability of observing the study data $B=b$ given different values of $\Theta$. It is formed from the probability of the observation $B=b$ conditional on $\Theta=\theta$, but viewed and used as a function of $\theta$ only[3]:

$$\mathcal{L}: \theta \mapsto P(B = b \mid \Theta = \theta)$$

The event “$B=b$” is often shortened to just “$b$” when the meaning is clear from context, so that the function can be written more briefly as $\mathcal{L}: \theta \mapsto P(b \mid \theta)$.

So, what is $\mathcal{L}$? In a typical regression context, $B$ is assumed to be approximately normally distributed around $\Theta$, due to the central limit theorem. More precisely, $\frac{B - \Theta}{sd(B)} \sim \mathcal{N}(0,1)$, and equivalently $B \sim \mathcal{N}(\Theta, sd(B)^2)$.

$sd(B)$ is seldom known, and is often replaced with its estimate $s$, allowing us to write $B \sim \mathcal{N}(\Theta, s^2)$, where only the parameter $\Theta$ is unknown[4].
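If you want to see this normal approximation emerge concretely, here is a minimal simulation sketch; the exponential population and the sample-mean estimator are made-up stand-ins for a generic study, and any setting where the central limit theorem applies would do:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up setting: Theta is the mean of a skewed (exponential)
# population, and the estimator B is the sample mean of n draws.
theta_true = 2.0
n = 500

zs = []
for _ in range(10_000):
    sample = rng.exponential(scale=theta_true, size=n)
    b = sample.mean()                     # point estimate B = b
    s = sample.std(ddof=1) / np.sqrt(n)   # estimated sd of B
    zs.append((b - theta_true) / s)

zs = np.asarray(zs)
# (B - Theta) / sd(B) behaves like a standard normal despite the skew:
print(zs.mean(), zs.std())  # close to 0 and 1
```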

We can plug this into the definition of the likelihood function:

$$\mathcal{L}: \theta \mapsto P(b \mid \theta) = \text{PDF}_{\mathcal{N}(\theta,s^2)}(b) = \frac{1}{s\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{b-\theta}{s}\right)^{2}\right)$$
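Here is that definition written out numerically, as a sanity check; the study results $b = 1.3$ and $s = 0.4$ are made up for illustration, and scipy’s `norm.pdf` serves as the reference:

```python
import numpy as np
from scipy.stats import norm

b, s = 1.3, 0.4  # made-up study results: estimate and estimated sd

def likelihood(theta):
    """L: theta -> PDF of N(theta, s^2) evaluated at the fixed observation b."""
    return np.exp(-0.5 * ((b - theta) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Cross-check against scipy: the normal PDF with mean theta, evaluated at b.
for theta in [0.5, 1.0, 1.3, 2.0]:
    assert np.isclose(likelihood(theta), norm.pdf(b, loc=theta, scale=s))
```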

We could just leave it at that. $\mathcal{L}$ is the function[5] above, and that’s all we need to compute the posterior. But a slightly different expression for $\mathcal{L}$ is possible. After distributing the square over the fraction,

$$\mathcal{L}: \theta \mapsto \frac{1}{s\sqrt{2\pi}}\exp\left(-\frac{1}{2}\,\frac{(b-\theta)^2}{s^2}\right),$$

we make use of the fact that $(b-\theta)^2 = (\theta-b)^2$ to rewrite $\mathcal{L}$ with the positions of $\theta$ and $b$ flipped:

$$\mathcal{L}: \theta \mapsto \frac{1}{s\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{\theta-b}{s}\right)^{2}\right).$$

We then notice that $\mathcal{L}$ is none other than

$$\mathcal{L}: \theta \mapsto \text{PDF}_{\mathcal{N}(b,s^2)}(\theta)$$

So, for all $b$ and for all $\theta$, $\text{PDF}_{\mathcal{N}(\theta,s^2)}(b) = \text{PDF}_{\mathcal{N}(b,s^2)}(\theta)$.
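This identity is easy to check numerically. A minimal sketch, over arbitrary randomly drawn values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# PDF_{N(theta, s^2)}(b) == PDF_{N(b, s^2)}(theta), whatever the values:
for _ in range(1_000):
    b, theta = rng.normal(size=2)
    s = rng.uniform(0.1, 3.0)
    assert np.isclose(norm.pdf(b, loc=theta, scale=s),
                      norm.pdf(theta, loc=b, scale=s))
```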

The key thing to realise is that this is a special case, due to the fact that the functional form of the normal PDF is invariant to substituting $b$ and $\theta$ for each other. For many other distributions of $B$, we cannot apply this procedure.

This special case is worth commenting upon because it has personally led me astray in the past. I often encountered the case where $B$ is normally distributed, and I used the equality above without deriving it and understanding where it comes from. It just had a vaguely intuitive ring to it. I would occasionally slip into thinking it was a more general rule, which always resulted in painful confusion.

To understand the result, let us first illustrate it with a simple numerical example. Suppose we observe an Athenian man $b=200$ cm tall. For all $\theta$, the likelihood of this observation if Athenian men’s heights followed a $\mathcal{N}(\theta,10)$ is the same number as the density of observing an Athenian $\theta$ cm tall if Athenian men’s heights followed a $\mathcal{N}(200,10)$[6].

Graphical representation of $\text{PDF}_{\mathcal{N}(\theta,10)}(200) = \text{PDF}_{\mathcal{N}(200,10)}(\theta)$
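In code, with the Athenian numbers (reading the 10 as a variance, in keeping with the $\mathcal{N}(b,\sigma^2)$ notation used above):

```python
import numpy as np
from scipy.stats import norm

b = 200              # the observed height, in cm
scale = np.sqrt(10)  # N(., 10) with 10 read as a variance

for theta in [195, 200, 205, 210]:
    lik = norm.pdf(b, loc=theta, scale=scale)   # likelihood of the 200 cm man under N(theta, 10)
    dens = norm.pdf(theta, loc=b, scale=scale)  # density of a theta cm man under N(200, 10)
    print(theta, lik, dens)                     # the last two columns coincide
```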

When encountering this equivalence, you might, like me, sort of nod along. But puzzlement would be a more appropriate reaction. To compute the likelihood of our 200 cm Athenian under different $\Theta$-values, we can substitute a totally different question: “assuming that $\Theta=200$, what is the probability of seeing Athenian men of different heights?”.

The puzzle is, I think, best resolved by viewing it as a special case, an algebraic curiosity that only applies to some distributions. Don’t even try to build an intuition for it, because it does not generalise.

To help understand this better, let’s look at a case where the procedure cannot be applied.

Suppose for example that $B$ is binomially distributed, representing the number of successes among $n$ independent trials with success probability $\Theta$. We’ll write $B \sim \text{Bin}(n, \theta)$.

$B$’s probability mass function is

$$g: k \mapsto \text{PMF}_{\text{Bin}(n, \theta)}(k) = {n \choose k} \theta^k (1-\theta)^{n-k}$$

Meanwhile, the likelihood function for the observation of $b$ successes is

$$\mathcal{M}: \phi \mapsto \text{PMF}_{\text{Bin}(n, \phi)}(b) = {n \choose b} \phi^b (1-\phi)^{n-b}$$

To attempt to take the PMF $g$, set its parameter $\theta$ equal to $b$, and obtain the likelihood function would not just give incorrect values; it would be a domain error. Regardless of how we set its parameters, $g$ could never be equal to the likelihood function $\mathcal{M}$, because $g$ is defined on $\{0,1,\ldots,n\}$, whereas $\mathcal{M}$ is defined on $[0,1]$.


The likelihood function $\mathcal{Q}: P_H \mapsto P_H^2(1-P_H)$ for the binomial probability of a biased coin landing heads-up, given that we have observed $\{\text{Heads}, \text{Heads}, \text{Tails}\}$. It is defined on $[0,1]$. (The constant factor ${3 \choose 2}$ is omitted, a common practice with likelihood functions, because these constant factors have no meaning and make no difference to the posterior distribution.)
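Here is a short numerical sketch of the counterexample, using the coin data from the figure (three tosses, two heads). In passing, it also checks that the likelihood needn’t integrate to 1:

```python
from scipy.stats import binom
from scipy.integrate import quad

n, b = 3, 2  # three tosses, two heads: {Heads, Heads, Tails}

# The PMF g is a function of the COUNT k, defined on {0, 1, ..., n}:
print([binom.pmf(k, n, 0.5) for k in range(n + 1)])

# The likelihood is a function of the PROBABILITY, defined on [0, 1]:
def likelihood(p_h):
    return binom.pmf(b, n, p_h)  # = C(3,2) * p_h^2 * (1 - p_h)

print(likelihood(0.3), likelihood(0.5), likelihood(0.9))

# No setting of g's parameters turns it into the likelihood: the domains differ.
# The likelihood is also not a density of p_h -- it doesn't integrate to 1:
area, _ = quad(likelihood, 0, 1)
print(area)  # 3 * 1/12 = 0.25, not 1
```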

It’s hopefully now quite intuitive that the case where $B$ is normally distributed was a special case.[7]

Let’s recapitulate.

The likelihood function is the probability of $b \mid \theta$ viewed as a function of $\theta$ only. It is absolutely not a density of $\theta$.

In the special case where $B$ is normally distributed, we have the confusing ability to express this function as if it were the density of $\theta$ under a distribution that depends on $b$.

I think it’s best to think of that ability as an algebraic coincidence, due to the functional form of the normal PDF. We should think of $\mathcal{L}$ in the case where $B$ is normally distributed as just another likelihood function.
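To make “just another likelihood function” concrete, here is a minimal grid-based update; the study results and the $\mathcal{N}(0,1)$ prior are both made up for illustration:

```python
import numpy as np
from scipy.stats import norm

b, s = 1.3, 0.4  # the same made-up study results as before

# Grid-based Bayesian update: posterior is proportional to prior times likelihood.
theta = np.linspace(-2, 4, 2001)
prior = norm.pdf(theta, loc=0.0, scale=1.0)  # made-up N(0, 1) prior over Theta
lik = norm.pdf(b, loc=theta, scale=s)        # L(theta), used as any likelihood would be

posterior = prior * lik
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalise on the grid

print(theta[np.argmax(posterior)])  # posterior mode, about 1.12 here
```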

Finally, I’d love to know if there is some way to view this special case as enlightening rather than just a confusing exception.

I believe that saying $\text{PDF}_{\theta,\Gamma}(b)=\text{PDF}_{b,\Gamma}(\theta)$ (where $\text{PDF}_{\psi,\Gamma}$ denotes the PDF of a distribution with one parameter $\psi$ that we wish to single out and a vector $\Gamma$ of other parameters) is equivalent to saying that the PDF is symmetric around its singled-out parameter. For example, a $\mathcal{N}(\mu,\sigma^2)$ is symmetric around its parameter $\mu$. But this hasn’t seemed insightful to me. Please write to me if you know an answer to this.
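One numerical way to poke at this: compare the normal with another location family whose density is not symmetric around its location parameter, such as the Gumbel (scipy’s `gumbel_r`):

```python
from scipy.stats import norm, gumbel_r

b, theta = 2.0, 0.5

# Normal: symmetric around its location parameter, so the swap holds.
print(norm.pdf(b, loc=theta), norm.pdf(theta, loc=b))          # equal

# Gumbel: also a location family, but asymmetric, so the swap fails.
print(gumbel_r.pdf(b, loc=theta), gumbel_r.pdf(theta, loc=b))  # not equal
```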

  1. Thanks to Gavin Leech and Ben West for feedback on a previous version of this post.

  2. I do not use the confusing term ‘standard error’, which I believe should mean $sd(B)$ but is often also used to denote its estimate $s$.

  3. I use uppercase letters $\Theta$ and $B$ to denote random variables, and lowercase $\theta$ and $b$ for particular values (realizations) these random variables could take.

  4. A more sophisticated approach would be to let $sd(B)$ be another unknown parameter over which we form a prior; we would then update our beliefs jointly about $\Theta$ and $sd(B)$. See, for example, Bolstad & Curran (2016), Chapter 17, “Bayesian Inference for Normal with Unknown Mean and Variance”.

  5. I don’t like the term “likelihood distribution”; I prefer “likelihood function”. In formal parlance, mathematical distributions are a generalization of functions, so it’s arguably technically correct to call any likelihood function a likelihood distribution. But in many contexts, “distribution” is merely used as short for “probability distribution”. So “likelihood distribution” runs the risk of making us think of “likelihood probability distribution” – but the likelihood function is not generally a probability distribution.

  6. We are here ignoring any inadequacies of the $B \sim \mathcal{N}(\Theta, s^2)$ assumption, including but not limited to the fact that one cannot observe men with negative heights.

  7. Another simple reminder that the procedure couldn’t possibly work in general is that in general the likelihood function is not even a PDF at all. For example, a broken thermometer that always gives the temperature as 20 degrees has $P(B=20 \mid \theta) = 1$ for all $\theta$, which evidently does not integrate to 1 over all values of $\theta$.

    To take a different tack, the fact that the likelihood function is invariant to reparametrization also illustrates that it is not a probability density of $\theta$ (thanks to Gavin Leech for the link).

July 31, 2021