How much of the fall in fertility could be explained by lower mortality?

(Figure: Our World in Data scatter plot of fertility vs. infant survival.)

Many people think that lower child mortality causes fertility to decline.

One prominent theory for this relationship, as described by Our World in Data1, is that “infant survival reduces the parents’ demand for children”2. (Infants are children under 1 year old.)

In this article, I want to look at how we can precisify that theory, and what magnitude the effect could possibly take. What fraction of the decline in birth rates could the theory explain?

Important. I don’t want to make claims here about how parents actually make fertility choices. I only want to examine the implications of various models, and specifically how much of the observed changes in fertility the models could explain.

Constant number of children

One natural interpretation of “increasing infant survival reduces the parents’ demand for children” is that parents are adjusting the number of births to keep the number of surviving children constant.

Looking at Our World in Data’s graph, we can see that in most of the countries depicted, the infant survival rate went from about 80% to essentially 100%. This is a factor of 1.25. Meanwhile, there were 1/3 as many births. If parents were adjusting the number of births to keep the number of surviving children constant, the decline in infant mortality would explain a change in births by a factor of 1/1.25=0.8, a -0.2 change that is only 30% of the -2/3 change in births.
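As a sanity check, this arithmetic can be written out in a few lines of Python (the 80% and 100% survival rates are approximate values read off the graph):

```python
# Survival rises from ~80% to ~100%, i.e. by a factor of 1.25;
# holding surviving children constant, births scale by the inverse factor.
s_before, s_after = 0.80, 1.00
birth_factor = s_before / s_after          # 0.8, i.e. a -0.2 change
observed_change = 1 / 3 - 1                # births fell to 1/3: a -2/3 change
share_explained = (birth_factor - 1) / observed_change
print(round(share_explained, 2))           # 0.3
```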

The basic mathematical reason this happens is that even when mortality is tragically high, the survival rate is still thankfully much closer to 1 than to 0, so even a very large proportional fall in mortality will only amount to a small proportional increase in survival.

Some children survive infancy but die later in childhood. Although Our World in Data’s quote focuses on infant mortality, it makes sense to consider older children too. I’ll look at under-5 mortality, which generally has better data than older age groups, and also captures a large fraction of all child mortality3.

England (1861-1951)

England is a country with an early demographic transition and good data available.

Doepke 2005 quotes the following numbers:

                     1861    1951
Infant mortality     16%     3%
1-5yo mortality      13%     0.5%
0-5yo mortality      27%     3.5%
Survival to 5 years  73%     96.5%
Fertility            4.9     2.1

Fertility fell by 57%, while survival to 5 years rose by 32%. Hence, if parents aim to keep the number of surviving children constant, the change in child survival can explain 43%4 of the actual fall in fertility. (It would have explained only 23% had we erroneously considered only the change in infant survival.)
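The same computation, written as a small Python helper (the function is mine, and simply encodes the constant-surviving-children assumption):

```python
def share_explained(fert_before, fert_after, surv_before, surv_after):
    """Fraction of an observed fertility decline explained if parents
    adjust births to keep the number of surviving children constant."""
    implied_fert = fert_before * surv_before / surv_after  # keeps n = s*b fixed
    return (fert_before - implied_fert) / (fert_before - fert_after)

# England, 1861-1951 (numbers from the table above):
print(round(share_explained(4.9, 2.1, 0.73, 0.965), 2))   # 0.43
```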

Sub-Saharan Africa (1990-2017)

If we look now at sub-Saharan Africa data from the World Bank, the 1990-2017 change in fertility is from 6.3 to 4.8, a 25% decrease, whereas the 5-year survival rate went from 0.82 to 0.92, a 12% increase. So the fraction of the actual change in fertility that could be explained by the survival rate is 44%. (This would have been 23% had we looked only at infant survival).

Source data and calculations

So far, we have seen that this very simple theory of parental decision-making can explain 30-44% of the decline in fertility, while also noticing that considering childhood mortality beyond infancy was important for giving the theory its full due.

However, in more sophisticated models of fertility choices, the theory looks worse.

A more sophisticated model of fertility decisions

Let us imagine that instead of holding it constant, parents treat the number of surviving children as one good among many in an optimization problem.

An increase in the child survival rate can be seen as a decrease in the cost of surviving children. Parents will then substitute away from other goods and increase their target number of surviving children. If your child is less likely to die as an infant, you may decide to aim to have more children: the risk of experiencing the loss of a child is lower.5

For a more formal analysis, we can turn to the Barro and Becker (1989) model of fertility. I’ll be giving a simplified version of the presentation in Doepke 2005.

In this model, parents care about their own consumption as well as their number of surviving children. The parents maximise6

U(c,n) = u(c) + n^\epsilon V


  • n is the number of surviving children and V is the value of a surviving child
  • \epsilon is a constant \in (0,1)
  • u(c) is the part of utility that depends on consumption7

The income of a parent is w, and there is a cost per birth of p and an additional cost of q per surviving child8. The parents choose b, the number of births. s is the probability of survival of a child, so that n = sb.

Consumption is therefore c = w - (p+qs)b, and the problem becomes \max_{b} U = u(w-(p+qs)b) + (sb)^\epsilon V

Letting b^*(s) denote the optimal number of births as a function of s, what are its properties?

The simplest one is that sb^*(s), the number of surviving children, is increasing in s. This is the substitution effect we described intuitively earlier in this section. It means that if s is multiplied by a factor x (say 1.25), b^*(s) will be multiplied by more than 1/x (more than 0.8).
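To make this concrete, here is a minimal numerical sketch of the maximisation problem. The preference and cost parameters below are illustrative assumptions of mine, not Doepke's calibration, so the exact outputs carry no weight; the point is only the qualitative behaviour of b^*(s) and sb^*(s):

```python
# Grid-search sketch of max_b u(w - (p+q*s)*b) + (s*b)**eps * V,
# with CRRA utility u(c) = c**(1-sigma) / (1-sigma).
# All parameter values are illustrative assumptions, not a calibration.
sigma, eps, V = 0.5, 0.6, 1.0   # preferences (assumed)
w, p, q = 10.0, 0.5, 1.0        # income, cost per birth, cost per survivor (assumed)

def utility(b, s):
    c = w - (p + q * s) * b      # consumption left after child costs
    return c ** (1 - sigma) / (1 - sigma) + (s * b) ** eps * V

def b_star(s, grid=10_000):
    """Optimal births b*(s), by brute-force search over feasible b."""
    b_max = w / (p + q * s)      # largest b with non-negative consumption
    candidates = [i * b_max / grid for i in range(1, grid)]
    return max(candidates, key=lambda b: utility(b, s))

for s in (0.73, 0.965):          # England's under-5 survival, 1861 and 1951
    b = b_star(s)
    print(f"s = {s}: births b* = {b:.2f}, surviving children s*b* = {s * b:.2f}")
```

With these (assumed) parameters, births fall only slightly as s rises while surviving children increase, in line with the substitution effect described above.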

When we looked at the simplest model, with a constant number of children, we guessed that it could explain 30-44% of the fall in fertility. That number is a strict upper bound on what the current model could explain.

What we really want to know, to answer the original question, is how b^*(s) itself depends on s. To do this, we need to get a little bit more into the relative magnitude of the cost per birth p and the additional cost q per surviving child. As Doepke writes,

If a major fraction of the total cost of children accrues for every birth, fertility [i.e. b^*(s)] would tend to increase with the survival probability; the opposite holds if children are expensive only after surviving infancy9.

This tells us that falling mortality could actually cause fertility to increase rather than decrease.10

To go further, we need to plug in actual values for the model parameters. Doepke does this, using numbers that reflect the child mortality situation of England in 1861 and 1951, but also what seem to be some pretty arbitrary assumptions about the parent’s preferences (the shape of u and the value of \epsilon).

With these assumptions, he finds that “the total fertility rate falls from 5.0 (the calibrated target) to 4.2 when mortality rates are lowered to the 1951 level”11, a 16% decrease. This is 28% of the actually observed fall in fertility to 2.1.

Extensions of the Barro-Becker model

The paper then considers various extensions of the basic Barro-Becker model to see if they could explain the large decrease in fertility that we observe.

For example, it has been hypothesized that when there is uncertainty about whether a child will survive (hitherto absent from the models), parents want to avoid the possibility of ending up with zero surviving children. They therefore have many children as a precautionary measure. Declining mortality (which reduces uncertainty, since survival rates are thankfully greater than 0.5) would have a strong negative impact on births.

However, Doepke also considers a third model, that incorporates not only stochastic mortality but also sequential fertility choice, where parents may condition their fertility decisions on the observed survival of children that were born previously. The sequential aspect reduces the uncertainty that parents face over the number of surviving children they will end up with.

The stochastic and sequential models make no clear-cut predictions based on theory alone. Using the England numbers, however, Doepke finds a robust conclusion. In the stochastic+sequential model, for almost all reasonable parameter values, the expected number of surviving children still increases with s (my emphasis):

To illustrate this point, let us consider the extreme case [where] utility from consumption is close to linear, while risk aversion with regards to the number of surviving children is high. … [W]hen we move (with the same parameters) to the more realistic sequential model, where parents can replace children who die early, … despite the high risk aversion with regards to the number of children, total fertility drops only to 4.0, and net fertility rises to 3.9, just as with the benchmark parameters. … Thus, in the sequential setup the conclusion that mortality decline raises net fertility is robust to different preference specifications, even if we deliberately emphasize the precautionary motive for hoarding children.

So even here, the fall in mortality would only explain 35% of the actually observed change in fertility. It seems that the ability to “replace” children who did not survive in the sequential model is enough to make its predictions pretty similar to the simple Barro-Becker model.

  1. The quote in context on Our World in Data’s child mortality page: “the causal link between infant [<1 year old] survival and fertility is established in both directions: Firstly, increasing infant survival reduces the parents’ demand for children. And secondly, a decreasing fertility allows the parents to devote more attention and resources to their children.” 

  2. As an aside, my impression is that if you asked an average educated person “Why do women in developing countries have more children?”, their first idea would be: “because child mortality is higher”. It’s almost a trope, and I feel that it’s often mentioned pretty glibly, without actually thinking about the decisions and trade-offs faced by the people concerned. That’s just an aside though – the theory clearly has prima facie plausibility, and is also cited in serious places like academia and Our World in Data. It deserves closer examination. 

  3. It should be possible to conduct the Africa analysis for different ages using IHME’s more granular data, but it’s a bit more work. (There appears to be no direct data on deaths per birth as opposed to per capita, and data on fertility is contained in a different dataset from the main Global Burden of Disease data.) 

  4. All things decay. Should this Google Sheets spreadsheet become inaccessible, you can download this .xlsx copy which is stored together with this blog. 

  5. In this light, we can see that the constant model is not really compatible with parents viewing additional surviving children as a (normal) good. Nor of course is it compatible with viewing children as a bad, for then parents would choose to have 0 children. Instead, it could for example be used to represent parents aiming for a socially normative number of surviving children. 

  6. I collapse Doepke’s \beta and V into a single constant V, since they can be treated as such in Model A, the only model that I will present mathematically in this post. 

  7. Its actual expression, that I omit from the main presentation for simplicity, is u(c) = \frac{c^{1-\sigma}}{1-\sigma}, the constant relative risk-aversion utility function. 

  8. There is nothing in the model that compels us to call p the “cost per birth”; this is merely for ease of exposition. The model itself only assumes that there are two periods for each child: in the first period, costing p to start, children face a mortality risk; and in the second period, those who survived the first face zero mortality risk and cost q. 

  9. Once again, Doepke calls the model’s early period “infancy”, but this is not inherent in the model. 

  10. It’s difficult to speculate about the relative magnitude of p and q, especially if, departing from Doepke, we make the early period of the model, say, the first 5 years of life. If the first period is only infancy, it seems plausible to me that q \gg p, but then we also fail to capture any deaths after infancy. On the other hand, extending the early period to 5 incorrectly assumes that parents get no utility from children before they reach the age of 5. 

  11. The following additional context may be helpful to understand this quote:

    The survival parameters are chosen to correspond to the situation in England in 1861. According to Preston et al. (1972) the infant mortality rate (death rate until first birthday) was 16%, while the child mortality rate (death rate between first and fifth birthday) was 13%. Accordingly, I set s_i = 0.84 and s_y = 0.87 in the sequential model, and s = s_i s_y = 0.73 in the other models. Finally, the altruism factor \beta is set in each model to match the total fertility rate, which was 4.9 in 1861 (Chesnais 1992). Since fertility choice is discrete in Models B and C, I chose a total fertility rate of 5.0 as the target.

    Each model is thus calibrated to reproduce the relationship of fertility and infant and child mortality in 1861. I now examine how fertility adjusts when mortality rates fall to the level observed in 1951, which is 3% for infant mortality and 0.5% for child mortality. The results for fertility can be compared to the observed total fertility rate of 2.1 in 1951.

    In Model A (Barro-Becker with continuous fertility choice), the total fertility rate falls from 5.0 (the calibrated target) to 4.2 when mortality rates are lowered to the 1951 level. The expected number of surviving children increases from 3.7 to 4.0. Thus, there is a small decline in total fertility, but (as was to be expected given Proposition 1) an increase in the net fertility rate.

August 5, 2021

The special case of the normal likelihood function

Summary1: The likelihood function implied by an estimate b with standard deviation \sigma is the probability density function (PDF) of a \mathcal{N}(b,\sigma^2). Though this might sound intuitive, it’s actually a special case. If we don’t firmly grasp that it’s an exception, it can be confusing.

Suppose that a study has the point estimator B for the parameter \Theta. The study results are an estimate B = b (typically a regression coefficient), and an estimated standard deviation2 \hat{sd}(B) = s.

In order to know how to combine this information with a prior over \Theta to update our beliefs, we need to know the likelihood function implied by the study. The likelihood function is the probability of observing the study data B = b given different values for \Theta. It is formed from the probability of the observation B = b conditional on \Theta = \theta, but viewed and used as a function of \theta only3:

\mathcal{L}: \theta \mapsto P(B = b \mid \Theta = \theta)

The event “B = b” is often shortened to just “b” when the meaning is clear from context, so that the function can be more briefly written \mathcal{L}: \theta \mapsto P(b \mid \theta).

So, what is \mathcal{L}? In a typical regression context, B is assumed to be approximately normally distributed around \Theta, due to the central limit theorem. More precisely, \frac{B - \Theta}{sd(B)} \sim \mathcal{N}(0,1), and equivalently B \sim \mathcal{N}(\Theta, sd(B)^2).

sd(B) is seldom known, and is often replaced with its estimate s, allowing us to write B \sim \mathcal{N}(\Theta, s^2), where only the parameter \Theta is unknown4.

We can plug this into the definition of the likelihood function:

\mathcal{L}: \theta \mapsto P(b \mid \theta) = \text{PDF}_{\mathcal{N}(\theta,s^2)}(b) = \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{b-\theta}{s}\right)^2\right)

We could just leave it at that. \mathcal{L} is the function5 above, and that’s all we need to compute the posterior. But a slightly different expression for \mathcal{L} is possible. After factoring out the square,

\mathcal{L}: \theta \mapsto \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(b-\theta)^2}{s^2}\right),

we make use of the fact that (b-\theta)^2 = (\theta-b)^2 to rewrite \mathcal{L} with the positions of \theta and b flipped:

\mathcal{L}: \theta \mapsto \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\theta-b}{s}\right)^2\right).

We then notice that L\mathcal{L} is none other than

\mathcal{L}: \theta \mapsto \text{PDF}_{\mathcal{N}(b,s^2)}(\theta)

So, for all b and for all \theta, \text{PDF}_{\mathcal{N}(\theta,s^2)}(b) = \text{PDF}_{\mathcal{N}(b,s^2)}(\theta).
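This identity is easy to check numerically; here is a small self-contained sketch (the particular values of b, s, and \theta are arbitrary):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

b, s = 2.0, 1.5
for theta in (-1.0, 0.0, 3.7):
    # PDF_{N(theta, s^2)}(b) equals PDF_{N(b, s^2)}(theta)
    assert abs(normal_pdf(b, theta, s) - normal_pdf(theta, b, s)) < 1e-12
```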

The key thing to realise is that this is a special case, due to the fact that the functional form of the normal PDF is invariant to substituting b and \theta for each other. For many other distributions of B, we cannot apply this procedure.

This special case is worth commenting upon because it has personally led me astray in the past. I often encountered the case where B is normally distributed, and I used the equality above without deriving it and understanding where it comes from. It just had a vaguely intuitive ring to it. I would occasionally slip into thinking it was a more general rule, which always resulted in painful confusion.

To understand the result, let us first illustrate it with a simple numerical example. Suppose we observe an Athenian man b = 200 cm tall. For all \theta, the likelihood of this observation if Athenian men’s heights followed a \mathcal{N}(\theta,10) is the same number as the density of observing an Athenian \theta cm tall if Athenian men’s heights followed a \mathcal{N}(200,10)6.

(Figure.) Graphical representation of \text{PDF}_{\mathcal{N}(\theta,10)}(200) = \text{PDF}_{\mathcal{N}(200,10)}(\theta)

When encountering this equivalence, you might, like me, sort of nod along. But puzzlement would be a more appropriate reaction. To compute the likelihood of our 200 cm Athenian under different \Theta-values, we can substitute a totally different question: “assuming that \Theta = 200, what is the probability of seeing Athenian men of different sizes?”.

The puzzle is, I think, best resolved by viewing it as a special case, an algebraic curiosity that only applies to some distributions. Don’t even try to build an intuition for it, because it does not generalise.

To help understand this better, let’s look at a case where the procedure cannot be applied.

Suppose for example that B is binomially distributed, representing the number of successes among n independent trials with success probability \Theta. We’ll write B \sim \text{Bin}(n, \theta).

BB’s probability mass function is

g: k \mapsto \text{PMF}_{\text{Bin}(n, \theta)}(k) = {n \choose k} \theta^k (1-\theta)^{n-k}

Meanwhile, the likelihood function for the observation of bb successes is

\mathcal{M}: \phi \mapsto \text{PMF}_{\text{Bin}(n, \phi)}(b) = {n \choose b} \phi^b (1-\phi)^{n-b}

To attempt to take the PMF g, set its parameter \theta equal to b, and obtain the likelihood function would not just give incorrect values; it would be a domain error. Regardless of how we set its parameters, g could never be equal to the likelihood function \mathcal{M}, because g is defined on \{0,1,...,n\}, whereas \mathcal{M} is defined on [0,1].


The likelihood function \mathcal{Q}: P_H \mapsto P_H^2(1-P_H) for the binomial probability of a biased coin landing heads-up, given that we have observed \{Heads, Heads, Tails\}. It is defined on [0,1]. (The constant factor {3 \choose 2} is omitted, a common practice with likelihood functions, because these constant factors have no meaning and make no difference to the posterior distribution.)
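The domain mismatch can be made concrete in a few lines (n = 10 trials and b = 7 successes are arbitrary example values of mine):

```python
from math import comb

n, b = 10, 7   # n trials, b observed successes (arbitrary example values)

def pmf(k, phi):
    """g: the Bin(n, phi) probability mass at k, defined for k in {0, ..., n}."""
    return comb(n, k) * phi**k * (1 - phi) ** (n - k)

def likelihood(phi):
    """M: the likelihood of observing b successes, defined for phi in [0, 1]."""
    return pmf(b, phi)

# The PMF sums to 1 over its domain {0, ..., n} -- it is a distribution over k:
assert abs(sum(pmf(k, 0.7) for k in range(n + 1)) - 1) < 1e-9
# The likelihood is a function on [0, 1]; "setting the parameter equal to b"
# would be meaningless here, since b = 7 lies outside [0, 1]:
print(likelihood(0.7))   # about 0.267
```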

It’s hopefully now quite intuitive that the case where B is normally distributed was a special case.7

Let’s recapitulate.

The likelihood function is the probability of b \mid \theta viewed as a function of \theta only. It is absolutely not a density of \theta.

In the special case where B is normally distributed, we are, confusingly, able to express this function as if it were the density of \theta under a distribution that depends on b.

I think it’s best to think of that ability as an algebraic coincidence, due to the functional form of the normal PDF. We should think of \mathcal{L} in the case where B is normally distributed as just another likelihood function.

Finally, I’d love to know if there is some way to view this special case as enlightening rather than just a confusing exception.

I believe that to say \text{PDF}_{\theta,\Gamma}(b) = \text{PDF}_{b,\Gamma}(\theta) (where \text{PDF}_{\psi,\Gamma} denotes the PDF of a distribution with one parameter \psi that we wish to single out and a vector \Gamma of other parameters) is equivalent to saying that the PDF is symmetric around its singled-out parameter. For example, a \mathcal{N}(\mu,\sigma^2) is symmetric around its parameter \mu. But this hasn’t seemed insightful to me. Please write to me if you know an answer to this.
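One way to see that symmetry is doing the work: for an asymmetric family the swap fails. A quick sketch using the exponential distribution, chosen only as a simple one-parameter example:

```python
from math import exp

def expo_pdf(x, lam):
    """Density of an Exponential distribution with rate lam, at x >= 0."""
    return lam * exp(-lam * x)

x, lam = 2.0, 0.5
# Swapping the roles of observation and parameter changes the value:
print(expo_pdf(x, lam))   # 0.5 * e^-1, about 0.184
print(expo_pdf(lam, x))   # 2.0 * e^-1, about 0.736
```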

  1. Thanks to Gavin Leech and Ben West for feedback on a previous version of this post. 

  2. I do not use the confusing term ‘standard error’, which I believe should mean sd(B) but is often also used to denote its estimate s. 

  3. I use uppercase letters \Theta and B to denote random variables, and lowercase \theta and b for particular values (realizations) these random variables could take. 

  4. A more sophisticated approach would be to let sd(B) be another unknown parameter over which we form a prior; we would then update our beliefs jointly about \Theta and sd(B). See for example Bolstad & Curran (2016), Chapter 17, “Bayesian Inference for Normal with Unknown Mean and Variance”. 

  5. I don’t like the term “likelihood distribution”, I prefer “likelihood function”. In formal parlance, mathematical distributions are a generalization of functions, so it’s arguably technically correct to call any likelihood function a likelihood distribution. But in many contexts, “distribution” is merely used as short for “probability distribution”. So “likelihood distribution” runs the risk of making us think of “likelihood probability distribution” – but the likelihood function is not generally a probability distribution. 

  6. We are here ignoring any inadequacies of the B \sim \mathcal{N}(\Theta, s^2) assumption, including but not limited to the fact that one cannot observe men with negative heights. 

  7. Another simple reminder that the procedure couldn’t possibly work in general is that, in general, the likelihood function is not even a PDF at all. For example, a broken thermometer that always gives the temperature as 20 degrees has P(B = 20 \mid \theta) = 1 for all \theta, which evidently does not integrate to 1 over all values of \theta.

    To take a different tack, the fact that the likelihood function is invariant to reparametrization also illustrates that it is not a probability density of θ\theta (thanks to Gavin Leech for the link). 

July 31, 2021