The relative importance of hardware and software progress: evidence from computer chess

I did this analysis in September 2018. Soon after, my career took a turn away from this topic, and I never bothered to publish it. In September 2022, I finally got around to sharing it here. This page is simply the unmodified LaTeX file I used at the time, converted to Markdown using Pandoc. Some tables are included as screenshots because it was simpler that way.

I don't want to invest the time to vet this piece; it's plausible that it contains a mistake, or is otherwise embarrassing. It's a snapshot from 2018.

Given the increase in attention to AI forecasting in recent years, I also suspect this piece is well behind the state of the art today.

Summary

One topic within AI forecasting is: what is the relative importance of hardware and software to AI progress (Grace 2015, 2017)?

In order to make such forecasts, one option is to look at past events in a relevant reference class. In this document, I present new evidence on the relative importance of hardware and software in explaining the last 33 years of progress in computer chess. I construct and analyse a novel dataset using previously unexploited raw data of 54,919 games organised by the Swedish Computer Chess Association between 1985 and 2018.

This dataset contains instances of the same chess programme run on a range of hardware setups, as well as cases where a single piece of hardware was used to run many different programmes. This allows me to isolate the independent effect of hardware and software on performance.

One approach is to use dummy variables to directly capture the effect of each chess programme, controlling for clock speed. This approach estimates the effect of a clock speed doubling at 76 Elo. As for the dummy coefficients, they tell us, for example, that the software improvement from Fritz 1 to Deep Fritz 8 was 462 Elo. Further interpreting these numbers would require enough background knowledge of computer chess to develop an intuitive sense of how much intellectual progress particular algorithms represented.

A second approach introduces the date a programme was released as a variable, instead of dummies. Because I measure only a proxy for the date of release, this estimate is noisier, but it is more readily interpretable. The date-based estimate suggests that every additional year of chess programme development produces a gain of 10 to 20 Elo, while every clock speed doubling produces an increase of between 69 and 120 Elo, depending on the specification. In a CPU database covering 1971-2014, a clock speed doubling occurs every 3.46 years, suggesting that hardware has been roughly twice as important as software in explaining historical chess progress.

This echoes previous findings that ‘gains from algorithmic progress have been roughly fifty to one hundred percent as large as those from hardware progress’ (Grace 2013). However, the present report goes beyond previous analyses in two ways. First, it ensures that Elo score comparisons are meaningful by using data from a single chess rating list. In addition, it uses multiple regression to more systematically analyse a larger dataset covering a longer time period.

The Elo system

Performance in chess is traditionally measured using Elo scores. In the Elo system, according to Wikipedia,

Performance isn’t measured absolutely; it is inferred from wins, losses, and draws against other players. Players’ ratings depend on the ratings of their opponents, and the results scored against them. The difference in rating between two players determines an estimate for the expected score between them. […] A player’s expected score is their probability of winning plus half their probability of drawing.

If Player A has a rating of \(R_A\) and player B a rating of \(R_B\), the expected score of Player A is \(E_A=(1 + 10^{(R_B-R_A )/400})^{-1}\).

When a player’s actual tournament scores exceed their expected scores, the Elo system takes this as evidence that player’s rating is too low, and needs to be adjusted upward. Similarly when a player’s actual tournament scores fall short of their expected scores, that player’s rating is adjusted downward. Elo’s original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player overperformed or underperformed their expected score. The maximum possible adjustment per game, called the K-factor, was set at K = 16 for masters and K = 32 for weaker players.

Suppose Player A was expected to score \(E_{A}\) points but actually scored \(S_{A}\) points. The formula for updating their rating is \(R_{A}^{\prime }=R_{A}+K(S_{A}-E_{A})\).

This update can be performed after each game or each tournament, or after any suitable rating period. An example may help clarify. Suppose Player A has a rating of 1613, and plays in a five-round tournament. He or she loses to a player rated 1609, draws with a player rated 1477, defeats a player rated 1388, defeats a player rated 1586, and loses to a player rated 1720. The player’s actual score is (0 + 0.5 + 1 + 1 + 0) = 2.5. The expected score, calculated according to the formula above, was (0.51 + 0.69 + 0.79 + 0.54 + 0.35) = 2.88. Therefore, the player’s new rating is (1613 + 32(2.5 - 2.88)) = 1601, assuming that a K-factor of 32 is used. Equivalently, each game the player can be said to have put an ante of K times their expected score for the game into a pot, the opposing player also puts K times their expected score into the pot, and the winner collects the full pot of value K; in the event of a draw the players split the pot and receive K/2 points each.
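As a quick sanity check, here is a minimal Python sketch that reproduces the worked example above (the function names are my own):

```python
def expected_score(r_a, r_b):
    """Expected score of a player rated r_a against an opponent rated r_b."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def updated_rating(r_a, opponents, scores, k=32):
    """New rating after a rating period, using R'_A = R_A + K(S_A - E_A)."""
    e_a = sum(expected_score(r_a, r_b) for r_b in opponents)
    s_a = sum(scores)
    return r_a + k * (s_a - e_a)

# The five-round tournament from the example: losses score 0, draws 0.5, wins 1.
opponents = [1609, 1477, 1388, 1586, 1720]
scores = [0, 0.5, 1, 1, 0]
print(round(updated_rating(1613, opponents, scores)))  # 1601
```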

Simpler approaches and why they fail

Approaches which use only variation in software cannot compare the effects of software and hardware

CCRL (Computer Chess Rating Lists) is an organisation that tests computer chess programmes’ strength by playing the programmes against each other. Each programme is given the same thinking time, or “time control”. On the CCRL 40/40 list, the programmes have the equivalent of 40 minutes for 40 moves on an AMD X2 4600+ processor at 2.4GHz. The programme Crafty 19.17 BH is run as a benchmark on the tester’s computer to determine the equivalent time control for their machine.

All programmes on the CCRL list use equivalent hardware, so any difference in their performance can be attributed to software. Grace (2013) writes:

To confirm that substantial software progress does occur, we can look at the CCRL (2013) comparison of Rybka engines. Rybka 1.1 64-bit was the best of its time (the year 2006), but on equivalent hardware to Rybka 4.1 it is rated 204 points worse (2,892 vs. 3,102)

So we can estimate the increase in Elo scores on the CCRL 40/40 list that is due to improvements in software. But we still don’t know how this compares to the improvements that have come from hardware. For this we need variation in both software and hardware. Furthermore, we need variation in both hardware and software within a single chess ratings list. It may not be valid to compare the Elo improvement from software in one list to the Elo improvement from hardware estimated from a different list1. This is because Elo scores cannot be directly compared across lists. The Elo system only measures the relative performance of players. In human FIDE (World Chess Federation) chess, Elo ratings can be calibrated on a single scale, because the same humans play under identical conditions across different tournaments (there is a single time control for all major FIDE events). According to Wikipedia, computer chess rating lists have no direct relation to FIDE Elo ratings:

there is no calibration between any of these [computer] rating lists and [human] player pools. Hence, the results which matter are the ranks and the differences between the ratings, and not the absolute values. Also, each list calibrates their Elo via a different method. Therefore, no Elo comparisons can be made between the lists.

For example, in the CCRL 40/4 list, the top rated programme has an Elo of 3560, while the current champion of the 40/40 list, which allows 10 times more thinking time, has an Elo of only 3439.

Estimating software progress as what is left unexplained by hardware suffers from omitted variable bias

Supposing we had data from a single ratings list with variation in both software and hardware, we might reason as follows. A chess programme is a piece of software run on some hardware, so hardware and software are jointly exhaustive categories of inputs to chess performance. Hence whatever difference in performance isn’t explained by hardware must be due to software, and vice versa. One could run a regression of Elo scores on some measure(s) of hardware (say, processor clock speed and RAM), the residuals of which would be a hardware-adjusted Elo score. Any change in hardware-adjusted Elo scores would be due to software. However, this line of reasoning is flawed. Hardware is not exogenously determined: it is correlated with the residual “software” term in the regression, because modern chess programmes are run on much better hardware. The regression would therefore suffer from omitted variable bias, overestimating the effect of hardware.

This document’s approach

The SSDF dataset

On computer chess ranking lists, hardware is usually scrupulously standardised in order to compare chess programmes in the fairest way possible.

I could find only one source of data in which hardware varies: the SSDF (Swedish Computer Chess Association) list, which has recorded chess games since 1985 and has changed its hardware several times. I process and merge SSDF’s raw data to produce a novel dataset of 366 programme-hardware pairs and their Elo ratings. The data methods are detailed in section 5.

Measuring software

If we want to include software as an independent variable, we need to measure the “quality of software” of each chess programme in a meaningful and interpretable way. I use two different approaches.

Using programme dummies

When SSDF changed their hardware in 2008, they commented2:

Six [programmes] have been tested on both Q6600 and Athlon 1200 MHz, which makes a comparison possible. The total effect of a faster processor, four instead of one CPU and the use of 64-bit operating system instead of 32 bit has in average given a rating increase of 120 points. Deep Fritz 8 gained the most with 142 points whereas Deep Junior 8 increased the least with 84 points.

This example allows us to compare the gains due to hardware (Athlon to Q6600) to the gains due to software (e.g. Deep Fritz 8 to Deep Rybka 3). We have a 2x2 table of Elo scores:

     
|              | Q6600 | Athlon |
|--------------|-------|--------|
| Deep Rybka 3 | 3193  | 3075   |
| Deep Fritz 8 | 2898  | 2781   |

The generalisation of this approach is to conduct a regression in which \(n-1\) dummy variables are created for \(n\) programmes in the dataset. This allows us to quantify the effect of moving from some “baseline programme” (which has no dummy) to any other programme, controlling for hardware.

I am able to identify 46 instances where a programme was used on more than one hardware configuration, totalling 96 data points (because four programmes were used on three configurations). Out of these, 82 have clock speed data available3. I use the natural log of clock speed throughout, in keeping with Moore’s law. I use Fritz 1 as the baseline programme.

Table 1 presents the results. As expected, this regression explains virtually all the variation, since hardware and software are jointly exhaustive categories of computer chess inputs. The regression coefficient on clock speed is estimated with very high precision.

Since we are using log clock speed, the coefficient implies that each clock speed doubling is worth \(\ln(2)\times 110 \approx 76\) Elo points. This is about the same as the difference between Conchess Glasgow and Fritz 1. For comparison, data from the Stanford CPU Database (Danowitz et al. 2012), which covers 1971-2014, suggests that a clock speed doubling has historically occurred every 3.46 years.
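For concreteness, a regression of this form could be set up as follows. This is an illustrative Python sketch, not the actual analysis (which lives in r.R); the file name and the columns elo, clock_mhz and programme are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per programme-hardware pair, restricted to programmes that
# appear on more than one hardware configuration.
df = pd.read_csv("dummies-data.csv")  # hypothetical file name

# C(...) creates n-1 dummies relative to the baseline programme (Fritz 1);
# clock speed enters in logs, in keeping with Moore's law.
model = smf.ols(
    "elo ~ np.log(clock_mhz) + C(programme, Treatment(reference='Fritz 1'))",
    data=df,
).fit()

# Elo gain per clock speed doubling: coefficient on log clock speed times ln(2).
print(np.log(2) * model.params["np.log(clock_mhz)"])  # roughly 76 in the text
```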

We can see the progression of the Fritz engines, controlling for clock speed:

     
| Programme    | Improvement from Fritz 1 (Elo) | Release date |
|--------------|--------------------------------|--------------|
| Fritz 1      | 0                              | 1992         |
| Fritz 3      | 192                            | 1995         |
| Deep Fritz   | 395                            | 2000         |
| Deep Fritz 7 | 428                            | 2002         |
| Deep Fritz 8 | 462                            | 2004         |

The Fritz line of software has improved by about 39 Elo per year on average, suggesting that software has been responsible for almost twice as much progress as hardware when it comes to Fritz.

![Table 1: Regression of Elo scores on log clock speed with programme dummies](img.png)

Using the date of release

Another approach is to use the release date of a programme as a (very noisy) measure of the amount of developer effort that went into its design. There are two main advantages to this approach. First, more data is available since we are not restricted to instances where a programme was used on more than one hardware configuration. Second, the regression coefficient is easily interpretable as measuring the effect of one year of computer chess development on performance. The main disadvantage is the noisiness of the proxy.

To approximate the release date of a programme, I use the date at which the programme was first played in an SSDF game. The approximation is very good for the most part, except in some cases where the organisers intentionally test very old “legacy” programmes on new hardware for the first time.

I conduct two regressions, presented in Table 2. Regression (1) (\(n=233\)) uses release date and clock speed only. Regression (2) (\(n=138\)) employs a finer-grained measure of hardware that includes clock speed, RAM, and the product of clock speed and number of cores (total speed). All coefficients are estimated with very high statistical precision.

It appears that after accounting for clock speed, the number of cores tells us little about performance, while the coefficient on RAM is actually small and negative. This could be because the small amount of variation in RAM and the number of cores in this dataset (see section 5.2.3) is highly correlated with changes in clock speed. The effect of clock speed is of similar magnitude to that found using dummies: between 69 and 120 Elo per doubling. The effect of the release date is between 10 and 20 Elo points per year. Recall that a clock speed doubling has historically occurred every 3.46 years (Danowitz et al. 2012). This would suggest that hardware has mattered roughly twice as much as software for progress in computer chess.
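A similar sketch for the date-based specification and for the hardware-versus-software comparison in the summary. Again the file and column names are hypothetical, and whether RAM and total speed enter in logs is my guess at the functional form:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("date-data.csv")  # hypothetical: one row per programme-hardware pair

# Regression (2): release year plus the finer-grained hardware measures.
m2 = smf.ols(
    "elo ~ release_year + np.log(clock_mhz) + np.log(ram_mb) + np.log(total_mhz)",
    data=df,
).fit()

elo_per_year_software = m2.params["release_year"]              # ~10-20 in the text
elo_per_doubling = np.log(2) * m2.params["np.log(clock_mhz)"]  # ~69-120 in the text

# With a clock speed doubling roughly every 3.46 years historically,
# hardware's contribution per year is approximately:
elo_per_year_hardware = elo_per_doubling / 3.46
print(elo_per_year_hardware / elo_per_year_software)  # roughly 2
```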

![Table 2: Regressions of Elo scores on release date and hardware measures](img_1.png)

Data processing methods

Sourcing and importing the data

SSDF ranking

The file ssdf-summary-data-original.txt contains the current rating of 366 programme-hardware pairs in tab-separated plain text format, as well as information on hundreds of games. I truncate the file to retain only the ranked list, and convert it to CSV ssdf-summary-data-clean.csv. I call this dataset su.

SSDF data on 54,919 games

I download the plain-text database SSDF.PGN, which fully describes each of the 54,919 games (including every move made by each player). An example entry:

[Event "Testspel av Tony Hed"]

[Site "?"]

[Date "1992.01.01"]

[Round "?"]

[White "Fritz 1 486/33 MHz"]

[Black "Mephisto Academy 6502 5 MHz"]

[Result "0-1"]

[ECO "A28"]

[PlyCount "83"]

1. c4 e5 2. Nc3 Nf6 3. Nf3 Nc6 4. d4 e4 5. Ng5 h6 6. Ngxe4 Nxe4 7. Nxe4 Qh4 8. Nc3 Qxd4 9. e3 Qxd1+ 10. Kxd1 Be7 11. Nd5 Bd8 12. Be2 O-O 13. Bd2 d6 14. Bc3 Ne5 15. Ba5 c6 16. Bxd8 Rxd8 17. Ne7+ Kh7 18. Nxc8 Raxc8 19. Ke1 d5 20. b3 dxc4 21. Bxc4 Rc7 22. Rd1 Rxd1+ 23. Kxd1 Rd7+ 24. Kc2 Nxc4 25. bxc4 Kg6 26. Rd1 Rxd1 27. Kxd1 Kf5 28. Ke2 c5 29. h3 a6 30. Kd3 b5 31. f4 g5 32. g3 bxc4+ 33. Kxc4 g4 34. e4+ Kxe4 35. hxg4 f6 36. Kxc5 Kf3 37. g5 hxg5 38. Kd5 Kxg3 39. fxg5 fxg5 40. Kc5 g4 41. Kb6 Kf4 42. Kxa6 0-1

I write the Python script extract-earliest-dates.py, which uses regular expressions to clean up the file. In the resulting output3.txt, there is one line for every game; each line gives the date and the programme-hardware pair which played White, separated by a comma. I call this data full.
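The script itself is not reproduced here, but the extraction it performs is roughly as follows (my reconstruction in Python; the file encoding is a guess):

```python
import re

date_re = re.compile(r'\[Date "([^"]*)"\]')
white_re = re.compile(r'\[White "([^"]*)"\]')

# For each game, emit one line: the date and the White programme-hardware pair.
with open("SSDF.PGN", encoding="latin-1") as f, open("output3.txt", "w") as out:
    date = None
    for line in f:
        m = date_re.match(line)
        if m:
            date = m.group(1)
            continue
        m = white_re.match(line)
        if m and date is not None:
            out.write(f"{date},{m.group(1)}\n")
            date = None
```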

Data analysis in R

Cleaning both data sets

All subsequent data cleaning and analysis are conducted in r.R. In the source data, the software-hardware pairs are in unstructured plain text. For example:

Rebel Century 3 K6-2 450 MHz

P.Fritz 3 Glaurung 2.1 PXA270 520 MHz

Pocket Fritz 2 Shredder PXA255 400 MHz

Goliath Light K6-2 450 MHz

Fritz 5.32 64MB P200 MHz MMX

Crafty 17.07 CB K6-2 450 MHz

Nimzo 99 K6-2 450 MHz

Resurrection Rybka 2.2 ARM 203 MHz

MChess Pro 8 K6-2 450 MHz

Genius 6.5 K6-2 450 MHz

Chessmaster 6000 64MB P200 MHz MMX

Hiarcs 7.32 64MB P200 MHz MMX

Fritz 5 PB29% 67MB P200 MHz MMX

ChessGenius 3 ZTE Apex3 ARM A53 1.3 GHz

Fritz 4 Pentium 90 MHz

Kallisto 1.98 Pentium 90 MHz

MChess Pro 5 486/50-66 MHz

Rebel 7 486/50-66 MHz

Significant data cleaning is required. I define 5 functions using regular expressions, which are applied in the order below (a sketch of the clock-speed logic follows the list). For consistency, I apply the same functions to both su and full.

  • extract_clock_speeds() uses two different methods. First, it matches strings of the form 2.4 GHz where a decimal number is followed by GHz or MHz. Second, it captures anything to the right of a processor (see extract_processors() below), since clock speeds are usually given immediately after the name of the processor.

  • clean_clock_speeds() removes extraneous characters and turns strings with GHz and MHz into numeric values in MHz.

  • extract_processors() catches anything that matches a manually collected list of 32 processors.

  • standardise_progs() removes whitespace and trailing zeroes. Next, and somewhat contentiously, it removes x64 and MP (“multi-processor”). This is because these labels are inconsistently applied between the two datasets, and to my understanding also within SSDF.PGN. Keeping these labels would make merging much more difficult, and the date-based analysis relies crucially on merging datasets. For the dummies analysis, the choice is less defensible, but allows much more data to be used. The hope is that there is not too much difference between the x64 and MP versions and other versions of a single programme. If I did further work with this data, I would look at the impact of these choices on the results.

  • clean_progs() removes extraneous characters.
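As promised above, here is an illustrative Python version of the clock-speed logic. The real implementations are R functions in r.R, and they also use the processor-based second method, which is not shown here:

```python
import re

GHZ_MHZ = re.compile(r"(\d+(?:\.\d+)?)\s*(GHz|MHz)", re.IGNORECASE)

def extract_clock_speed(name):
    """Return the raw clock speed string found in a programme-hardware label, if any."""
    m = GHZ_MHZ.search(name)
    return m.group(0) if m else None

def clean_clock_speed(raw):
    """Convert a raw '2.4 GHz' / '450 MHz' string into a numeric value in MHz."""
    if raw is None:
        return None
    m = GHZ_MHZ.search(raw)
    value, unit = float(m.group(1)), m.group(2).lower()
    return value * 1000 if unit == "ghz" else value

print(clean_clock_speed(extract_clock_speed("Rebel Century 3 K6-2 450 MHz")))            # 450.0
print(clean_clock_speed(extract_clock_speed("ChessGenius 3 ZTE Apex3 ARM A53 1.3 GHz")))  # 1300.0
```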

Processing full

full contains missing dates, indicated using question marks. The function unknown_date_default() deals with missing values in the following way (a small sketch follows the list):

  • An unknown year defaults to 2100, since we are only interested in the earliest appearance of a programme

  • An unknown month or day defaults to 01 (the more principled choice would be 31 for a missing day and 12 for a missing month, but this would require a more cumbersome regular expression, and I don’t need that much precision)
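A minimal sketch of this defaulting rule (illustrative Python; the real function is defined in r.R):

```python
def unknown_date_default(date):
    """Replace '?' components of a 'YYYY.MM.DD' PGN date with defaults.

    An unknown year defaults to 2100 (so the game can never count as a
    programme's earliest appearance); unknown months and days default to 01.
    """
    year, month, day = date.split(".")
    year = "2100" if "?" in year else year
    month = "01" if "?" in month else month
    day = "01" if "?" in day else day
    return f"{year}.{month}.{day}"

print(unknown_date_default("1992.??.??"))  # 1992.01.01
print(unknown_date_default("????.??.??"))  # 2100.01.01
```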

Then I drop all observations that are not the earliest appearance of a programme. I also remove games that are claimed to have occurred in 1900.

Imputing RAM and number of cores

The number of cores in a processor and the amount of RAM on the computer used for a game are usually not given in the source data (only the clock speed is). This information needs to be added manually. On hardware.htm, the following is written:

The hardware of the different hardware levels:

  • Intel Pentium 90 MHz - 8-16 MB RAM

  • Intel Pentium 200 MHz MMX - 32-64 MB RAM

  • AMD K6-2 450 MHz - Single processor, 32 bit OS, 128 MB RAM, 5 piece TableBases on HHD

  • AMD Athlon Thunderbird 1200 MHz - Single processor, 32 bit OS, 256 MB RAM, 5p TB on HHD

  • Intel Core2Quad Q6600 2400 MHz - Quad Core processor, 64 bit OS, 2 GB RAM, 5p TB on HHD

  • AMD Ryzen 7 1800X 3600 MHz - Octa Core processor, 64 bit OS, 16 GB RAM, 6p Syzygy on SSD (or 5p Nalimov on SSD)

I manually write this in processor-info.csv, to be merged later. Further work on this could involve digging up more such information. Sometimes, when a new version of the list is released, SSDF adds a comment on comment.htm. I have extracted all such comment pages from the Internet Archive; they can be found in the directory datacomments. I have not studied them, but more information on the hardware setups that were historically used could be hidden there.

Merging

For the date-based analysis, I now finally merge su and full, matching by the programme name. I also merge the imputed information in ramcores.

For the dummies analysis, I only merge ramcores.

When information on the number of cores is available, I compute the clock speed times the number of cores, which is the total number of operations per second available to the processor.

Further data

I use the Ruby script wayback_machine_downloader (Hartator 2018) to pull all versions of the entire SSDF site stored on the Wayback Machine. I have not yet done anything with this data.

When analysing the Stanford CPU database (Danowitz et al. 2012), I use only the summary file processor.csv. I regress the date on the natural log of clock speed, then multiply the coefficient by \(\ln(2)\) and divide by 365 to obtain the average number of years per clock speed doubling.
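In code, that calculation looks roughly like the following; the column names in processor.csv are assumptions on my part:

```python
import numpy as np
import pandas as pd

cpu = pd.read_csv("processor.csv")

days = pd.to_datetime(cpu["date"]).map(lambda d: d.toordinal())  # release date in days
log_clock = np.log(cpu["clock_mhz"])                             # hypothetical column name

# Slope of date on ln(clock speed): days per unit increase in log clock speed.
slope = np.polyfit(log_clock, days, 1)[0]

# A doubling increases ln(clock speed) by ln(2); convert days to years.
years_per_doubling = slope * np.log(2) / 365
print(years_per_doubling)  # ~3.46 according to the text
```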

References

Danowitz, Andrew, Kyle Kelley, James Mao, John P. Stevenson, and Mark Horowitz. 2012. “CPU DB: Recording Microprocessor History.” *Commun. ACM* 55 (4): 55–63. <https://doi.org/10.1145/2133806.2133822>.
Grace, Katja. 2013. “Algorithmic Progress in Six Domains,” 60.
———. 2015. “Research Topic: Hardware, Software and AI.” *AI Impacts*. https://aiimpacts.org/research-topic-hardware-software-and-ai/.
———. 2017. “Effect of Marginal Hardware on Artificial General Intelligence.” *AI Impacts*. https://aiimpacts.org/effect-of-marginal-hardware-on-artificial-general-intelligence/.
Hartator. 2018. “Download an Entire Website from the Wayback Machine.” https://github.com/hartator/wayback-machine-downloader.
  1. To the extent that Elos from different lists are not comparable, analyses such as that in section 5.1.3 of Grace (2013) would be invalidated. 

  2. See datacomments/comment_003.htm 

  3. I do not use the RAM and cores data I have collected (see section 5), since this would reduce the number of data points even more, but such a regression could easily be conducted. 

September 21, 2022

How to run Cronicle (a cron replacement) in a Docker container

I really don’t like cron jobs and crontab:

  • crontab has a horrible syntax (it’s from 1975…)
  • logging the output of jobs needs to be specified manually
  • viewing logs is inconvenient (even just for checking whether a job ran or not!)
  • cron jobs run in a minimal environment that’s inconvenient to modify

Cronicle is a friendlier alternative (“a task scheduler with a web based front-end UI”).

Very important: the default username/password for the web interface is admin/admin. This could let anyone run arbitrary shell commands on your server! Change the password immediately after setting up.

Here’s how to run Cronicle Dockerized.

docker run -d \
  -v /cronicle-data/data:/opt/cronicle/data:rw \
  -v /cronicle-data/logs:/opt/cronicle/logs:rw \
  -v /cronicle-data/plugins:/opt/cronicle/plugins:rw \
  -v /cronicle-data/app:/app:rw \
  --hostname your_hostname.com -p 11531:3012 \
  -e CRONICLE_base_app_url='http://your_hostname.com:11531' \
  --name cronicle \
  bluet/cronicle-docker:latest

You can now point your_hostname.com to the server’s IP address and visit http://your_hostname.com:11531 in your browser to access the Cronicle web interface.

Comments:

  • 11531 is an arbitrarily chosen port number. You would normally use port 80, the default HTTP port; for me that port is occupied by other applications I run on the host web server.
  • The source for the Docker image bluet/cronicle-docker is here.
  • On 7 May 2022, I confirmed that these steps work on a brand new server, with revision 3e4211e of bluet/cronicle-docker.
May 7, 2022

How much of the fall in fertility could be explained by lower mortality?

Many people think that lower child mortality causes fertility to decline.

One prominent theory for this relationship, as described by Our World in Data1, is that “infant survival reduces the parents’ demand for children”2. (Infants are children under 1 year old.)

In this article, I want to look at how we can precisify that theory, and what magnitude the effect could possibly take. What fraction of the decline in birth rates could the theory explain?

Important. I don’t want to make claims here about how parents actually make fertility choices. I only want to examine the implications of various models, and specifically how much of the observed changes in fertility the models could explain.

Constant number of children

One natural interpretation of “increasing infant survival reduces the parents’ demand for children” is that parents are adjusting the number of births to keep the number of surviving children constant.

Looking at Our World in Data’s graph, we can see that in most of the countries depicted, the infant survival rate went from about 80% to essentially 100%. This is a factor of 1.25. Meanwhile, there were 1/3 as many births. If parents were adjusting the number of births to keep the number of surviving children constant, the rise in infant survival would explain a fall in births by a factor of 1/1.25 = 0.8, a 20% decline that is only 30% of the observed two-thirds decline in births.

The basic mathematical reason this happens is that even when mortality is tragically high, the survival rate is still thankfully much closer to 1 than to 0, so even a very large proportional fall in mortality will only amount to a small proportional increase in survival.

Some children survive infancy but die later in childhood. Although Our World in Data’s quote focuses on infant mortality, it makes sense to consider older children too. I’ll look at under-5 mortality, which generally has better data than older age groups, and also captures a large fraction of all child mortality3.

England (1861-1951)

England is a country with an early demographic transition and good data available.

Doepke 2005 quotes the following numbers:

|                     | 1861 | 1951  |
|---------------------|------|-------|
| Infant mortality    | 16%  | 3%    |
| 1-5 yo mortality    | 13%  | 0.5%  |
| 0-5 yo mortality    | 27%  | 3.5%  |
| Survival to 5 years | 73%  | 96.5% |
| Fertility           | 4.9  | 2.1   |

Fertility fell by 57%, while survival to 5 years rose by 32%. Hence, if parents aim to keep the number of surviving children constant, the change in child survival can explain 43%4 of the actual fall in fertility. (It would have explained only 23% had we erroneously considered only the change in infant survival.)

Sub-Saharan Africa (1990-2017)

If we look now at sub-Saharan Africa data from the World Bank, the 1990-2017 change in fertility is from 6.3 to 4.8, a 25% decrease, whereas the 5-year survival rate went from 0.82 to 0.92, a 12% increase. So the fraction of the actual change in fertility that could be explained by the survival rate is 44%. (This would have been 23% had we looked only at infant survival).
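These percentages follow from a one-line calculation: the fall in births implied by holding surviving children constant, divided by the observed fall in fertility. A sketch using the rounded figures quoted above (the 44% for sub-Saharan Africa uses slightly more precise inputs from the linked spreadsheet):

```python
def fraction_explained(surv_before, surv_after, fert_before, fert_after):
    """Share of the observed fertility decline explained if parents held the
    number of surviving children constant (births scale with 1/survival)."""
    decline_implied = 1 - surv_before / surv_after   # fall in births implied by survival change
    decline_observed = 1 - fert_after / fert_before  # observed fall in fertility
    return decline_implied / decline_observed

print(fraction_explained(0.73, 0.965, 4.9, 2.1))  # England 1861-1951: ~0.43
print(fraction_explained(0.82, 0.92, 6.3, 4.8))   # Sub-Saharan Africa 1990-2017: ~0.46 with these rounded inputs
```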

Source data and calculations. Chart not showing up? Go to the .svg file.

So far, we have seen that this very simple theory of parental decision-making can explain 30-44% of the decline in fertility, while also noticing that considering childhood mortality beyond infancy was important to giving the theory its full due.

However, in more sophisticated models of fertility choices, the theory looks worse.

A more sophisticated model of fertility decisions

Let us imagine that instead of holding it constant, parents treat the number of surviving children as one good among many in an optimization problem.

An increase in the child survival rate can be seen as a decrease in the cost of surviving children. Parents will then substitute away from other goods and increase their target number of surviving children. If your child is less likely to die as an infant, you may decide to aim to have more children: the risk of experiencing the loss of a child is lower.5

For a more formal analysis, we can turn to the Barro and Becker (1989) model of fertility. I’ll be giving a simplified version of the presentation in Doepke 2005.

In this model, parents care about their own consumption as well as their number of surviving children. The parents maximise6

\[U(c,n) = u(c) + n^\epsilon V\]

where

  • \(n\) is the number of surviving children and \(V\) is the value of a surviving child
  • \(\epsilon\) is a constant \(\in (0,1)\)
  • \(u(c)\) is the part of utility that depends on consumption7

The income of a parent is \(w\), and there is a cost per birth of \(p\) and an additional cost of \(q\) per surviving child8. The parents choose \(b\), the number of births. \(s\) is the probability of survival of a child, so that \(n=sb\).

Consumption is therefore \(c=w-(p+qs)b\) and the problem becomes \(\max_{b} U = u(w-(p+qs)b) + (sb)^\epsilon V\)

Letting \(b^{*}(s)\) denote the optimal number of births as a function of \(s\), what are its properties?

The simplest one is that \(sb^*(s)\), the number of surviving children, is increasing in \(s\). This is the substitution effect we described intuitively earlier in this section. This means that if \(s\) is multiplied by a factor \(x\) (say 1.25), \(b^*(s)\) will be multiplied by more than \(1/x\) (i.e. by more than 0.8).

When we looked at the simplest model, with a constant number of children, we guessed that it could explain 30-44% of the fall in fertility. That number is a strict upper bound on what the current model could explain.

What we really want to know, to answer the original question, is how \(b^*(s)\) itself depends on \(s\). To do this, we need to get a little bit more into the relative magnitude of the cost per birth \(p\) and the additional cost \(q\) per surviving child. As Doepke writes,

If a major fraction of the total cost of children accrues for every birth, fertility [i.e. \(b^*(s)\)] would tend to increase with the survival probability; the opposite holds if children are expensive only after surviving infancy9.

This tells us that falling mortality could actually cause fertility to increase rather than decrease.10
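To make this concrete, here is a small numerical sketch of the model above, with entirely arbitrary illustrative parameters (not Doepke’s calibration). A grid search over \(b\) shows \(b^*(s)\) rising with \(s\) when the per-birth cost \(p\) dominates and falling when the per-survivor cost \(q\) dominates, while the number of surviving children \(sb^*(s)\) rises in both cases:

```python
import numpy as np

def optimal_births(s, p, q, w=1.0, sigma=0.5, eps=0.5, V=0.5):
    """Grid-search the optimal number of births b* in the simplified Barro-Becker model."""
    b = np.linspace(1e-6, w / (p + q * s) - 1e-6, 200_000)   # feasible births (c > 0)
    c = w - (p + q * s) * b                                  # parental consumption
    U = c ** (1 - sigma) / (1 - sigma) + V * (s * b) ** eps  # CRRA utility plus children term
    return b[np.argmax(U)]

for p, q, label in [(0.05, 0.01, "cost mostly per birth"),
                    (0.01, 0.05, "cost mostly per survivor")]:
    b_lo, b_hi = optimal_births(0.73, p, q), optimal_births(0.965, p, q)
    print(f"{label}: b*(0.73)={b_lo:.1f}, b*(0.965)={b_hi:.1f}, "
          f"surviving children {0.73 * b_lo:.1f} -> {0.965 * b_hi:.1f}")
```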

To go further, we need to plug in actual values for the model parameters. Doepke does this, using numbers that reflect the child mortality situation of England in 1861 and 1951, but also what seem to be some pretty arbitrary assumptions about the parent’s preferences (the shape of \(u\) and the value of \(\epsilon\)).

With these assumptions, he finds that “the total fertility rate falls from 5.0 (the calibrated target) to 4.2 when mortality rates are lowered to the 1951 level”11, a 16% decrease. This represents 28% of the actually observed fall in fertility to 2.1.

Extensions of Barro-Becker model

The paper then considers various extensions of the basic Barro-Becker model to see if they could explain the large decrease in fertility that we observe.

For example, it has been hypothesized that when there is uncertainty about whether a child will survive (hitherto absent from the models), parents want to avoid the possibility of ending up with zero surviving children. They therefore have many children as a precautionary measure. Declining mortality (which reduces uncertainty, since survival rates are thankfully greater than 0.5) would then have a strong negative impact on births.

However, Doepke also considers a third model, which incorporates not only stochastic mortality but also sequential fertility choice, where parents may condition their fertility decisions on the observed survival of children that were born previously. The sequential aspect reduces the uncertainty that parents face over the number of surviving children they will end up with.

The stochastic and sequential models make no clear-cut predictions based on theory alone. Using the England numbers, however, Doepke finds a robust conclusion. In the stochastic+sequential model, for almost all reasonable parameter values, the expected number of surviving children still increases with \(s\) (my emphasis):

To illustrate this point, let us consider the extreme case [where] utility from consumption is close to linear, while risk aversion with regards to the number of surviving children is high. … [W]hen we move (with the same parameters) to the more realistic sequential model, where parents can replace children who die early, … despite the high risk aversion with regards to the number of children, total fertility drops only to 4.0, and net fertility rises to 3.9, just as with the benchmark parameters. … Thus, in the sequential setup the conclusion that mortality decline raises net fertility is robust to different preference specifications, even if we deliberately emphasize the precautionary motive for hoarding children.

So even here, the fall in mortality would only explain 35% of the actually observed change in fertility. It seems that the ability to “replace” children who did not survive in the sequential model is enough to make its predictions pretty similar to the simple Barro-Becker model.

  1. The quote in context on Our World in Data’s child mortality page: “the causal link between infant [<1 year old] survival and fertility is established in both directions: Firstly, increasing infant survival reduces the parents’ demand for children. And secondly, a decreasing fertility allows the parents to devote more attention and resources to their children.” 

  2. As an aside, my impression is that if you asked an average educated person “Why do women in developing countries have more children?”, their first idea would be: “because child mortality is higher”. It’s almost a trope, and I feel that it’s often mentioned pretty glibly, without actually thinking about the decisions and trade-offs faced by the people concerned. That’s just an aside though – the theory clearly has prima facie plausibility, and is also cited in serious places like academia and Our World in Data. It deserves closer examination. 

  3. It should be possible to conduct the Africa analysis for different ages using IHME’s more granular data, but it’s a bit more work. (There appears to be no direct data on deaths per birth as opposed to per capita, and data on fertility is contained in a different dataset from the main Global Burden of Disease data.) 

  4. All things decay. Should this Google Sheets spreadsheet become inaccessible, you can download this .xlsx copy which is stored together with this blog. 

  5. In this light, we can see that the constant model is not really compatible with parents viewing additional surviving children as a (normal) good. Nor of course is it compatible with viewing children as a bad, for then parents would choose to have 0 children. Instead, it could for example be used to represent parents aiming for a socially normative number of surviving children. 

  6. I collapse Doepke’s \(\beta\) and \(V\) into a single constant \(V\), since they can be treated as such in Model A, the only model that I will present mathematically in this post. 

  7. Its actual expression, that I omit from the main presentation for simplicity, is \(u(c)=\frac{c^{1-\sigma}}{1-\sigma}\), the constant relative risk-aversion utility function. 

  8. There is nothing in the model that compels us to call \(p\) the “cost per birth”, this is merely for ease of exposition. The model itself only assumes that there are two periods for each child: in the first period, costing \(p\) to start, children face a mortality risk; and in the second period, those who survived the first face zero mortality risk and cost \(q\). 

  9. Once again, Doepke calls the model’s early period “infancy”, but this is not inherent in the model. 

  10. It’s difficult to speculate about the relative magnitude of \(p\) and \(q\), especially if, departing from Doepke, we make the early period of the model, say, the first 5 years of life. If the first period is only infancy, it seems plausible to me that \(q \gg p\), but then we also fail to capture any deaths after infancy. On the other hand, extending the early period to 5 incorrectly assumes that parents get no utility from children before they reach the age of 5. 

  11. The following additional context may be helpful to understand this quote:

    The survival parameters are chosen to correspond to the situation in England in 1861. According to Preston et al. (1972) the infant mortality rate (death rate until first birthday) was \(16\%\), while the child mortality rate (death rate between first and fifth birthday) was \(13\%\). Accordingly, I set \(s_{i}=0.84\) and \(s_{y}=0.87\) in the sequential model, and \(s=s_{i} s_{y}=0.73\) in the other models. Finally, the altruism factor \(\beta\) is set in each model to match the total fertility rate, which was \(4.9\) in 1861 (Chesnais 1992). Since fertility choice is discrete in Models B and C, I chose a total fertility rate of \(5.0\) as the target.

    Each model is thus calibrated to reproduce the relationship of fertility and infant and child mortality in 1861. I now examine how fertility adjusts when mortality rates fall to the level observed in 1951, which is \(3\%\) for infant mortality and \(0.5\%\) for child mortality. The results for fertility can be compared to the observed total fertility rate of \(2.1\) in 1951.

    In Model A (Barro-Becker with continuous fertility choice), the total fertility rate falls from \(5.0\) (the calibrated target) to \(4.2\) when mortality rates are lowered to the 1951 level. The expected number of surviving children increases from \(3.7\) to \(4.0\). Thus, there is a small decline in total fertility, but (as was to be expected given Proposition 1) an increase in the net fertility rate.

August 5, 2021