How long does it take to sample from a distribution?

Suppose a study comes out about the effect of a new medication and you want to precisely compute how to update your beliefs given this new evidence. You might use Bayes’ theorem for continuous distributions.

p(\theta | x) =\frac{p(x | \theta) p(\theta) }{p(x)}=\frac{p(x | \theta) p(\theta) }{\int_\Theta p(x | \theta) p(\theta) d \theta}

The normalization constant (the denominator of the formula) is an integral that is not too difficult to compute, as long as the distributions are one-dimensional.

For example, with:

from scipy import stats
from scipy import integrate

prior = stats.lognorm(scale=math.exp(1),s=1)
likelihood = stats.norm(loc=5,scale=20)
def unnormalized_posterior_pdf(x):
	return prior.pdf(x)*likelihood.pdf(x)
normalization_constant = integrate.quad(
    unnormalized_posterior_pdf,-np.inf,np.inf)[0]

the integration runs in less than 100 milliseconds on my machine. So we can get a PDF for an arbitrary 1-dimensional posterior very easily.

But taking a single sample from the (normalized) distribution takes about a second:

# Normalize unnormalized_posterior_pdf
# using the method above and return the posterior as a
# scipy.stats.rv_continuous object.
# This takes about 100 ms
posterior = update(prior,likelihood) 

# Take 1 random sample, this takes about 1 s
posterior.rvs(size=1) 

And this difference can be even starker for higher-variance posteriors (with s=4 in the lognormal prior, I get 250 ms for the normalization constant and almost 10 seconds for 1 random sample).

For a generic continuous random variable, rvs uses inverse transform sampling. It first generates a random number from the uniform distribution between 0 and 1, then passes this number to ppf, the percent point function, or more commonly quantile function, of the distribution. This function is the inverse of the CDF. For a given percentile, it tells you what value corresponds to that percentile of the distribution. Randomly selecting a percentile $x$ and evaluating the $x$ th percentile of the distribution is equivalent to randomly sampling from the distribution.

How is ppf evaluated? The CDF, which in general (and in fact most of the time¹) has no explicit expression at all, is inverted by numerical equation solving, also known as root finding. For example, evaluating ppf(0.7) is equivalent to solving cdf(x)-0.7=0, which can be done with numerical methods. The simplest such method is the bisection algorithm, but more efficient ones have been developed (ppf uses Brent’s method). The interesting thing for the purposes of runtime is that the root finding algorithm must repeatedly call cdf in order to narrow in on the solution. Each call to cdf means an expensive integration of the PDF.

The bisection algorithm to solve cdf(x)-0.7=0

An interesting corollary is that getting one random number is just as expensive as computing a chosen percentile of the distribution using ppf (assuming that drawing a random number between 0 and 1 takes negligible time). For approximately the cost of 10 random numbers, you could characterize the distribution by its deciles.

On the other hand, sampling from a distribution whose family is known (like the lognormal) is extremely fast with rvs. I’m getting 10,000 samples in a millisecond (prior.rvs(size=10000)). This is not because there exists an analytical expression for its inverse CDF, but because there are very efficient algorithms² for sampling from these specific distributions³.

So far I have only spoken about 1-dimensional distributions. The difficulty of computing the normalization constant in multiple dimensions is often given as a reason for using numerical approximation methods like Markov chain Monte Carlo (MCMC). For example, here:

Although in low dimension [the normalization constant] can be computed without too much difficulties, it can become intractable in higher dimensions. In this last case, the exact computation of the posterior distribution is practically infeasible and some approximation techniques have to be used […]. Among the approaches that are the most used to overcome these difficulties we find Markov Chain Monte Carlo and Variational Inference methods.

However, the difficulty of sampling from a posterior distribution that isn’t in a familiar family could be a reason to use such techniques even in the one-dimensional case. This is true despite the fact that we can easily get an analytic expression for the PDF of the posterior.

For example, with the MCMC package emcee, I’m able to get 10,000 samples from the posterior in 8 seconds, less than a millisecond per sample and a 1,000x improvement over rvs!

ndim, nwalkers, nruns = 1, 20, 500

start = time.time()
def log_prob(x):
    if posterior.pdf(x)>0:
        return math.log(posterior.pdf(x))
    else:
        return -np.inf
sampler = emcee.EnsembleSampler(nwalkers, 1, log_prob)
sampler.run_mcmc(p0, nruns) #p0 are the starting samples

These samples will only be drawn from a distribution approximating the posterior, whereas rvs is as precise as SciPy’s root finding and integration algorithms. However, I think there are MCMC algorithms out there that converge very well.

Here’s the code for running the timings on your machine.

“For a continuous distribution, however, we need to integrate the probability density function (PDF) of the distribution, which is impossible to do analytically for most distributions (including the normal distribution).” Wikipedia on Inverse transform sampling. ↩
“For the normal distribution, the lack of an analytical expression for the corresponding quantile function means that other methods (e.g. the Box–Muller transform) may be preferred computationally. It is often the case that, even for simple distributions, the inverse transform sampling method can be improved on: see, for example, the ziggurat algorithm and rejection sampling. On the other hand, it is possible to approximate the quantile function of the normal distribution extremely accurately using moderate-degree polynomials, and in fact the method of doing this is fast enough that inversion sampling is now the default method for sampling from a normal distribution in the statistical package R.” Wikipedia on Inverse transform sampling. ↩
The way it works in Python is that, in the definition of the class Lognormal (a subclass of the continuous random variable class), the generic inverse transform rvs method is overwritten with a more tailored sampling algorithm. SciPy will know to apply the more efficient method when rvs is called on an instance of class Lognormal. ↩

May 31, 2020

Hidden subsidies for cars

Personal vehicles are ubiquitous. They dominate cities. They are actually so entrenched that they can blend into the background, no longer rising to our attention. Having as many cars as we do can seem to be the ‘natural’ state of affairs.

Our level of car use could perhaps be called natural if it were the result of people’s preferences interacting in well-functioning markets. No reader of this blog, I take it, would believe such a claim. The negative externalities of cars are well-documented: pollution, congestion, noise, and so on.

The subsidies for cars are less obvious, but I think they’re also important.

In our relationship to cars in the urban environment, we’re almost like David Foster Wallace’s fish who asked ‘what the hell is water?’. I want to flip that perspective and point out some specific government policies that increase the number of cars in cities.

“Manhattan, 1964” by Evelyn Hofer

Free or cheap street parking

Privately provided parking in highly desirable city centres can cost hundreds of dollars a month. But the government provides car storage on the side of the street for a fraction of that, often for free.¹

The width of roads

Streets and sidewalks sit on large amounts of strategically placed land that is publicly owned. Most of that land is devoted to cars. On large thoroughfares, I’d guess cars take easily 70% of the space, leaving only thin slivers on each side for pedestrians.

This blogger estimates, apparently by eyeballing Google Maps, that streets take up 43% of the land in Washington DC, 25% in Paris, and 20% in Tokyo.

Space that is now used for parked cars or moving cars could be used, for example, by shops and restaurants, for bikeshare stations, to plant trees, for parklets, or even to add more housing. And if there was a market for this land I’m sure people would come up with many other clever uses.

Highways

Even if highways aren’t actually inside the city, they have important indirect effects on urban life. Whether the government pays for highways or train lines to connect cities to each other is a policy choice with clear effects on day to day life in the city, even for those who do not travel.

In the United States, this implicit subsidy for cars is large. According to the department of transportation, in 2018 $49 billion out of the department’s budget of $87 billion was spent on highways².

In this post I don’t want to get into the very complicated question of how much governments should optimally spend on highways. For all I know the U.S. policy may be optimal. My point is only that any government spending on highways indirectly subsidises the presence of cars in cities. This is non-obvious and worth pointing out. When the government pays for a Metro in your city, the subsidy to Metros plain to see. Meanwhile, the subsidy to cars via a huge network of roads across the country passes unnoticed by many.

To be fair, in the United States federal spending on highways is largely financed by taxes on vehicle fuel. So it’s not clear whether federal highways policy is a net subsidy to cars. However, the way highway spending is financed varies by country. For example, in Germany, “federal highways are funded by the federation through a combination of general revenue and receipts from tolls imposed on truck traffic”.

Minimum parking requirements

Many zoning codes require new buildings to include some fixed number of off-street parking spaces. This isn’t as much of a problem in the European cities I’m familiar with, but in the US, parking minimums are far beyond what the market would provide, and are a significant cost to developers. One paper estimated that the cost of parking in Los Angeles increases the cost of office space by 27-67%³.

Suburban sprawl

United States built sprawling suburbs in the postwar period. I still remember the famous aerial view of Levittown, the prototypical prefabricated suburb, from my middle school history book.

The growth of suburbia was aided by specific government policies that tipped the scales in favour of individual homes in the suburbs, and against apartments in cities. The growth of suburbia led to more cars in the city, because people who live in suburbs are much more likely to drive to work.

Devon Zuegel has an excellent exposition of how federal mortgage insurance subsidized suburbia⁴:

[The federal housing administration] provides insurance on mortgages that meet certain criteria, repaying the principal to lenders if borrowers default. […] Mortgages had to meet an opinionated set of criteria to qualify for the federal insurance. […] The ideal house had “sunshine, ventilation, scenic outlook, privacy, and safety”, and “effective landscaping and gardening” added to its worth. The guide recommended that houses should be set back at least 15 feet from the road, and well-tended lawns that matched the neighbors’ yards helped the rating. […] [The FHA manual] prescribed minimum street widths and other specific measurements.

The federal government was effectively prescribing how millions of Americans should live, down to their landscaping and gardening! I wonder if Khrushchev brought up this interesting fact about American life in his conversations with Eisenhower. ;)

Why scientific fraud is hard to catch

It’s nearly impossible to catch a scientific fraudster if they’re halfway competent.

Uri Simonsohn has become a minor nerd celeb by exposing fraudulent academic scientists who used fabricated data to get published. The Atlantic called him “the data vigilante”. I’ll describe two simple statistical techniques he has used – and why I’m pessimistic about the impact of such techniques.

If a parameter is measured with many significant digits, the last digit should be distributed uniformly 0-9. In a study of an intervention to increase factory workers’ use of hand sanitizer, sanitizer use was measured with a scale sensitive to the 100th of a gram. But the data had an unusual prevalence of 6s, 7s and 9s on the last digit. Uri Simonsohn and colleagues conducted a chi-square test and reject the hypothesis that the digits follow a uniform distribution, p=0.00000000000000001.¹

A second sign of fraudulent data is if the baseline means are too similar between treatment groups. In one of the hand sanitizer studies, there were 40 participants, 20 in the control condition and 20 in the treatment condition. Simonsohn used a “bootstrapping” technique – randomly shuffling the 40 observations into two groups of 20, and repeating this millions of times, in order to estimate how often we would see such similar means if the data were truly drawn randomly (less than once in a 100,000)².

There are other, more mathematically intense techniques for forensic data analysis³, but the common theme among them is to detect fraudsters creating suspiciously non-random data.

I want to tell these hand sanitizer people: come on, how hard can it be to use a random number generator? We know people are bad at producing randomness. In poker, it’s often optimal to play a mixed strategy, which requires randomising your play. But we have a strong natural tendency to play non-randomly, so poker players have developed ad hoc randomisation devices, like looking at your watch and playing call if you’re in the first half of the minute and fold if you’re in the second half. A similar incapacity to produce enough randomness seems to have befallen these amateurish scientific fakers. In order to produce data that violates the last-digit-uniformity law, you have to literally be writing the fake numbers by hand into a computer!

Savvier baddies would not shoot themselves in the foot in this way. It’s very easy to just draw some random numbers from a pre-specified distribution.

I can imagine that as you run more complex experiments, with multiple treatment arms and many potentially correlated parameters, it becomes difficult to create realistic fake data, even if you randomly draw it from a distribution. Some inconsistency could always escape your notice, and a sufficiently determined data sleuth might catch you.

But there’s a much easier solution: just run a legitimate experiment, and then add a constant of your choice to all observations in the treatment group. This data would look exactly like the real thing – the only lie would be that the “treatment” was you logging on to the computer in the middle of the night and changing the numbers. I can’t think of any way this misconduct could be detected statistically. And it has the additional benefit that you’re running an experiment, so people in your department won’t be wondering where you’re getting all that data from.

Statistical sleuthing is fun, but I suspect it’s powerless against the majority of fraud.

My broader hope is that we’ll see a rise in the norm of having multiple independent replications of a study. This single tide should wash away many of the problems with current science. If a study fails to replicate multiple times, the result will lose credibility – even if we never find out whether it was due to outright fraud or merely flawed science.

http://datacolada.org/74, Figure 2 ↩
http://datacolada.org/74, Problem 4 ↩
see the “fake data” category of Simonsohn’s blog Data Colada, which by the way is excellent on many topics besides fraud. ↩

August 2, 2019

« Prev 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Next »

bayes.net