The war for statistics in the real world
In the last post, I brought up the two prevailing schools of thought in modern statistics, and showed how each of them might tackle a hypothetical scenario. Although this was a toy problem, similar questions have been posed in a very real-world context. Consider the following…
The year was 1944, and the Allies were preparing for the D-Day landings. They planned to use Sherman tanks (made by the US) in the assault, since they performed well against the tanks that the Germans had used thus far in the war – the Panzer III and Panzer IV models. Recently the Germans had begun to use a new model – the Panzer V – with higher-velocity guns and heavier frontal armour, which would have been a worse match-up for the Sherman tanks. These new tanks had only been seen in small numbers so far, but rumours had been circulating that the Germans had been producing these in high numbers. The Allies needed to work out some way of estimating how many of these tanks there were in existence in order to plan their strategy effectively.
Luckily for the Allies, the Germans serialised their tanks. With a few of these serial numbers from captured or destroyed Panzer V’s, the Allies went to work on a statistical investigation. They used a mostly frequentist approach to attack the problem, and did a pretty good job of it. For example. they predicted that 246 Panzer V’s were made in the month of February 1944 – the actual value was later discovered to be 245.
This problem was later revisited by statisticians, some of whom tried to approach the problem from a Bayesian standpoint. It turns out that depending on your specific scenario, methods based on the two distinct approaches can lead to wildly different results, especially if you do silly things like drawing conclusions based on a single tank. The rest of this post will look at a typical frequentist approach and a typical Bayesian approach to the problem, complete with the results that each approach achieve on some example data.
WARNING: MATHEMATICS INBOUND
A typical frequentist approach
Suppose we know the numbers of a few tanks. From this data set, we would like to guess the total number of tanks. As we saw before, the basis of the frequentist paradigm is to do calculations assuming that we have an infinitely long sequence of data points, and then try to adapt for the fact that we only have a finite sequence of results.
In this case we can’t have an infinite sequence of tank numbers, because we only have a finite number of tanks to choose from if we are sampling without replacement. Instead of this, we consider the case where our sequence is all existing tanks in some order. Here we can just pick the largest tank number, and it is guaranteed to be the total number of tanks. However, if we try to apply this strategy to a shorter sequence of tank numbers (uniformly sampled without replacement), there is a non-zero probability of underestimating the true number of tanks.
The maximum sequence value with which we try to estimate the true number of tanks is an example of what is (somewhat unsurprisingly) known as an estimator. If the expected value of the estimator (over all possible sampled sequences) is equal to the true value of the thing we are trying to estimate, the estimator is known as unbiased. This is often a desirable quantity for estimators. Sadly, this is not true of our “sequence maximum” estimator if our sequence doesn’t contain all tank numbers. Let’s see if we can do better.
Suppose that we have total tanks, and we take a sample of tanks sampled uniformly and without replacement from these . We note that the largest serial number sampled is if is in our sample and all other values in our sample are less than (note that this is only possible if ). There are different combinations of sample values that achieve this out of the possible combinations, and so the probability of this occurring is
This allows us to calculate the expected series maximum as
We note that , and use the fact that to rewrite this expression as
We therefore have . Since (we assume that our sequence doesn’t contain all tanks), this value is an underestimate for . However, with a little rearranging, we find that . We have just found an unbiased estimator for the total number of tanks.
Just having your estimator be unbiased is not sufficient for it to be a good estimator – the value given by taking a single sample value, doubling it, and then subtracting 1 is also an unbiased estimator for the total number of tanks. However, what we just derived above turns out to be the minimum-variance unbiased estimator for our sample data – out of all unbiased estimators based on our data it has the least variance. We would be reasonably well off using this.
An example with actual data
To demonstrate how this works with actual data, I imagined a scenario with 100 total tanks. I then sampled 10 tanks uniformly without replacement from these 100, and got the sequence [87,90,2,80,47,32,33,55,94,25]. The maximum number sampled here was 94, so the estimator we discussed above would evaluate to 94*(11/10)+1=102.4, which is not a bad estimate for the true tank total of 100.
A typical Bayesian approach
So now that we’ve spent so long developing our estimator-based model, let’s completely forget about it and start again from scratch. You might remember that the basis of the Bayesian paradigm relies on first choosing a set of beliefs, and then updating them by looking at sample data. This is still a little abstract, so let’s try to make it more concrete.
Suppose that we have the same situation as before: tanks total. We want to take a sample of size . Before are allowed to do anything we need to quantify our prior beliefs. We treat as a random variable and have a guess as to what its distribution might be – we will use this as our set of prior beliefs. What constitutes a sensible guess is highly context-specific, and I will come back to it in the example below.
Now that we have our prior, we can move onto the next step. We actually take our samples and try to make deductions using the new information we learn. I mentioned Bayes’ Rule in the last post – we will actually use it here. Bayes’ Rule states that for any events and , we have
This becomes immediately useful when we assign to be the event that a particular model correctly represents reality, and to be the event that we observe a particular set of observations (in particular, the set of observations we actually make). Then is the probability of this model being true given the observations we have made, and this is exactly the quantity we want.
If we then look at the right-hand side of the equation, we note that we know both (we assume this when we decide our prior) and (easy enough to calculate the probability of an event from a given model). So far we know how to get an expression for for any model in our prior. We can derive by just summing the over all in our prior.
We now have what is known as a posterior probability for each – a new probability value for each model in our prior, updated by considering the data we have. Collectively this set of probabilities is known as a posterior distribution – representing our new set of beliefs.
An example using actual data
Let’s try the above technique on the same data that we used in our example frequentist analysis. As discussed, we use to represent a random variable modelling our prior beliefs on total tank number. Since we are using the same set as before, we would like to set our sample size as 10. Therefore we know that (otherwise we can’t actually take samples), and we might assume that all tank totals are equally likely up to some high bound, say 1000, and that all totals greater than that are impossible (we will pick this prior largely for simplicity’s sake). This would be reflected in the following prior:
Now we start to use our sample data with its maximum value of 94. As we saw in the frequentist example, the probability of the maximum tank number from a sample of size being when taken from a total set of tanks is . We therefore have everything we need to generate our posterior – a prior and the likelihood of generating our data from each model in the prior. The actual calculations are tedious, but we eventually derive our posterior as
which is actually the result regardless of what you set your prior’s “upper bound” to (the value we set to 1000). We have successfully updated our beliefs based on the data, but how do we actually convert this into an estimate for the number of tanks? We want to pick the value that is the “average” of our distribution somehow, but this is not very well-defined, and how we decide to do it will affect the answer we get. In this case due to the skew and monotone decreasing probability mass function, we might consider neither the mode nor the mean to be a good representation of the “average”. The median is probably more reasonable here – if we choose this as our accepted estimate for , then we would guess , which is not a bad guess.
So it seems like the Bayesians have the closer result. Therefore they won, right? Well, not really. Our specific choice of Bayesian approach fared better than our specific frequentist approach, on this very specific data set. That is a far cry than out-and-out superiority. That isn’t to say that one school of thought doesn’t on average get better results than the other, but actually conclusively showing that is certainly beyond the scope of this piece (and also probably the combined powers of every statistician on Earth).
Let’s focus instead on particular pair of words from the last paragraph: “specific choice”. In both of the above approaches there choices that we made (e.g. the Bayesian prior, our choice of estimator). More crucially there were things that we didn’t consider but maybe should have (is there any biased estimator that would do a better job than our optimal unbiased estimator, is median really the best choice of “average”). The key takeaway here is that Bayesianism and frequentism aren’t concrete sets of instructions, but more like guiding principles which are used to build those instructions. This is why you won’t find a single mention of p-values or loss functions in my previous post – I feel like these are merely implementations of the higher-level ideas behind each school of thought.
Also p-values suck. But more about that another time