
What is Artificial Intelligence anyway?

Hmm, that’s quite the puzzler. Luckily, with the miracle of technology at my fingertips, I can employ a very sophisticated technique to attack this problem: go and read what some random people have written on Wikipedia.

If you go to the Wikipedia article on “Artificial Intelligence”, you will find a short section labelled “Definitions”, which contains the following:

Computer science defines AI research as the study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.

Wow, that’s impressive – as a (questionably) intelligent human, I don’t think even I can do that.

Suppose I have a difficult goal, say being able to run 100m in under 10 seconds. I don’t really know how to achieve it, so I go to my friend for advice and say “Hey, could you give me some advice on how to achieve my goal of running really fast”. And he’s like “Yeah sure, that’s easy, just take actions that maximise your chance of successfully obtaining your goal”.

Usain Bolt, rejoicing proudly in his ability to maximise the chance of achieving his goal.
Original Photo credit: J. Brichto
Reused under CC BY-SA 3.0 Licence

My friend is obviously right with his advice, but it is difficult to follow for two main reasons:

  • We can’t practically deal with all potential choices that we could make. In the above example there are so many different exercise plans, dietary plans, questionable plans to obtain radioactive superpowers, etc. that we can’t really consider all of them.
  • Given a single choice of action, it is difficult to know what the chance of me achieving my goal is – how likely is it that my toxic waste-based exploits will actually result in super-speed?
Trying to procure some radioacti…er, I mean training supplies.
Photo Credit: ShinRyu Forgers
Reused under CC BY-SA 4.0 Licence

Each of these two points is complex enough to receive many, many posts of its own (don’t worry, they’re on their way), but for now I will just outline a possible solution for each.

Reducing our choices is probably the more obvious one of the two for most people – we in fact do this every day, every single time we make a choice. Whenever I get dressed in the morning I don’t consider what will happen if I put my underpants on my head – I restrict my attention to more sensible choices. We can do something similar in more or less any context (to a varying degree of success depending on our domain-specific knowledge).

To address the second issue, we can use probability theory, which allows us to quantify the chance of an event happening. We can now directly compare choices by success probability. (I am skipping a heck of a lot of details here, all of which I will revisit in time).
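To make this slightly more concrete, here is a toy sketch in Python – every action and probability number below is invented purely for illustration – of “restrict attention to a few sensible choices, then compare them by success probability”:

```python
# Invented, pre-filtered choices with made-up success probabilities.
candidate_plans = {
    "structured sprint training": 0.02,
    "strength and diet overhaul": 0.015,
    "radioactive superpower scheme": 0.000001,
}

# Pick the plan that maximises the chance of achieving the goal.
best_plan = max(candidate_plans, key=candidate_plans.get)
print(f"Chosen plan: {best_plan} ({candidate_plans[best_plan]:.1%} chance of success)")
```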

If we can do these two things well, and are also able to convert all of this into a format that computers can understand, then we know how to do artificial intelligence. Wasn’t that easy?

Well, I guess this blog achieved its goal. Bye everyone. Go off and build your own world-dominating AI, I guess…

As you might have guessed, it isn’t quite that simple – we do now have a great way for computers to make “good choices”, but based on some really, really strong assumptions. We have assumed that the way in which we narrow down the choices we consider is “reasonable” in some sense, we have assumed that we know how to assign these “probability numbers” well, and we have assumed that we can “speak computer” well enough to instruct a computer to perform these tasks instead of us.

All three of these are very active areas of research today, so actually satisfying these assumptions is really, really hard. Nonetheless, we have at least broken down our original query into a slightly more concrete set of problems – consider this an introduction to the more technical side of AI. In time (and quite a large number of further posts), I will try to address each of these three problems in more detail, so stay tuned.

The German Tank Problem

The war for statistics in the real world

In the last post, I brought up the two prevailing schools of thought in modern statistics, and showed how each of them might tackle a hypothetical scenario. Although this was a toy problem, similar questions have been posed in a very real-world context. Consider the following…

The year was 1944, and the Allies were preparing for the D-Day landings. They planned to use Sherman tanks (made by the US) in the assault, since they performed well against the tanks that the Germans had used thus far in the war – the Panzer III and Panzer IV models. Recently the Germans had begun to use a new model – the Panzer V – with higher-velocity guns and heavier frontal armour, which would have been a worse match-up for the Sherman tanks. These new tanks had only been seen in small numbers so far, but rumours had been circulating that the Germans had been producing these in high numbers. The Allies needed to work out some way of estimating how many of these tanks there were in existence in order to plan their strategy effectively.

Luckily for the Allies, the Germans serialised their tanks. With a few of these serial numbers from captured or destroyed Panzer V’s, the Allies went to work on a statistical investigation. They used a mostly frequentist approach to attack the problem, and did a pretty good job of it. For example, they predicted that 246 Panzer V’s were made in the month of February 1944 – the actual value was later discovered to be 245.

This problem was later revisited by statisticians, some of whom tried to approach it from a Bayesian standpoint. It turns out that depending on your specific scenario, methods based on the two distinct approaches can lead to wildly different results, especially if you do silly things like drawing conclusions based on a single tank. The rest of this post will look at a typical frequentist approach and a typical Bayesian approach to the problem, complete with the results that each approach achieves on some example data.

WARNING: MATHEMATICS INBOUND

A typical frequentist approach

Suppose we know the serial numbers of a few tanks. From this data set, we would like to guess the total number of tanks. As we saw before, the basis of the frequentist paradigm is to do calculations assuming that we have an infinitely long sequence of data points, and then try to adapt for the fact that we only have a finite sequence of results.

In this case we can’t have an infinite sequence of tank numbers, because we only have a finite number of tanks to choose from if we are sampling without replacement. Instead of this, we consider the case where our sequence is all existing tanks in some order. Here we can just pick the largest tank number, and it is guaranteed to be the total number of tanks. However, if we try to apply this strategy to a shorter sequence of tank numbers (uniformly sampled without replacement), there is a non-zero probability of underestimating the true number of tanks.

The sequence maximum with which we try to estimate the true number of tanks is an example of what is (somewhat unsurprisingly) known as an estimator. If the expected value of the estimator (over all possible sampled sequences) is equal to the true value of the thing we are trying to estimate, the estimator is known as unbiased. This is often a desirable property for estimators. Sadly, it does not hold for our “sequence maximum” estimator if our sequence doesn’t contain all tank numbers. Let’s see if we can do better.

Suppose that we have N total tanks, and we take a sample of n tanks drawn uniformly and without replacement from these N. The largest serial number sampled is M precisely when M is in our sample and all other values in our sample are less than M (note that this is only possible if M\geq n). There are {M-1 \choose n-1} different combinations of sample values that achieve this out of the {N \choose n} possible combinations, and so the probability of this occurring is \frac{{M-1 \choose n-1}}{{N \choose n}}.

This allows us to calculate the expected sequence maximum as

\begin{aligned} \sum_{M=n}^{N}\frac{M\cdot {M-1 \choose n-1}}{{N \choose n}}=\sum_{M=n}^{N}\frac{n \cdot {M \choose n}}{{N \choose n}} \end{aligned}

We note that {n \choose n} = {n+1 \choose n+1}, and use the fact that {b \choose n} + {b \choose n+1}={b+1 \choose n+1}\ \forall b>n to rewrite this expression as \frac{n}{{N \choose n}}\cdot {N+1 \choose n+1} = (N+1)\cdot\frac{n}{n+1}
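For anyone who wants that telescoping spelled out, repeatedly combining the two leftmost terms gives

\begin{aligned} \sum_{M=n}^{N}{M \choose n} &= {n \choose n} + {n+1 \choose n} + \cdots + {N \choose n}\\ &= {n+1 \choose n+1} + {n+1 \choose n} + \cdots + {N \choose n}\\ &= {n+2 \choose n+1} + {n+2 \choose n} + \cdots + {N \choose n}\\ &\;\;\vdots\\ &= {N \choose n+1} + {N \choose n} = {N+1 \choose n+1} \end{aligned}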

We therefore have \mathbb{E}(M)=(N+1)\cdot\frac{n}{n+1}. Since n<N (we assume that our sequence doesn’t contain all tanks), this value is an underestimate for N. However, with a little rearranging, we find that \mathbb{E}(M(1+\frac{1}{n})-1)=N. We have just found an unbiased estimator for the total number of tanks.

Just having your estimator be unbiased is not sufficient for it to be a good estimator – the value given by taking a single sample value, doubling it, and then subtracting 1 is also an unbiased estimator for the total number of tanks. However, what we just derived above turns out to be the minimum-variance unbiased estimator for our sample data – out of all unbiased estimators based on our data it has the least variance. We would be reasonably well off using this.
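If you would like to see the variance gap for yourself rather than take my word for it, here is a small simulation sketch in Python (the tank count, sample size, and trial count are arbitrary choices of mine):

```python
import random
import statistics

# Compare two unbiased estimators of the total tank count N
# over many simulated samples of size n.
N, n, trials = 100, 10, 100_000
mvue_estimates, doubling_estimates = [], []

for _ in range(trials):
    sample = random.sample(range(1, N + 1), n)  # uniform, without replacement
    m = max(sample)
    mvue_estimates.append(m * (1 + 1 / n) - 1)    # M(1 + 1/n) - 1
    doubling_estimates.append(2 * sample[0] - 1)  # 2X - 1 from a single value

for name, estimates in [("MVUE", mvue_estimates), ("Doubling", doubling_estimates)]:
    print(f"{name}: mean {statistics.fmean(estimates):.2f}, "
          f"variance {statistics.variance(estimates):.2f}")
```

Both means should hover around the true value of 100, but the single-value “doubling” estimator’s variance is enormous by comparison.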

An example with actual data

To demonstrate how this works with actual data, I imagined a scenario with 100 total tanks. I then sampled 10 tanks uniformly without replacement from these 100, and got the sequence [87,90,2,80,47,32,33,55,94,25]. The maximum number sampled here was 94, so the estimator we discussed above would evaluate to 94*(11/10)-1=102.4, which is not a bad estimate for the true tank total of 100.
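As a quick sanity check, that calculation in Python:

```python
sample = [87, 90, 2, 80, 47, 32, 33, 55, 94, 25]
n, m = len(sample), max(sample)
print(round(m * (1 + 1 / n) - 1, 1))  # the unbiased estimator above: 102.4
```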

A typical Bayesian approach

So now that we’ve spent so long developing our estimator-based model, let’s completely forget about it and start again from scratch. You might remember that the basis of the Bayesian paradigm relies on first choosing a set of beliefs, and then updating them by looking at sample data. This is still a little abstract, so let’s try to make it more concrete.

Suppose that we have the same situation as before: N tanks total. We want to take a sample of size n. Before we are allowed to do anything, we need to quantify our prior beliefs. We treat N as a random variable and have a guess as to what its distribution might be – we will use this as our set of prior beliefs. What constitutes a sensible guess is highly context-specific, and I will come back to it in the example below.

Now that we have our prior, we can move onto the next step. We actually take our n samples and try to make deductions using the new information we learn. I mentioned Bayes’ Rule in the last post – we will actually use it here. Bayes’ Rule states that for any events A and B, we have

\begin{aligned} P(A|B)=\frac{P(B|A)P(A)}{P(B)} \end{aligned}

This becomes immediately useful when we assign A to be the event that a particular model correctly represents reality, and B to be the event that we observe a particular set of observations (in particular, the set of observations we actually make). Then P(A|B) is the probability of this model being true given the observations we have made, and this is exactly the quantity we want.

If we then look at the right-hand side of the equation, we note that we know both P(A) (we assume this when we decide our prior) and P(B|A) (it is easy enough to calculate the probability of a set of observations under a given model). This gives us the numerator P(B|A)P(A) for any model A in our prior. We can then derive P(B) by just summing P(B|A)P(A) over all A in our prior.

We now have what is known as a posterior probability for each A – a new probability value for each model in our prior, updated by considering the data we have. Collectively this set of probabilities is known as a posterior distribution – representing our new set of beliefs.
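In code, this whole updating recipe is only a few lines. Here is a minimal sketch for a finite set of models, where `prior` maps each model to P(A) and `likelihood(model)` computes P(B|A) (both are placeholders to be filled in for a concrete problem):

```python
def bayes_update(prior, likelihood):
    """Turn a discrete prior {model: P(A)} into a posterior {model: P(A|B)}."""
    # Numerators of Bayes' Rule: P(B|A) * P(A) for each model A.
    unnormalised = {model: p * likelihood(model) for model, p in prior.items()}
    # P(B) is the sum of those numerators over all models.
    evidence = sum(unnormalised.values())
    return {model: value / evidence for model, value in unnormalised.items()}
```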

An example using actual data

Let’s try the above technique on the same data that we used in our example frequentist analysis. As discussed, we use N to represent a random variable modelling our prior beliefs on the total tank number. Since we are using the same data set as before, our sample size n is 10. Therefore we know that N\geq 10 (otherwise we couldn’t actually take n samples), and we might assume that all tank totals are equally likely up to some high bound, say 1000, and that all totals greater than that are impossible (we pick this prior largely for simplicity’s sake). This is reflected in the following prior:

\begin{aligned} P(N=K)= \begin{cases} \frac{1}{990}\text{ if }10\leq K<1000\\ 0\text{ else} \end{cases} \end{aligned}

Now we start to use our sample data with its maximum value of 94. As we saw in the frequentist example, the probability of the maximum tank number from a sample of size n being M when taken from a total set of N tanks is \frac{{M-1 \choose n-1}}{{N \choose n}}. We therefore have everything we need to generate our posterior – a prior and the likelihood of generating our data from each model in the prior. The actual calculations are tedious, but we eventually derive our posterior as

\begin{aligned} P(N=K)= \begin{cases} \frac{9\cdot{93 \choose 9}}{10\cdot{K \choose 10}}\text{ if }K\geq 94\\ 0\text{ else} \end{cases} \end{aligned}

which is (up to a tiny truncation effect) the result regardless of what you set your prior’s “upper bound” to (the value we set to 1000). We have successfully updated our beliefs based on the data, but how do we actually convert this into an estimate for the number of tanks? We want to pick the value that is the “average” of our distribution somehow, but this is not very well-defined, and how we decide to do it will affect the answer we get. In this case, due to the skew and the monotonically decreasing probability mass function, we might consider neither the mode nor the mean to be a good representation of the “average”. The median is probably more reasonable here – if we choose it as our accepted estimate for N, then we would guess N=101, which is not a bad guess.
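For the sceptical, here is the whole calculation done numerically in Python, reusing the sample-maximum likelihood from the frequentist section (the upper bound of 1000 comes from the prior above):

```python
from math import comb

n, m, upper_bound = 10, 94, 1000

# Uniform prior over all feasible tank totals, as above.
prior = {K: 1.0 for K in range(n, upper_bound)}

# P(sample maximum = 94 | N = K) = C(93, 9) / C(K, 10), zero when K < 94.
def likelihood(K):
    return comb(m - 1, n - 1) / comb(K, n) if K >= m else 0.0

# Bayes' Rule: multiply prior by likelihood, then normalise.
posterior = {K: p * likelihood(K) for K, p in prior.items()}
evidence = sum(posterior.values())
posterior = {K: v / evidence for K, v in posterior.items()}

# Walk up the cumulative distribution to find the median.
cumulative = 0.0
for K in sorted(posterior):
    cumulative += posterior[K]
    if cumulative >= 0.5:
        print("Posterior median:", K)  # prints 101 on this data
        break
```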

The verdict

So it seems like the Bayesians have the closer result. Therefore they won, right? Well, not really. Our specific choice of Bayesian approach fared better than our specific frequentist approach, on this very specific data set. That is a far cry from out-and-out superiority. That isn’t to say that one school of thought doesn’t on average get better results than the other, but actually conclusively showing that is certainly beyond the scope of this piece (and also probably the combined powers of every statistician on Earth).

Let’s focus instead on a particular pair of words from the last paragraph: “specific choice”. In both of the above approaches there were choices that we made (e.g. the Bayesian prior, our choice of estimator). More crucially, there were things that we didn’t consider but maybe should have (is there any biased estimator that would do a better job than our optimal unbiased estimator? is the median really the best choice of “average”?). The key takeaway here is that Bayesianism and frequentism aren’t concrete sets of instructions, but more like guiding principles which are used to build those instructions. This is why you won’t find a single mention of p-values or loss functions in my previous post – I feel like these are merely implementations of the higher-level ideas behind each school of thought.

Also, p-values suck. But more about that another time.

The war for statistics

My take on Bayesian and frequentist reasoning

Probability is confusing. It didn’t really make much sense to me (why are random variables neither random nor variables?) until I took my first course in measure theory, which really helped to put things into perspective.

Statistics is even more confusing, adding real-world aspects onto the back of theoretical probability. I hoped that there would be some sort of deeper picture here that would straighten things out for me, in much the same way that measure theory did for probability.

It turns out that there are two. And they don’t like each other very much.

Here is my attempt to decode frequentist and Bayesian mechanics:

Frequentism

Suppose you have a coin on a table in front of you. I tell you that the coin might be biased, and then ask you to tell me what the probability of heads is. You have never seen this coin before, but since it’s just sitting there, you might decide to flip the coin a few times. At this point, it seems reasonable to guess that the probability of heads is just the proportion of times that heads turned up.

Do Mario coins even have heads and tails? I wonder what high school stats classes look like in the Mushroom Kingdom…

This is a very simple example of frequentist reasoning – the idea that we can get information about a statistical process by sampling it a whole bunch of times (we make decisions based on the frequency of observed events). This is not justified in every case, but it is accurate quite a lot of the time (including any case where we have independent trials, like above), and it is backed by quite a lot of complex mathematics. If you were to go through all the calculations, you would find that as the number of samples you take increases, the sample proportion of heads converges almost surely to the actual probability of heads.
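A quick simulation sketch shows this convergence in action (the coin’s true bias of 0.7 is an arbitrary choice of mine):

```python
import random

true_p = 0.7  # the coin's actual (hidden) probability of heads
heads = 0
for flips in range(1, 100_001):
    heads += random.random() < true_p  # one coin flip
    if flips in (10, 100, 1_000, 10_000, 100_000):
        print(f"{flips:>6} flips: sample proportion = {heads / flips:.4f}")
```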

Sadly, it is not possible to flip the coin infinitely many times. Tweaking the above approach to give useful answers with only a finite (but usually large) number of coin flips is largely case-specific, and is the job of the frequentist statistician.

Bayesianism

We have seen how a frequentist might handle the scenario above. Now we will put a Bayesian in the same scenario and see what they might do.

Reverend Thomas Bayes, a random British country pastor in the 1700s who discovered arguably the most useful single result in all of statistics, now known as Bayes’ Theorem.

The crux of Bayesian methods is what is called a “Bayesian prior” – a set of existing beliefs that we lay out before performing statistical trials. For example, maybe we believe strongly that our coin is fair, and our degree of belief in our coin having probability of heads p decreases as p strays from 1/2. We then sample – this could be a single coin flip or several – and update our beliefs based on what we observe.

Suppose that we do a bunch of coin flips and get a long sequence of heads. We would then update our beliefs to reflect this – we believe more strongly that p is larger than half, and the larger it is, the more strongly we believe. We still don’t have a direct answer to the question we were asked, but we do have a representation for our degree of belief that the probability of heads is p, for each value of p. Based on this there are a few sensible answers we could give – pick the value that we believe most in, take an average weighted by degree of belief, etc.
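Here is a minimal sketch of that updating process in Python. The discrete grid of candidate values for p and the triangular prior are my own illustrative choices – real Bayesian analyses would usually use a continuous prior like a Beta distribution:

```python
# Candidate values for p (the probability of heads) on a discrete grid.
grid = [i / 100 for i in range(101)]

# Prior: strongest belief at p = 1/2, falling off linearly towards 0 and 1.
prior = [1 - 2 * abs(p - 0.5) for p in grid]
total = sum(prior)
prior = [w / total for w in prior]

# Observe a run of heads, and reweight each candidate p by its likelihood.
heads, tails = 7, 0
posterior = [w * p**heads * (1 - p)**tails for w, p in zip(prior, grid)]
total = sum(posterior)
posterior = [w / total for w in posterior]

# Two sensible single-number answers from our new beliefs:
most_believed = grid[posterior.index(max(posterior))]
belief_weighted = sum(p * w for p, w in zip(grid, posterior))
print(f"most-believed p: {most_believed}, belief-weighted average: {belief_weighted:.3f}")
```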

Conclusion

As you might have noticed, neither of these descriptions is complete – how does the frequentist approximate an infinite sequence of trials with only finitely many, and how does the Bayesian choose their prior? – but I think this gives a good idea of the philosophy behind each approach.