In this post I am going to introduce a pretty famous riddle, made popular recently by the police sitcom Brooklyn Nine-Nine as well as the idea of the entropy of a probability distribution, made popular by Claude Shannon. Then I am going to go through a solution that is presented in Information Theory, Inference, and Learning Algorithms (2), a brilliant book on the topic by the late David MacKay, as well as some intuitions from his lecture series on the topic. Hopefully, by the end of it, you will be familiar with another property of a probability distribution and be able to impress your friends with your riddle-solving abilities.

The Riddle

The riddle is presented by Captain Holt (pictured above) to his team of detectives as follows (1):

‘There are 12 men on an island, 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier: you must figure which.* The island has no scales, but there is a see-saw. The exciting catch? You can only use it three times.’**

So this is a pretty easy problem to state but it is quite tricky to solve. There are a number of solutions online- some of which are faster to read than this one, but I think this is a good way to introduce informational entropy and help to solve similar problems in the future.

* to stay true to the episode: the islanders may look like Tyrese, if you so wish

**there is no promise of Beyoncé tickets in this post

Informational entropy


Claude Shannon

Now for some maths. I am not going to introduce more notation than is necessary, but we will need some basics. An ensembleX, is a triple (x, \mathcal{P},\mathcal{A}) where is the value that a random variable can take on, coming from \mathcal{A} and with each of the values having a probability in \mathcal{P}.

Then the informational entropy of the ensemble is as follows:

H(X) = \displaystyle \sum_{i} p_i \log_2(\frac{1}{p_i})

Note that this form of this is as an expectation, specifically: \mathbb{E} [\frac{1}{p}].

This can be shown to be the correct way to measure the information content of this ensemble. If we think about the ensemble as being an experiment, where we know all the possible outcomes and probabilities thereof, then we can design experiments that give us as much information as possible. This will be the principle that is followed by the optimal solution below.

A way to think about entropy is: on average, how surprised do we expect to be when seeing a realisation of the random variable.

To get a little intuition about how this works we can consider three ensembles. In each case there will be a random variable taking on values (x_1,x_2,x_3) with probabilities (p_1,p_2,p_3). We will vary the probabilities to illustrate how changes in these can lead to different levels of information. The three probability tuples that we will consider are (1,0,0), (0.5,0.25,0.25) and (\frac{1}{3},\frac{1}{3},\frac{1}{3}).

Their respective entropies are:

1\log_2(1) + 0 + 0 = 0

0.5\log_2(2) + 0.25\log_2(4) + 0.25\log_2(4) = 1.5

\frac{1}{3}\log_2(3) +\frac{1}{3}\log_2(3) + \frac{1}{3}\log_2(3) = 1.58

Note how the value of the probability measure increases as the probability of the outcomes become closer to one another. This aligns with our intuition on feeling surprised: if one outcome is certain there is no surprise, if all outcomes are equally likely then we usually be pretty surprised. When we design experiments it is important to be as ‘surprised’ as possible by the outcome.

To take from MacKay:

the outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform.

An information-theoretic approach to the solution

Now that we have some intuition on what a good solution might include we can think about the solution itself.

We start out by enumerating the possibilities: there are 12 islanders (numbered 1-12 based on how closely they resemble Tyrese) and each can be heavier or lighter, so that leaves us with 24 possible situations. We represent these as {1^+,2^+,3^+,4^+,5^+,6^+,7^+,8^+,9^+,10^+,11^+,12^+,1^-,2^-,3^-,4^-,5^-,6^-,7^-,8^-,9^-,10^-,11^-,12^-} with the superscripts indicating whether each islander is heavier or lighter. The outcomes of each experiment can be ‘left side heavier’: L, ‘balances’: B, ‘right side heavier’: R. At the end of each use of the scale we make a note about which possibilities we have not eliminated and the probabilities of each of the outcomes of the experiment. Let’s do it:

First Use:

Weigh islanders 1,2,3,4 against 5,6,7,8.

L: means that 1,2,3,4 weigh more than 5,6,7,8, which leaves us with: {1^+,2^+,3^+,4^+,5^-,6^-,7^-,8^-}. So there are 8 possibilities that could be true here so the probability of this outcome is \frac{8}{24} = \frac{1}{3}

R: means that 1,2,3,4 weigh less than 5,6,7,8. This, similarly, leaves us with: {1^-,2^-,3^-,4^-,5^+,6^+,7^+,8^+} and also has a probability of \frac{8}{24} = \frac{1}{3}.

B: means that 1,2,3,4 weigh the same as 5,6,7,8. This tells us that the islander with a different weight is 9,10,11 or 12, but we still do not know if he is heavier or lighter. So this outcome happens when we have: {9^+,10^+,11^+,12^+,9^-,10^-,11^-,12^-}, which, of course, also has a probability of \frac{8}{24} = \frac{1}{3}.

The entropy of this ensemble is clearly just the same as our last example above: 1.58.

Second use:

For the second use we now have 3 possible cases. This makes it a bit more work. But if we notice that observing L and R in the first use give us the same information about different islanders then we can reduce it somewhat. When we use the scale for the second time we have also only got to consider the possibilities that have not been eliminated by the first use, which we can refer to above.

Case 1: First use was L

We weigh 1,2,6 against 3,4,5

L: means that 1,2,6 weigh more than 3,4,5 which leaves us with: {1^+,2^+,5^-}. So there are 3 possibilities that could be true here (out of the 8 that we started with) so the probability of this outcome is \frac{3}{8}

R: means that 1,2,6 weigh less than 3,4,5. This, similarly, leaves us with: {3^+,4^+6^-}.and also has a probability of \frac{3}{8}.

B: means that 1,2,6 weigh the same as 3,4,5. This tells us that the islander with a different weight is 7 or 8 So this outcome happens when we have: {7^-,8^-}, which has a probability of \frac{2}{8}.

The entropy here is: 1.56

Case 2: First use was B:

We weigh 9,10,11 against 1,2,3

L: means that 9,10,11 weigh more than 1,2,3 which leaves us with: {9^+,10^+,11^+}. So there are 3 possibilities that could be true here (out of the 8 that we started with) so the probability of this outcome is \frac{3}{8}

R: means that 9,10,11 weigh less than 1,2,3. This leaves us with: {9^-,10^-,11^-}.and also has a probability of \frac{3}{8}.

B: means that 9,10,11 weigh the same as 1,2,3. This tells us that the islander with a different weight is 12 and could be heavier or lighter. So this outcome happens when we have: {12^+,12^-}, which has a probability of \frac{2}{8}

The entropy here is: 1.56

Case 3: First use was R.

We work in a similar fashion to L. (The full method is shown in the picture below, also taken from (2))Screenshot 2019-09-10 at 01.20.38.

Third use:

By this stage it is a lot easier to see what needs to be done. For any of our situations with 3 remaining options, we can compare any 2: if the weights differ we refer to the remaining possibilities and infer the answer, if not then the islander with a different weight is the one not being weighed on the third use. For the case when we are left with  {12^+,12^-}, we simply weigh 12 against any of the other islanders and we are done.

So that’s it- we find our odd islander in 3 uses of the see-saw!

What about the entropy?

So what have we been calculating the entropy for? The reason for showing the entropy at each step was to demonstrate that a) it is the highest possible entropy for each experiment and b) to show that we were actively thinking about maximising the information that we got from each experiment.

What about the alternatives?

As I noted before, there is more than one way of solving this, but here I will deal with some of the common initial ideas:

Weigh 6 islanders against the other 6. This can only result in one of the 2 sides being heavier, each with probability 0.5. This gives an entropy of 1, which is lower than what we observed with our chosen path.

What about the second use, Case 1? What if we had weighed 1,2,3 against 5,6,7 ? Then we would have observed outcome L with probability 0.75 and outcome B with probability 0.25 and we would never observe outcome R. The entropy here is 0.81.

In the above two uses of the scale we have already eliminated one outcome before we weigh the islanders. If we already know something about what will happen before we use the scale, it is a good idea to think about whether we are getting as much information as possible.

This tells us that solutions that do not work tend to have a lower entropy. Furthermore, if we look at the approach that we did use as well as what we know about maximising entropy, we can see that the first two uses of the scale involved experiments that had outcomes that were as close to being uniformly distributed as possible.



We now know how to measure the entropy of an ensemble and have shown how interpreting this entropy as information content can help us to ask questions/design experiments that will tell us more about the world.


  1. https://www.youtube.com/watch?v=Cs-TGLxQfBM
  2. Information Theory, Inference, and Learning Algorithms; David MacKay
How clear is this post?