About Dean Bunce

Data Scientist and Actuarial Analyst. University of Cape Town: BBusSci Actuarial Science University College London: MSc Computational Statistics and Machine Learning I am interested in understanding and sharing mathematical ideas, especially relating to probability and statistics.

Parrondos Paradox

Introduction

In this post we will have a look at Parrondos paradox. In a paper* entitled “Information Entropy and Parrondo’s Discrete-Time Ratchet”** the authors demonstrate a situation where, by switching between 2 losing strategies, we can create a winning strategy.

Setup

The setup to this paradox is as follows:

We have 2 games that we can play – if we win we get 1 unit of wealth, if we lose, it costs 1 unit of wealth. Game A gives us a payout of 1 with a probability of slightly less than 0.5. Clearly if we play this game for long enough we will end up losing.

Game B is a little more complicated in that it is defined with reference to our existing winnings. If our current level of wealth is a multiple of M we play a game where the probability of winning is slightly less than 0.1. If it is not a multiple of M, the probability of winning is slightly less than 0.75.…

By | November 11th, 2020|Uncategorized|0 Comments

Basic Reverse Image Search Using an Autoencoder

Introduction

In this post we are going to create a simple reverse image search on the MNIST handwritten image dataset. That is to say, given any image, we want to return images that look most similar to it. To do this, we will use an autoencoder, trained using Tensorflow 2.

The dataset

The MNIST dataset is a commonly-used dataset in machine learning comprised of 28-by-28 images of handwritten digits between 0 and 9. For our purposes we would be interested in our image searcher returning images of the same number as the query images, i.e. if we input a 3 we want the images returned to all be 3s. However, if we had, say, four 3s and one 2 that mightn’t be too bad, considering how 2 and 3 look a bit similar. However, if we had three 3s, one 1 and a 7 we might say that the performance is not up to standard.…

By | October 21st, 2020|Uncategorized|0 Comments

A simple introduction to causal inference

 

Introduction

Causal inference is a branch of Statistics that is increasing in popularity. This is because it allows us to answer questions in a more direct way than do other methods. Usually, we can make inference about association or correlation between a variable and an outcome of interest, but these are often subject to outside influences and may not help us answer the questions in which we are most interested.

Causal inference seeks to remedy this by measuring the effect on the outcome (or response variable) that we see when we change another variable (the ‘treatment’). In a sense, we are looking to reproduce the situation that we have when we do an designed experiment (with a ‘treated’ and a ‘control’ group). The goal here is to have groups that are otherwise the same (with regard to factors that might influence the outcome) but where one is ‘treated’ and the other is not.…

By | August 20th, 2020|English, Uncategorized|0 Comments

Correlation vs Mutual Information

This post is based on a (very small) part of the (dense and technical) paper Fooled by Correlation by N.N. Taleb, found at (1)

Notes on the main ideas in this post are available from Universidad de Cantabria, found at (2)

The aims of this post are to 1) introduce mutual information as a measure of similarity and 2) to show the nonlinear relationship between correlation and information my means of a relatively simple example

Introduction

A significant part of Statistical analysis is understanding how random variables are related – how much knowledge about the value of one variable tells us about the value of another. This post will consider this issue in the context of Gaussian random variables. More specifically, we will compare- and discuss the relationship between- correlation and mutual information.

Mutual Information

The Mutual Information between 2 random variables is the amount of information that one gains about a random variable by observing the value of the other.…

By | March 28th, 2020|English, Level: intermediate, Uncategorized|0 Comments

The Objective Function

In both Supervised and Unsupervised machine learning, most algorithms are centered around minimising (or, equivalently) maximising some objective function. This function is supposed to somehow represent what the model knows/can get right. Normally, as one would expect, the objective function does not always reflect exactly what we want.

The objective function presents 2 main problems: 1. how do we minimise it (the answer to this is up for debate and there is lots of interesting research about efficient optimisation of non-convex functions and 2) assuming we can minimise it perfectly, is it the correct thing to be minimising?

It is point 2 which is the focus of this post.

Let’s take the example of square-loss-linear-regression. To do so we train a linear regression model with a square loss \mathcal{L}(\mathbf{w})=\sum_i (y_i - \mathbf{w}^Tx_i)^2. (Where we are taking the inner product of learned weights with a vector of features for each observation to predict the outcome).…

By | February 20th, 2020|Level: Simple, Uncategorized|0 Comments

Simpson’s Paradox

Introduction

A key consideration when analysing stratified data is how the behaviour of each category differs and how these differences might influence the overall observations about the data. For example, a data set might be split into one large category that dictates the overall behaviour or there may be a category with statistics that are significantly different from the other categories that skews the overall numbers. These features of the data are important to be aware of and go find to prevent drawing erroneous conclusions from your analysis. Context, the source of the data and a careful analysis of the data can prevent this. Simpson’s paradox is an interesting result of some of these effects.

The Paradox

Simpson’s paradox is observed in statistics when a trend is observed in a number of different groups but it is not observed in the overall data or the opposite trend is observed.

Observing the overall data might therefore lead us to draw a conclusion, but when the data is grouped we might conclude something different.…

By | January 5th, 2020|English, Level: Simple|1 Comment

The Wisdom of the Crowds

This content comes primarily from the notes of Mark Herbster (contributed to by Massi Pontil and John Shawe-Taylor) of University College London.

Introduction

The Wisdom of the Crowds, or majority rule and related ideas tend to come up pretty often. Democracy is based (partly) on the majority of people being able to make the correct decision, often you might make decisions in a group of friends based on what the most people want, and it is logical to take into account popular opinion when reasoning on issues where you have imperfect information. On the other hand, of course, there is the Argumentum ad Populum fallacy which states that a popular belief isn’t necessarily true.

This is idea appears also in Applied Machine Learning – ensemble methods such as Random Forests, Gradient Boosted Models (especially XGBoost) and stacking of Neural Networks have resulted in overall more powerful models. This is especially notable in Kaggle competitions, where it is almost always an ensemble model (combination of models) that achieves the best score.…

By | November 15th, 2019|Uncategorized|0 Comments

Automatic Differentiation

Much of this content is based on lecture slides from slides from Professor David Barber at University College London: resources relating to this can be found at: www.cs.ucl.ac.uk/staff/D.Barber/brml

What is Autodiff?

Autodiff, or Automatic Differentiation, is a method of determining the exact derivative of a function with respect to its inputs. It is widely used in machine learning- in this post I will give an overview of what autodiff is and why it is a useful tool.

The above is not a very helpful definition, so we can compare autodiff first to symbolic differentiation and numerical approximations before going into how it works.

Symbolic differentiation is what we do when we calculate derivatives when we do it by hand, i.e. given a function f, we find a new function f'. This is really good when we want to know how functions behave across all inputs. For example if we had f(x) = x^2 + 3x + 1 we can find the derivative as f'(x) = 2x + 3 and then we can find the derivative of the function for all values of x.…

By | October 23rd, 2019|English, Uncategorized|0 Comments

Captain Raymond Holt vs Claude Shannon

Overview

In this post I am going to introduce a pretty famous riddle, made popular recently by the police sitcom Brooklyn Nine-Nine as well as the idea of the entropy of a probability distribution, made popular by Claude Shannon. Then I am going to go through a solution that is presented in Information Theory, Inference, and Learning Algorithms (2), a brilliant book on the topic by the late David MacKay, as well as some intuitions from his lecture series on the topic. Hopefully, by the end of it, you will be familiar with another property of a probability distribution and be able to impress your friends with your riddle-solving abilities.

The Riddle

The riddle is presented by Captain Holt (pictured above) to his team of detectives as follows (1):

‘There are 12 men on an island, 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier: you must figure which.* The island has no scales, but there is a see-saw.

By | October 23rd, 2019|English, Fun|0 Comments

What did you expect? Some notes on the Expectation operator.

Introduction

A significant amount of focus in statistics is on making inference about the averages or means of phenomena. For example, we might be interested in the average number of goals scored per game by a football team, or the average global temperature or the average cost of a house in a particular area.

The two types of averages that we usually focus on are the sample mean from a set of data and the expectation that comes from a probability distribution. For example if three men weigh 70kg, 80kg, and 90kg respectively then the sample mean of their weight is \bar x = \frac{70+80+90}{3} = 80. Alternatively, we might say that the arrival times of trains are exponentially distributed with parameter \lambda = 3 we can use the properties of the exponential distribution to find the mean (or expectation). In this case the mean is \mu =  \frac{1}{\lambda} = \frac{1}{3}.

It is this second kind of mean (which we will call the expectation from now on), along with the generalisation of taking the expectation of functions of random variables that we will focus on.…

By | October 9th, 2019|English, Level: Simple, Uncategorized|0 Comments