Data Scientist and Actuarial Analyst. University of Cape Town: BBusSci Actuarial Science. University College London: MSc Computational Statistics and Machine Learning. I am interested in understanding and sharing mathematical ideas, especially relating to probability and statistics.

## A simple introduction to causal inference

Introduction

Causal inference is a branch of statistics that is increasing in popularity. This is because it allows us to answer questions in a more direct way than other methods do. Usually, we can make inferences about the association or correlation between a variable and an outcome of interest, but these are often subject to outside influences and may not help us answer the questions in which we are most interested.

Causal inference seeks to remedy this by measuring the effect on the outcome (or response variable) that we see when we change another variable (the ‘treatment’). In a sense, we are looking to reproduce the situation that we have when we run a designed experiment (with a ‘treated’ and a ‘control’ group). The goal here is to have groups that are otherwise the same (with regard to factors that might influence the outcome) but where one is ‘treated’ and the other is not.…
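To make the contrast between association and a designed experiment concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the binary confounder `z`, the treatment probabilities, the true effect of 2.0 and the function names are all hypothetical and are not taken from the post.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed causal effect of the treatment on the outcome


def simulate(n, randomised, rng):
    """Return (treated, outcome) pairs for n units."""
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5  # hypothetical confounder
        if randomised:
            p_treat = 0.5  # designed experiment: treatment ignores z
        else:
            p_treat = 0.8 if z else 0.2  # observational: z drives treatment
        t = rng.random() < p_treat
        y = TRUE_EFFECT * t + 3.0 * z + rng.gauss(0, 1)
        rows.append((t, y))
    return rows


def difference_in_means(rows):
    """Naive comparison: mean outcome of treated minus mean of control."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return mean(treated) - mean(control)


rng = random.Random(0)
observational = difference_in_means(simulate(50_000, False, rng))
experimental = difference_in_means(simulate(50_000, True, rng))
print(f"observational estimate: {observational:.2f}")
print(f"randomised estimate:    {experimental:.2f}")
```

In this sketch the observational comparison overstates the effect because the confounder makes the two groups differ in more than just the treatment, while the randomised comparison lands close to the assumed true effect.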

## Correlation vs Mutual Information

This post is based on a (very small) part of the (dense and technical) paper Fooled by Correlation by N.N. Taleb, found at (1)

Notes on the main ideas in this post are available from Universidad de Cantabria, found at (2)

The aims of this post are 1) to introduce mutual information as a measure of similarity and 2) to show the nonlinear relationship between correlation and information by means of a relatively simple example.

Introduction

A significant part of statistical analysis is understanding how random variables are related: how much knowledge about the value of one variable tells us about the value of another. This post will consider this issue in the context of Gaussian random variables. More specifically, we will compare, and discuss the relationship between, correlation and mutual information.

Mutual Information

The Mutual Information between 2 random variables is the amount of information that one gains about a random variable by observing the value of the other.…
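As a rough illustration of the nonlinear relationship in the Gaussian case, the sketch below uses the standard closed form for the mutual information of a bivariate Gaussian with correlation $\rho$, namely $I(X;Y) = -\frac{1}{2}\ln(1-\rho^2)$ (in nats). The function name is my own, and the excerpt above does not show which derivation the post itself uses.

```python
import math


def gaussian_mutual_information(rho):
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho ** 2)


# Tabulate how information grows as correlation increases.
for rho in (0.0, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"rho = {rho:4.2f}  ->  I = {gaussian_mutual_information(rho):.4f} nats")
```

Doubling the correlation more than doubles the information in this formula, which is one way to see that the two measures are not proportional.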

## The Objective Function

In both supervised and unsupervised machine learning, most algorithms are centred around minimising (or, equivalently, maximising) some objective function. This function is supposed to somehow represent what the model knows or can get right. In practice, as one might expect, the objective function does not always reflect exactly what we want.

The objective function presents 2 main problems: 1) how do we minimise it (the answer to this is up for debate, and there is lots of interesting research about efficient optimisation of non-convex functions) and 2) assuming we can minimise it perfectly, is it the correct thing to be minimising?

It is point 2 which is the focus of this post.

Let’s take the example of square-loss-linear-regression. To do so we train a linear regression model with a square loss $\mathcal{L}(\mathbf{w})=\sum_i (y_i - \mathbf{w}^Tx_i)^2$. (Where we are taking the inner product of learned weights with a vector of features for each observation to predict the outcome).…
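For readers who want to see the square loss in action, here is a minimal sketch with two weights (an intercept and a slope). The data values and function names are hypothetical, and the fit solves the normal equations directly, which is only one of several ways to minimise this loss.

```python
def square_loss(w, X, y):
    """Sum over observations of (y_i - w^T x_i)^2."""
    return sum(
        (yi - sum(wj * xj for wj, xj in zip(w, xi))) ** 2
        for xi, yi in zip(X, y)
    )


def fit_two_weights(X, y):
    """Minimise the square loss for 2 features by solving (X^T X) w = X^T y."""
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    p = sum(x[0] * yi for x, yi in zip(X, y))
    q = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)


# Hypothetical data: each x_i is (1, feature), so w[0] acts as an intercept.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y = [1.1, 2.9, 5.2, 6.8]

w_hat = fit_two_weights(X, y)
print("fitted weights:", w_hat)
print("loss at fitted weights:", square_loss(w_hat, X, y))
print("loss at nearby weights:", square_loss((w_hat[0] + 0.1, w_hat[1]), X, y))
```

Moving the weights away from the fitted values increases the loss, which is all that minimising the objective guarantees; whether that loss is the right thing to minimise is a separate question.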

Introduction

A key consideration when analysing stratified data is how the behaviour of each category differs and how these differences might influence the overall observations about the data. For example, a data set might contain one large category that dictates the overall behaviour, or there may be a category whose statistics differ significantly from the others and skew the overall numbers. It is important to be aware of these features of the data, and to look for them, to avoid drawing erroneous conclusions from your analysis. Context, the source of the data and a careful analysis of the data can prevent this. Simpson’s paradox is an interesting result of some of these effects.

Simpson’s paradox occurs when a trend appears in each of several groups but disappears, or even reverses, when the groups are combined.

Observing the overall data might therefore lead us to draw a conclusion, but when the data is grouped we might conclude something different.…
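A small numerical sketch can make the reversal concrete. The counts below are hypothetical and chosen purely for illustration, as are the group and treatment labels.

```python
# Hypothetical (successes, trials) for two treatments within two groups.
counts = {
    "group 1": {"A": (81, 87), "B": (234, 270)},
    "group 2": {"A": (192, 263), "B": (55, 80)},
}


def group_rate(group, treatment):
    """Success rate of one treatment within one group."""
    successes, trials = counts[group][treatment]
    return successes / trials


def overall_rate(treatment):
    """Success rate of one treatment after pooling all groups."""
    successes = sum(counts[g][treatment][0] for g in counts)
    trials = sum(counts[g][treatment][1] for g in counts)
    return successes / trials


for group in counts:
    print(group, {t: round(group_rate(group, t), 3) for t in ("A", "B")})
print("overall", {t: round(overall_rate(t), 3) for t in ("A", "B")})
```

Treatment A has the higher success rate inside each group, yet B has the higher rate once the groups are pooled, because the two treatments were applied to the groups in very different proportions.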

## The Wisdom of the Crowds

This content comes primarily from the notes of Mark Herbster (contributed to by Massi Pontil and John Shawe-Taylor) of University College London.

Introduction

The Wisdom of the Crowds, or majority rule, and related ideas come up pretty often. Democracy is based (partly) on the majority of people being able to make the correct decision; you might often make decisions in a group of friends based on what most people want; and it is logical to take popular opinion into account when reasoning about issues where you have imperfect information. On the other hand, of course, there is the Argumentum ad Populum fallacy, which reminds us that a popular belief isn’t necessarily true.

This idea also appears in applied machine learning: ensemble methods such as Random Forests, Gradient Boosted Models (especially XGBoost) and stacking of Neural Networks have resulted in overall more powerful models. This is especially notable in Kaggle competitions, where it is almost always an ensemble model (a combination of models) that achieves the best score.…
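One simple way to see why combining many imperfect voters can help is the binomial calculation below. It assumes that the voters are independent and that each is correct with the same probability `p`; both are strong simplifying assumptions of this sketch, and the function name is hypothetical.

```python
from math import comb


def majority_correct_probability(n, p):
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (n assumed odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (1, 3, 11, 101):
    print(f"{n:3d} voters -> {majority_correct_probability(n, 0.6):.4f}")
```

With `p` above one half the majority becomes more reliable as the crowd grows; with `p` below one half the same formula shows the crowd becoming more reliably wrong.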

## Automatic Differentiation

Much of this content is based on lecture slides from Professor David Barber at University College London; resources relating to this can be found at: www.cs.ucl.ac.uk/staff/D.Barber/brml

What is Autodiff?

Autodiff, or Automatic Differentiation, is a method of determining the exact derivative of a function with respect to its inputs. It is widely used in machine learning. In this post I will give an overview of what autodiff is and why it is a useful tool.

The above is not a very helpful definition, so we can compare autodiff first to symbolic differentiation and numerical approximations before going into how it works.

Symbolic differentiation is what we do when we calculate derivatives by hand, i.e. given a function $f$, we find a new function $f'$. This is really good when we want to know how functions behave across all inputs. For example, if we had $f(x) = x^2 + 3x + 1$, we can find the derivative as $f'(x) = 2x + 3$ and then we can evaluate the derivative of the function for any value of $x$.…
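To compare the three approaches on the example $f(x) = x^2 + 3x + 1$, here is a minimal sketch. The `Dual` class is a hypothetical toy implementation of forward-mode autodiff that supports only addition and multiplication; it is not the method from the slides or from any particular library.

```python
class Dual:
    """A value together with its derivative (toy forward-mode autodiff)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Constants have derivative zero.
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied to the derivative parts.
        return Dual(
            self.value * other.value,
            self.value * other.deriv + self.deriv * other.value,
        )

    __rmul__ = __mul__


def f(x):
    return x * x + 3 * x + 1


def symbolic_derivative(x):
    """The hand-derived f'(x) = 2x + 3."""
    return 2 * x + 3


def numerical_derivative(func, x, h=1e-5):
    """Central finite-difference approximation."""
    return (func(x + h) - func(x - h)) / (2 * h)


def autodiff_derivative(func, x):
    """Seed the input derivative with 1 and read off the output derivative."""
    return func(Dual(x, 1.0)).deriv


x0 = 2.0
print("symbolic: ", symbolic_derivative(x0))
print("numerical:", numerical_derivative(f, x0))
print("autodiff: ", autodiff_derivative(f, x0))
```

The autodiff result matches the symbolic one exactly at the chosen point without ever building the formula for $f'$, while the finite-difference value is only an approximation.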

## Captain Raymond Holt vs Claude Shannon

Overview

In this post I am going to introduce a pretty famous riddle, made popular recently by the police sitcom Brooklyn Nine-Nine, as well as the idea of the entropy of a probability distribution, made popular by Claude Shannon. Then I am going to go through a solution that is presented in Information Theory, Inference, and Learning Algorithms (2), a brilliant book on the topic by the late David MacKay, as well as some intuitions from his lecture series on the topic. Hopefully, by the end of it, you will be familiar with another property of a probability distribution and be able to impress your friends with your riddle-solving abilities.

The Riddle

The riddle is presented by Captain Holt (pictured above) to his team of detectives as follows (1):

‘There are 12 men on an island, 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier: you must figure which.* The island has no scales, but there is a see-saw.
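As a hedged sketch of how entropy enters the picture: with 12 men, each of whom could be the odd one out and either lighter or heavier, there are 24 possibilities. Assuming they are equally likely and that each use of the see-saw has three outcomes (left side down, right side down, balanced), the entropy measured in base 3 gives a lower bound on the number of weighings. The function and variable names are my own, and this is only the counting bound, not the weighing strategy itself.

```python
import math


def entropy(probabilities, base=2):
    """Shannon entropy of a discrete distribution, in units set by `base`."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


# 12 men, each possibly lighter or heavier: 24 equally likely states (assumed).
n_states = 12 * 2
uniform = [1 / n_states] * n_states

# Each weighing has 3 outcomes, so it can yield at most 1 unit in base 3.
information_needed = entropy(uniform, base=3)
min_weighings = math.ceil(information_needed)

print(f"information needed: {information_needed:.3f} base-3 units")
print(f"lower bound on weighings: {min_weighings}")
```

The bound says nothing about whether a strategy achieving it exists; it only rules out anything shorter.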

October 23rd, 2019 | English, Fun

## What did you expect? Some notes on the Expectation operator.

Introduction

A significant amount of focus in statistics is on making inference about the averages or means of phenomena. For example, we might be interested in the average number of goals scored per game by a football team, or the average global temperature or the average cost of a house in a particular area.

The two types of averages that we usually focus on are the sample mean from a set of data and the expectation that comes from a probability distribution. For example, if three men weigh 70kg, 80kg, and 90kg respectively, then the sample mean of their weights is $\bar x = \frac{70+80+90}{3} = 80$kg. Alternatively, if we say that the arrival times of trains are exponentially distributed with parameter $\lambda = 3$, we can use the properties of the exponential distribution to find the mean (or expectation). In this case the mean is $\mu = \frac{1}{\lambda} = \frac{1}{3}$.
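The two kinds of mean can be checked with a short sketch. The sample mean is computed directly from the three weights; for the exponential distribution, the simulation size and seed are arbitrary choices of mine, used only to show that the average of many draws settles near $\frac{1}{\lambda}$.

```python
import random
from statistics import mean

# Sample mean from data: the three weights in kg.
weights = [70, 80, 90]
sample_mean = mean(weights)
print("sample mean:", sample_mean)

# Expectation from a distribution: exponential with rate lambda = 3.
rate = 3.0
theoretical_mean = 1 / rate

# Monte Carlo check (sample size and seed are arbitrary assumptions).
rng = random.Random(42)
draws = [rng.expovariate(rate) for _ in range(200_000)]
simulated_mean = mean(draws)

print(f"theoretical expectation: {theoretical_mean:.4f}")
print(f"simulated average:       {simulated_mean:.4f}")
```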

It is this second kind of mean (which we will call the expectation from now on), along with the generalisation of taking the expectation of functions of random variables that we will focus on.…

Introduction

In this post we introduce two important concepts in multivariate calculus: the gradient vector and the directional derivative. These both extend the idea of the derivative of a function of one variable, each in a different way. The aim of this post is to clarify what these concepts are, how they differ and show that the directional derivative is maximised in the direction of the gradient vector.

The gradient vector is, simply, a vector of partial derivatives. So to find it, we can 1) find the partial derivatives and 2) put them into a vector. So far so good. Let’s start on some familiar territory: a function of 2 variables.

That is, let $f: \mathbb{R}^2 \rightarrow \mathbb{R}$ be a function of 2 variables, $x$ and $y$. Then the gradient vector can be written as:

$\nabla f(x,y) = \left [ {\begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y} \\ \end{array} } \right]$

For a more tangible example, let $f(x,y) = x^2 + 2xy$, then:

$\nabla f(x,y) = \left [ {\begin{array}{c} 2x + 2y \\ 2x \\ \end{array} } \right]$
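The example gradient can be checked numerically, and the same sketch can scan unit directions to see where the directional derivative is largest. The evaluation point $(1, 2)$, the step size and the one-degree grid of directions are arbitrary choices of mine, and the directional derivative is computed as the dot product of the gradient with a unit vector.

```python
import math


def f(x, y):
    return x ** 2 + 2 * x * y


def analytic_gradient(x, y):
    """Gradient from the partial derivatives: (2x + 2y, 2x)."""
    return (2 * x + 2 * y, 2 * x)


def numerical_gradient(func, x, y, h=1e-6):
    """Central finite differences in each coordinate."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (dfdx, dfdy)


def directional_derivative(grad, angle):
    """Dot product of the gradient with the unit vector at the given angle."""
    return grad[0] * math.cos(angle) + grad[1] * math.sin(angle)


grad = analytic_gradient(1.0, 2.0)
approx = numerical_gradient(f, 1.0, 2.0)
print("analytic gradient: ", grad)
print("numerical gradient:", approx)

# Scan unit directions on a one-degree grid and find the steepest one.
angles = [math.radians(k) for k in range(360)]
best_angle = max(angles, key=lambda a: directional_derivative(grad, a))
gradient_angle = math.atan2(grad[1], grad[0])
print(f"steepest direction on grid: {math.degrees(best_angle):.1f} degrees")
print(f"direction of the gradient:  {math.degrees(gradient_angle):.1f} degrees")
```

Up to the resolution of the grid, the direction with the largest directional derivative coincides with the direction of the gradient.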

So far, so good. Now we can generalise this for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ taking in a vector $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$.…

## The (Central) Cauchy distribution

The core of this post comes from Mathematical Statistics and Data Analysis by John A. Rice which is a useful resource for subjects such as UCT’s STA2004F.

Introduction

The Cauchy distribution has a number of interesting properties and is considered a pathological (badly behaved) distribution. What is interesting about it is that we can think about it in a number of different ways*, and we can formulate the probability density function in each of these ways. This post will handle the derivation of the Cauchy distribution as a ratio of independent standard normals and as a special case of the Student’s t distribution.

Like the normal and t-distributions, the standard form is centred on, and symmetric about, 0. But unlike these distributions, it is known for its very heavy (fat) tails. Whereas you are unlikely to see values that are significantly larger or smaller than 0 coming from a normal distribution, this is just not the case when it comes to the Cauchy distribution.…
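The heavy tails are easy to see in a small simulation. The sketch below draws standard Cauchy values as ratios of independent standard normals, as described above, and compares how often each distribution produces a value beyond 3 in absolute size. The sample size, seed, threshold and function names are arbitrary choices of mine.

```python
import random

rng = random.Random(1)
N = 100_000
THRESHOLD = 3.0


def standard_normal():
    return rng.gauss(0.0, 1.0)


def standard_cauchy():
    """Ratio of two independent standard normal draws."""
    return standard_normal() / standard_normal()


normal_draws = [standard_normal() for _ in range(N)]
cauchy_draws = [standard_cauchy() for _ in range(N)]

normal_tail = sum(abs(x) > THRESHOLD for x in normal_draws) / N
cauchy_tail = sum(abs(x) > THRESHOLD for x in cauchy_draws) / N

print(f"fraction of normal draws beyond {THRESHOLD}: {normal_tail:.4f}")
print(f"fraction of Cauchy draws beyond {THRESHOLD}: {cauchy_tail:.4f}")
print(f"largest absolute Cauchy draw: {max(abs(x) for x in cauchy_draws):.1f}")
```

Values far from 0 are rare for the normal draws but routine for the Cauchy draws, and the largest Cauchy draw is typically enormous by comparison.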