Data Scientist and Actuarial Analyst. University of Cape Town: BBusSci Actuarial Science. University College London: MSc Computational Statistics and Machine Learning. I am interested in understanding and sharing mathematical ideas, especially relating to probability and statistics.

Introduction

A key consideration when analysing stratified data is how the behaviour of each category differs and how these differences might influence the overall observations about the data. For example, a data set might contain one large category that dictates the overall behaviour, or a category whose statistics differ so much from the others that it skews the overall numbers. It is important to be aware of these features, and to look for them, to avoid drawing erroneous conclusions from your analysis. Considering the context and the source of the data, and analysing it carefully, can help to prevent this. Simpson’s paradox is an interesting result of some of these effects.

Simpson’s paradox occurs in statistics when a trend that appears in each of a number of different groups is not seen in the overall data, or when the opposite trend is seen there.

Observing the overall data might therefore lead us to draw a conclusion, but when the data is grouped we might conclude something different.…
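To make the reversal concrete, here is a minimal sketch using made-up counts (the group names, options `A` and `B`, and all numbers are hypothetical, chosen by hand to produce the effect). Option `A` has the higher success rate inside each group, yet `B` looks better once the groups are pooled, because the two options are spread very unevenly across the groups.

```python
# Hypothetical (successes, trials) counts, chosen by hand for illustration.
data = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def overall_rate(option):
    """Success rate for one option after pooling all groups together."""
    successes = sum(data[group][option][0] for group in data)
    trials = sum(data[group][option][1] for group in data)
    return successes / trials


# Success rate of each option within each group.
per_group = {
    group: {option: s / n for option, (s, n) in options.items()}
    for group, options in data.items()
}
# Success rate of each option in the pooled data.
overall = {option: overall_rate(option) for option in ("A", "B")}

for group, rates in per_group.items():
    print(group, rates)
print("overall", overall)
```

The pooled rate is a weighted average of the group rates, and the weights (group sizes) differ between the two options, which is what allows the ordering to flip.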

## The Wisdom of the Crowds

This content comes primarily from the notes of Mark Herbster (contributed to by Massi Pontil and John Shawe-Taylor) of University College London.

Introduction

The Wisdom of the Crowds, majority rule and related ideas tend to come up pretty often. Democracy is based (partly) on the majority of people being able to make the correct decision; you might often make decisions in a group of friends based on what most people want; and it is logical to take popular opinion into account when reasoning about issues where you have imperfect information. On the other hand, of course, there is the Argumentum ad Populum fallacy: a belief is not necessarily true just because it is popular.

This idea also appears in Applied Machine Learning: ensemble methods such as Random Forests, Gradient Boosted Models (especially XGBoost) and stacking of Neural Networks have resulted in more powerful models overall. This is especially notable in Kaggle competitions, where it is almost always an ensemble model (a combination of models) that achieves the best score.…
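As a rough illustration of why majority rule can help, the sketch below computes the probability that a simple majority is right. It assumes that each voter (or model) is correct independently with the same probability, which is a strong simplification that real crowds and real ensembles rarely satisfy. The function name and the example numbers are hypothetical.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """P(more than half of n independent voters are correct).

    Assumes an odd number of voters (so there are no ties) and that each
    voter is correct independently with probability p_correct.
    """
    if n_voters % 2 == 0:
        raise ValueError("n_voters must be odd to avoid ties")
    threshold = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(threshold, n_voters + 1)
    )


for n in (1, 3, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 4))
```

Under these assumptions the majority improves on a single voter whenever each voter is right more than half the time, and gets worse than a single voter when each is right less than half the time.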

## Automatic Differentiation

Much of this content is based on lecture slides from Professor David Barber at University College London. Resources relating to this can be found at: www.cs.ucl.ac.uk/staff/D.Barber/brml

What is Autodiff?

Autodiff, or Automatic Differentiation, is a method of determining the exact derivative of a function with respect to its inputs. It is widely used in machine learning. In this post I will give an overview of what autodiff is and why it is a useful tool.

The above is not a very helpful definition, so we can compare autodiff first to symbolic differentiation and numerical approximations before going into how it works.

Symbolic differentiation is what we do when we calculate derivatives by hand, i.e. given a function $f$, we find a new function $f'$. This is really good when we want to know how functions behave across all inputs. For example, if we had $f(x) = x^2 + 3x + 1$, we could find the derivative as $f'(x) = 2x + 3$ and then evaluate it at any value of $x$.…
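The three approaches can be compared on the example function $f(x) = x^2 + 3x + 1$. The sketch below uses forward-mode autodiff with dual numbers, which is only one way of implementing autodiff and is an assumption here rather than necessarily the approach this post goes on to describe. The `Dual` class and the helper names are hypothetical.

```python
class Dual:
    """A number value + deriv * eps with eps**2 == 0 (forward-mode autodiff)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants, i.e. with zero derivative.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,
        )

    __rmul__ = __mul__


def f(x):
    return x * x + 3 * x + 1


def symbolic_derivative(x):
    # Derived by hand: f'(x) = 2x + 3
    return 2 * x + 3


def finite_difference(func, x, h=1e-6):
    # Numerical approximation: accurate only up to truncation and rounding error.
    return (func(x + h) - func(x - h)) / (2 * h)


def autodiff_derivative(func, x):
    # Seed the input with derivative 1 and read the derivative off the output.
    return func(Dual(x, 1.0)).deriv


x0 = 2.0
print(symbolic_derivative(x0), finite_difference(f, x0), autodiff_derivative(f, x0))
```

Note that `autodiff_derivative` never sees the formula $2x + 3$: it only runs the original code for `f` on a different number type, and the result matches the symbolic answer rather than approximating it.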

## Captain Raymond Holt vs Claude Shannon

Overview

In this post I am going to introduce a pretty famous riddle, made popular recently by the police sitcom Brooklyn Nine-Nine, as well as the idea of the entropy of a probability distribution, made popular by Claude Shannon. Then I am going to go through a solution that is presented in Information Theory, Inference, and Learning Algorithms (2), a brilliant book on the topic by the late David MacKay, as well as some intuitions from his lecture series on the topic. Hopefully, by the end of it, you will be familiar with another property of a probability distribution and be able to impress your friends with your riddle-solving abilities.

The Riddle

The riddle is presented by Captain Holt (pictured above) to his team of detectives as follows (1):

‘There are 12 men on an island, 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier: you must figure which.* The island has no scales, but there is a see-saw.
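As a rough sketch of how entropy relates to the riddle, we can count the possibilities: any of the 12 men could be the odd one out, and he could be lighter or heavier, giving 24 hypotheses. The code below assumes that these hypotheses are equally likely and that one use of the see-saw has three possible outcomes (left side down, right side down, balanced). Both are assumptions made for illustration, and the variable names are hypothetical.

```python
from math import ceil, log2


def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


n_men = 12
n_hypotheses = 2 * n_men  # each man could be the odd one, lighter or heavier

# Assumption: all hypotheses are equally likely before any weighing.
prior = [1 / n_hypotheses] * n_hypotheses
bits_needed = entropy(prior)

# Assumption: a see-saw reading has three outcomes, so it can convey at most
# log2(3) bits of information.
bits_per_weighing = log2(3)
min_weighings = ceil(bits_needed / bits_per_weighing)

print(round(bits_needed, 3), round(bits_per_weighing, 3), min_weighings)
```

This only gives a lower bound on the number of see-saw uses; it does not show that a strategy achieving the bound exists.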

October 23rd, 2019 | English, Fun

## What did you expect? Some notes on the Expectation operator.

Introduction

A significant amount of focus in statistics is on making inference about the averages or means of phenomena. For example, we might be interested in the average number of goals scored per game by a football team, or the average global temperature or the average cost of a house in a particular area.

The two types of averages that we usually focus on are the sample mean from a set of data and the expectation that comes from a probability distribution. For example, if three men weigh 70kg, 80kg, and 90kg respectively, then the sample mean of their weights is $\bar x = \frac{70+80+90}{3} = 80$. Alternatively, if we say that the waiting times between train arrivals are exponentially distributed with parameter $\lambda = 3$, we can use the properties of the exponential distribution to find the mean (or expectation). In this case the mean is $\mu = \frac{1}{\lambda} = \frac{1}{3}$.

It is this second kind of mean (which we will call the expectation from now on), along with the generalisation of taking the expectation of functions of random variables, that we will focus on.…
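The two kinds of mean can be put side by side in a short sketch. It computes the sample mean of the three weights, then simulates exponential draws with $\lambda = 3$ to check that their average approaches $\frac{1}{\lambda}$. The last part estimates the expectation of a function of the random variable, $E[X^2]$, and compares it with the known value $\frac{2}{\lambda^2}$ for the exponential distribution. The seed, sample size and variable names are arbitrary choices.

```python
import random
from statistics import mean

# Sample mean from a set of data.
weights_kg = [70, 80, 90]
sample_mean = mean(weights_kg)

# Expectation from a probability distribution: Exponential(rate) has mean 1 / rate.
rate = 3.0
expectation = 1 / rate

# Simulation check: the average of many draws should be close to the expectation.
rng = random.Random(0)
draws = [rng.expovariate(rate) for _ in range(100_000)]
simulated_mean = mean(draws)

# Expectation of a function of X: E[X^2] = 2 / rate^2 for the exponential.
second_moment = 2 / rate**2
simulated_second_moment = mean(x * x for x in draws)

print(sample_mean, expectation, round(simulated_mean, 4))
print(round(second_moment, 4), round(simulated_second_moment, 4))
```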

Introduction

In this post we introduce two important concepts in multivariate calculus: the gradient vector and the directional derivative. These both extend the idea of the derivative of a function of one variable, each in a different way. The aim of this post is to clarify what these concepts are, how they differ and show that the directional derivative is maximised in the direction of the gradient vector.

The gradient vector is, simply, a vector of partial derivatives. So to find it, we can 1) find the partial derivatives and 2) put them into a vector. So far so good. Let’s start on some familiar territory: a function of 2 variables.

That is, let $f: \mathbb{R}^2 \rightarrow \mathbb{R}$ be a function of 2 variables, $x$ and $y$. Then the gradient vector can be written as:

$\nabla f(x,y) = \left [ {\begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y} \\ \end{array} } \right]$

For a more tangible example, let $f(x,y) = x^2 + 2xy$, then:

$\nabla f(x,y) = \left [ {\begin{array}{c} 2x + 2y \\ 2x \\ \end{array} } \right]$

So far, so good. Now we can generalise this for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ taking in a vector $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$.…
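The claim from the introduction, that the directional derivative is maximised in the direction of the gradient, can be checked numerically for the example $f(x,y) = x^2 + 2xy$. The sketch below assumes the standard formula for the directional derivative along a unit vector $\mathbf{u}$, namely $\nabla f \cdot \mathbf{u}$, which this excerpt has not yet stated. The evaluation point and the grid of angles are arbitrary choices.

```python
import math


def f(x, y):
    return x**2 + 2 * x * y


def gradient(x, y):
    # Partial derivatives of f, matching the worked example in the text.
    return (2 * x + 2 * y, 2 * x)


def directional_derivative(x, y, angle):
    """Derivative of f at (x, y) along the unit vector (cos angle, sin angle)."""
    gx, gy = gradient(x, y)
    return gx * math.cos(angle) + gy * math.sin(angle)


point = (1.0, 2.0)
gx, gy = gradient(*point)
gradient_angle = math.atan2(gy, gx)  # direction in which the gradient points
gradient_norm = math.hypot(gx, gy)   # length of the gradient vector

# Try many directions around the circle and keep the steepest one.
n_angles = 3600
angles = [2 * math.pi * k / n_angles for k in range(n_angles)]
best_angle = max(angles, key=lambda a: directional_derivative(*point, a))

print(gradient_angle, best_angle)
print(gradient_norm, directional_derivative(*point, best_angle))
```

The steepest direction found on the grid agrees with the gradient direction up to the grid spacing, and the maximum value agrees with the length of the gradient.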

## The (Central) Cauchy distribution

The core of this post comes from Mathematical Statistics and Data Analysis by John A. Rice, which is a useful resource for subjects such as UCT’s STA2004F.

Introduction

The Cauchy distribution has a number of interesting properties and is considered a pathological (badly behaved) distribution. What is interesting about it is that we can think about it in a number of different ways*, and we can formulate the probability density function in each of these ways. This post will handle the derivation of the Cauchy distribution as a ratio of independent standard normals and as a special case of the Student’s t distribution.

Like the normal and t-distributions, the standard form is centred on, and symmetric about, 0. But unlike these distributions, it is known for its very heavy (fat) tails. Whereas you are unlikely to see values that are much larger or smaller than 0 coming from a standard normal distribution, this is just not the case when it comes to the Cauchy distribution.…
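A quick simulation gives a feel for both points: the ratio of two independent standard normals behaves like a standard Cauchy variable, and its tails are far heavier than those of the normal. The sketch compares the share of simulated values beyond a threshold with the standard Cauchy tail probability $1 - \frac{2}{\pi}\arctan(t)$, which follows from the standard Cauchy CDF. The seed, sample size and threshold are arbitrary choices.

```python
import math
import random

rng = random.Random(1)
n = 100_000

# Ratio of two independent standard normals.
ratios = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(n)]
# Plain standard normals, for comparison.
normals = [rng.gauss(0, 1) for _ in range(n)]


def fraction_beyond(values, threshold):
    """Share of values whose absolute value exceeds the threshold."""
    return sum(abs(v) > threshold for v in values) / len(values)


def cauchy_tail(threshold):
    # Standard Cauchy CDF is 1/2 + arctan(x) / pi, so
    # P(|X| > t) = 1 - 2 * arctan(t) / pi.
    return 1 - 2 * math.atan(threshold) / math.pi


threshold = 5
print("ratio of normals:", fraction_beyond(ratios, threshold))
print("Cauchy theory:   ", cauchy_tail(threshold))
print("standard normal: ", fraction_beyond(normals, threshold))
```

Roughly one in eight of the ratios lands more than 5 away from 0, whereas a standard normal sample of this size will rarely contain even one such value.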

## K-means: Intuitions, Maths and Percy Tau

Much of this content is based on lecture slides from Professor David Barber at University College London. Resources relating to this can be found at: www.cs.ucl.ac.uk/staff/D.Barber/brml

The K-means algorithm

The K-means algorithm is one of the simplest unsupervised learning* algorithms. The aim of the K-means algorithm is, given a set of observations $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$, to group these observations into $K$ different groups in the best way possible (‘best way’ here refers to minimising a loss/cost/objective function).

This is a clustering algorithm, where we want to assign each observation to a group that has other similar observations in it. This could be useful, for example, to split Facebook users into groups that will each be shown a different advertisement.

* unsupervised learning is performed on data without labels, i.e. we have a group of data points $x_1, \dots, x_n$ (scalar or vector) and we want to find something out about how this data is structured.…
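A minimal sketch of the usual K-means iteration (often called Lloyd's algorithm) is given below. It assumes that the cost being minimised is the sum of squared Euclidean distances from each observation to its assigned group centre, which is the standard choice but is not stated in this excerpt. The toy data, function names and default settings are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centres):
    """Index of the nearest centre for each point."""
    return [
        min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
        for p in points
    ]


def k_means(points, k, n_iterations=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct observations
    for _ in range(n_iterations):
        labels = assign(points, centres)  # assignment step
        for j in range(k):  # update step
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centre if a group is empty
                centres[j] = centroid(members)
    labels = assign(points, centres)
    cost = sum(squared_distance(p, centres[l]) for p, l in zip(points, labels))
    return centres, labels, cost


# Two well-separated clumps of toy points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels, cost = k_means(points, k=2)
print(centres, labels, round(cost, 4))
```

K-means is only guaranteed to reach a local minimum of the cost, so in practice it is common to run it from several random starting points and keep the best result.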

September 7th, 2019 | English

## p-values (part 3): meta distribution of p-values

Introduction

So far we have discussed what p-values are and how they are calculated, as well as how bad experiments can lead to artificially small p-values. The next thing that we will look at comes from a paper by N.N. Taleb (1), in which he derives the meta-distribution of p-values, i.e. what range of p-values we might expect if we repeatedly ran an experiment that sampled from the same underlying distribution.

The derivations are pretty in depth and this content and the implications of the results are pretty new to me, so any discrepancies/misinterpretations found should be pointed out and/or discussed.

Thankfully, in this video (2) there is an explanation that covers some of what the paper says as well as some Monte-Carlo simulations. My discussion will focus on some simulations of my own that are based on those that are done in the video.

We have already discussed what p-values mean and how they can go wrong.…
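To get a feel for what "the distribution of p-values" means, the sketch below repeats the same experiment many times and collects the p-value from each run. It is not the derivation from the paper or the simulation from the video. It assumes a simple setup chosen for illustration: normal data with known standard deviation 1, a small true mean, and a one-sided z-test of the null hypothesis that the mean is 0. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean, median


def one_sided_p_value(sample, sigma=1.0):
    """p-value of a one-sided z-test of H0: mean = 0 against mean > 0."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 1 - NormalDist().cdf(z)


def simulate_p_values(true_mean, n_obs, n_experiments, seed=0):
    """Run the same experiment repeatedly and return the p-value of each run."""
    rng = random.Random(seed)
    return [
        one_sided_p_value([rng.gauss(true_mean, 1.0) for _ in range(n_obs)])
        for _ in range(n_experiments)
    ]


p_values = simulate_p_values(true_mean=0.3, n_obs=30, n_experiments=5000)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

print("min:", min(p_values), "max:", max(p_values))
print("median:", median(p_values), "mean:", mean(p_values))
print("share below 0.05:", share_significant)
```

Even though every run samples from the same underlying distribution, the p-values are spread over a wide range, so a single p-value says little about what a replication would produce.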

## p-values (part 2): p-Hacking. Why drinking red wine is not the same as exercising

What is p-hacking?

You might have heard about a reproducibility problem with scientific studies. Or you might have heard that drinking a glass of red wine every evening is equivalent to an hour’s worth of exercise.

Part of the reason that you might have heard about these things is p-hacking: ‘torturing the data until it confesses’. The reason for doing this is mostly pressure on researchers to find positive results (as these are more likely to be published), but it may also arise from misapplication of statistical procedures or bad experimental design.

Some of the content here is based on a more serious video from Veritasium: https://www.youtube.com/watch?v=42QuXLucH3Q. John Oliver has also spoken about this on Last Week Tonight, for those who are interested in some more examples of science that makes its way onto morning talk shows.

p-hacking can be done in a number of ways: basically anything that is done, either consciously or unconsciously, to produce statistically significant results where there aren’t any.…
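One such way, measuring many outcomes and reporting whichever one comes out significant, can be sketched with a simulation. Here every outcome is pure noise, so there is no real effect to find, yet most simulated studies still turn up at least one "significant" result. The setup (normal data with known standard deviation 1, a two-sided z-test, 20 independent outcomes per study) and all names are assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(sample):
    """p-value of a two-sided z-test of H0: mean = 0, with known sigma = 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def study_finds_something(n_outcomes, n_obs, rng, alpha=0.05):
    """True if at least one of the pure-noise outcomes looks significant."""
    return any(
        two_sided_p_value([rng.gauss(0, 1) for _ in range(n_obs)]) < alpha
        for _ in range(n_outcomes)
    )


rng = random.Random(0)
n_studies = 2000
n_outcomes = 20

hits = sum(study_finds_something(n_outcomes, 10, rng) for _ in range(n_studies))
simulated_rate = hits / n_studies

# With independent tests at level 0.05, P(at least one false positive).
theoretical_rate = 1 - (1 - 0.05) ** n_outcomes

print(simulated_rate, round(theoretical_rate, 4))
```

With 20 independent outcomes the chance of at least one false positive is about 64%, far above the nominal 5% of any single test.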