Simpson’s Paradox

Introduction

A key consideration when analysing stratified data is how the behaviour of each category differs and how these differences might influence the overall observations about the data. For example, a data set might be split so that one large category dictates the overall behaviour, or there may be a category whose statistics differ significantly from the others and skew the overall numbers. It is important to be aware of these features of the data, and to look for them, to avoid drawing erroneous conclusions from your analysis. Knowing the context and source of the data, and analysing it carefully, helps here. Simpson’s paradox is an interesting result of some of these effects.

The Paradox

Simpson’s paradox occurs in statistics when a trend appears in each of several groups but disappears, or reverses, when the groups are combined.

Observing the overall data might therefore lead us to draw a conclusion, but when the data is grouped we might conclude something different.…
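A small numerical sketch can make the reversal concrete. The counts below are illustrative numbers chosen for this example (they are not taken from the post), and the names `data`, `rate` and `overall` are hypothetical helpers. Treatment A has the higher success rate in each group, yet treatment B has the higher success rate once the groups are pooled.

```python
# Illustrative counts chosen for this sketch (not taken from the post):
# (successes, trials) for two treatments, each split into two groups.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials


def overall(groups):
    """Pool all groups of one treatment and return the overall success rate."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials


# Within each group, compare the two treatments.
for group in ("small", "large"):
    a = rate(*data["A"][group])
    b = rate(*data["B"][group])
    print(f"{group:>5}: A = {a:.3f}, B = {b:.3f}")

# After pooling the groups, compare again.
print(f"total: A = {overall(data['A']):.3f}, B = {overall(data['B']):.3f}")
```

The reversal comes from the group sizes: in these numbers A is mostly applied to the harder "large" group, while B is mostly applied to the easier "small" group, so the pooled rates are weighted very differently.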

By | January 5th, 2020|English, Level: Simple|1 Comment

What is mathematics?

Below you will find some thoughts on this broad question, which I encourage you to think about. What is your vision of mathematics? It is most probably the result of your own experience with the subject and of the traumas that happened along the way. Realising this could make you more conscious of your relationship with the subject and of the walls you might have built against it, or against parts of it. In a sense, by understanding your biases and blockages and thinking objectively about the value of mathematics, you allow yourself to acquire the full set of skills it offers for building your own greatest life.

Many of the thoughts that follow are taken from a paper I recommend reading: Teaching and Learning “What is Mathematics and why we should ask, where one should experience and learn that and how to teach it”, by Gunter M.…

By | December 16th, 2019|Uncategorized|0 Comments

Reasoning and making sense: a pillar of mathematics?

An essential part of learning mathematics is reasoning and making sense. What exactly does this mean?

When students are given a problem, they need to make sense of it from their own perspective, which is unique to each individual. This often involves a real struggle, and the important next step is to stay motivated and curious, to persevere, and not to give up after the first few attempts. It may also require a healthy relationship with mistakes.

Students will have to develop their own strategy to solve a given problem. That might mean first translating it into their own language, using their own words and background knowledge to understand the actual question they are attempting to answer.

They will have to build bridges in their minds to similar problems they have solved in the past, even though those problems might seem different. These bridges become easier to build with practice and experience. Sometimes a connection will not work, and other connections will need to be tried until a suitable one is found.…

By | December 16th, 2019|Uncategorized|0 Comments

The Wisdom of the Crowds

This content comes primarily from the notes of Mark Herbster (contributed to by Massi Pontil and John Shawe-Taylor) of University College London.

Introduction

The Wisdom of the Crowds, majority rule and related ideas come up pretty often. Democracy is based (partly) on the majority of people being able to make the correct decision; in a group of friends you might often decide based on what most people want; and it is logical to take popular opinion into account when reasoning about issues where you have imperfect information. On the other hand, of course, there is the Argumentum ad Populum fallacy, which reminds us that a belief is not necessarily true just because it is popular.

This idea also appears in Applied Machine Learning: ensemble methods such as Random Forests, Gradient Boosted Models (especially XGBoost) and stacked Neural Networks have resulted in more powerful models overall. This is especially notable in Kaggle competitions, where it is almost always an ensemble model (a combination of models) that achieves the best score.…
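One way to see why majority rule can help is a short calculation under a strong simplifying assumption: every voter (or model) is correct independently with the same probability p. The function name `majority_correct` and the value p = 0.6 are hypothetical choices for this sketch; real ensemble members are rarely independent, so this is an idealised picture rather than a description of how any particular ensemble behaves.

```python
from math import comb


def majority_correct(n_voters, p):
    """Probability that a strict majority of n_voters is correct.

    Assumes each voter is correct independently with probability p,
    and that n_voters is odd so there are no ties.
    """
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# With p > 0.5 the majority becomes more reliable as the crowd grows.
for n in (1, 3, 11, 101):
    print(n, round(majority_correct(n, 0.6), 4))
```

Under the same assumptions the effect reverses when p is below 0.5: a larger crowd of voters who are each more often wrong than right makes the majority less reliable, not more.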

By | November 15th, 2019|Uncategorized|0 Comments

Automatic Differentiation

Much of this content is based on lecture slides from Professor David Barber at University College London; resources relating to this can be found at www.cs.ucl.ac.uk/staff/D.Barber/brml

What is Autodiff?

Autodiff, or Automatic Differentiation, is a method of determining the exact derivative of a function with respect to its inputs. It is widely used in machine learning. In this post I will give an overview of what autodiff is and why it is a useful tool.

The above is not a very helpful definition, so we can compare autodiff first to symbolic differentiation and numerical approximations before going into how it works.

Symbolic differentiation is what we do when we calculate derivatives by hand, i.e. given a function f, we find a new function f'. This is really good when we want to know how a function behaves across all inputs. For example, if we had f(x) = x^2 + 3x + 1, we could find the derivative f'(x) = 2x + 3 and then evaluate it at any value of x.…
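The three approaches can be compared on the example f(x) = x^2 + 3x + 1. The sketch below is only an illustration: the `Dual` class is a hypothetical, minimal forward-mode autodiff built on dual numbers, which is one of several ways to implement autodiff and not necessarily the one the post goes on to describe. The step size `h` in the numerical version is likewise an arbitrary choice.

```python
class Dual:
    """A value together with its derivative (a dual number val + der*eps)."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants, whose derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def f(x):
    return x * x + 3 * x + 1


def f_prime_symbolic(x):
    """Derivative found by hand: f'(x) = 2x + 3."""
    return 2 * x + 3


def f_prime_numerical(x, h=1e-5):
    """Central finite difference: an approximation, not an exact value."""
    return (f(x + h) - f(x - h)) / (2 * h)


def f_prime_autodiff(x):
    """Evaluate f on a dual number seeded with derivative 1."""
    return f(Dual(x, 1.0)).der


x0 = 2.0
print(f_prime_symbolic(x0), f_prime_numerical(x0), f_prime_autodiff(x0))
```

Note that `f` itself is written once and never differentiated by hand in the autodiff version: the derivative is carried along by the arithmetic operations as the function is evaluated.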

By | October 23rd, 2019|English, Uncategorized|0 Comments

Captain Raymond Holt vs Claude Shannon

Overview

In this post I am going to introduce a pretty famous riddle, made popular recently by the police sitcom Brooklyn Nine-Nine, as well as the idea of the entropy of a probability distribution, made popular by Claude Shannon. Then I am going to go through a solution presented in Information Theory, Inference, and Learning Algorithms (2), a brilliant book on the topic by the late David MacKay, as well as some intuitions from his lecture series on the topic. Hopefully, by the end of it, you will be familiar with another property of a probability distribution and be able to impress your friends with your riddle-solving abilities.

The Riddle

The riddle is presented by Captain Holt (pictured above) to his team of detectives as follows (1):

‘There are 12 men on an island, 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier: you must figure which.* The island has no scales, but there is a see-saw.

By | October 23rd, 2019|English, Fun|0 Comments

What did you expect? Some notes on the Expectation operator.

Introduction

A significant amount of focus in statistics is on making inferences about the averages or means of phenomena. For example, we might be interested in the average number of goals scored per game by a football team, the average global temperature, or the average cost of a house in a particular area.

The two types of averages that we usually focus on are the sample mean from a set of data and the expectation that comes from a probability distribution. For example, if three men weigh 70kg, 80kg, and 90kg respectively, then the sample mean of their weights is \bar x = \frac{70+80+90}{3} = 80. Alternatively, if we say that the arrival times of trains are exponentially distributed with parameter \lambda = 3, we can use the properties of the exponential distribution to find the mean (or expectation). In this case the mean is \mu = \frac{1}{\lambda} = \frac{1}{3}.

It is this second kind of mean (which we will call the expectation from now on), along with its generalisation to expectations of functions of random variables, that we will focus on.…
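The two kinds of mean can be placed side by side in a short sketch. The sample mean is computed directly from the three weights, while the expectation 1/\lambda is checked by simulation. The seed, the number of draws and the variable names are arbitrary choices made for this illustration.

```python
import random

# Sample mean: computed from observed data.
weights_kg = [70, 80, 90]
sample_mean = sum(weights_kg) / len(weights_kg)

# Expectation: a property of the distribution itself.
rate = 3.0                      # the parameter lambda from the text
theoretical_mean = 1 / rate

# The average of many simulated draws should land close to 1 / lambda.
rng = random.Random(0)          # fixed seed so the run is repeatable
draws = [rng.expovariate(rate) for _ in range(100_000)]
simulated_mean = sum(draws) / len(draws)

print(sample_mean, theoretical_mean, simulated_mean)
```

The simulated average is itself a sample mean, so it only approximates the expectation; the approximation improves as the number of draws grows.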

By | October 9th, 2019|English, Level: Simple, Uncategorized|0 Comments

The Gradient Vector

Introduction

In this post we introduce two important concepts in multivariate calculus: the gradient vector and the directional derivative. These both extend the idea of the derivative of a function of one variable, each in a different way. The aim of this post is to clarify what these concepts are and how they differ, and to show that the directional derivative is maximised in the direction of the gradient vector.

The gradient vector

The gradient vector is, simply, a vector of partial derivatives. So to find it, we can 1) find the partial derivatives and 2) put them into a vector. So far so good. Let’s start on some familiar territory: a function of 2 variables.

That is, let f: \mathbb{R}^2 \rightarrow \mathbb{R} be a function of 2 variables, x,y. Then the gradient vector can be written as:

\nabla f(x,y) =    \left [ {\begin{array}{c}    \frac{\partial f(x,y)}{\partial x}  \\    \frac{\partial f(x,y)}{\partial y}   \\    \end{array} } \right]

For a more tangible example, let f(x,y) = x^2 + 2xy, then:

\nabla f(x,y) =    \left [ {\begin{array}{c}    2x + 2y  \\    2x \\    \end{array} } \right]
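As a quick sanity check of this example, the analytic gradient can be compared against a finite-difference approximation of each partial derivative. The function names, the step size `h` and the test point (1, 2) are arbitrary choices for this sketch.

```python
def f(x, y):
    return x**2 + 2 * x * y


def grad_f(x, y):
    """Analytic gradient from the example: [2x + 2y, 2x]."""
    return [2 * x + 2 * y, 2 * x]


def grad_numerical(func, x, y, h=1e-6):
    """Central-difference approximation of the two partial derivatives."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)  # vary x, hold y fixed
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)  # vary y, hold x fixed
    return [df_dx, df_dy]


point = (1.0, 2.0)
print(grad_f(*point))
print(grad_numerical(f, *point))
```

Each partial derivative is obtained by nudging one input while holding the other fixed, which is exactly the recipe of finding the partial derivatives and putting them into a vector.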

So far, so good. Now we can generalise this for a function f: \mathbb{R}^n \rightarrow \mathbb{R} taking in a vector \mathbf{x} = (x_1, x_2, x_3, \dots, x_n).…

By | September 24th, 2019|English, Undergraduate|0 Comments

The (Central) Cauchy distribution

The core of this post comes from Mathematical Statistics and Data Analysis by John A. Rice which is a useful resource for subjects such as UCT’s STA2004F.

Introduction

The Cauchy distribution has a number of interesting properties and is considered a pathological (badly behaved) distribution. What is interesting about it is that we can think about it in a number of different ways*, and we can formulate the probability density function in each of these ways. This post will handle the derivation of the Cauchy distribution as a ratio of independent standard normals and as a special case of the Student’s t distribution.

Like the normal and t distributions, the standard form is centred on, and symmetric about, 0. But unlike these distributions, it is known for its very heavy (fat) tails. Whereas you are unlikely to see values far from 0 coming from a standard normal distribution, this is just not the case when it comes to the Cauchy distribution.…
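The heavy tails are easy to see in a simulation that uses the ratio-of-normals construction mentioned above. The sketch below is only an illustration: the seed, the sample size and the threshold of 3 are arbitrary choices, and `cauchy_draw` and `tail_fraction` are hypothetical helper names.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000


def cauchy_draw():
    """Ratio of two independent standard normal draws."""
    denominator = 0.0
    while denominator == 0.0:  # guard against dividing by zero
        denominator = rng.gauss(0, 1)
    return rng.gauss(0, 1) / denominator


def tail_fraction(values, threshold=3.0):
    """Fraction of values further than `threshold` from 0."""
    return sum(abs(v) > threshold for v in values) / len(values)


normal = [rng.gauss(0, 1) for _ in range(n)]
cauchy = [cauchy_draw() for _ in range(n)]

print("normal:", tail_fraction(normal))
print("cauchy:", tail_fraction(cauchy))
```

For the standard normal only a fraction of a percent of draws fall more than 3 away from 0, whereas for the standard Cauchy roughly a fifth of them do.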

By | September 17th, 2019|English, Level: intermediate|0 Comments

K-means: Intuitions, Maths and Percy Tau

Much of this content is based on lecture slides from Professor David Barber at University College London; resources relating to this can be found at www.cs.ucl.ac.uk/staff/D.Barber/brml

The K-means algorithm

The K-means algorithm is one of the simplest unsupervised learning* algorithms. The aim of the K-means algorithm is, given a set of observations \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n, to group these observations into K different groups in the best way possible (‘best way’ here refers to minimising a loss/cost/objective function).

This is a clustering algorithm, where we want to assign each observation to a group that has other similar observations in it. This could be useful, for example, to split Facebook users into groups that will each be shown a different advertisement.

* unsupervised learning is performed on data without labels, i.e. we have a group of data points x_1, \dots, x_n (scalar or vector) and we want to find something out about how this data is structured.…
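A minimal sketch of the idea, under some assumptions not stated in the excerpt: the loss is taken to be the sum of squared Euclidean distances from each point to its assigned centre, and the standard alternating scheme (assign each point to its nearest centre, then move each centre to the mean of its points) is used. The function name `kmeans`, the naive initialisation from the first K points and the toy data are all hypothetical choices for illustration.

```python
def kmeans(points, k, n_iter=20):
    """Alternate assignment and update steps on a list of 2-D points."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        for i, (px, py) in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda j: (px - centroids[j][0]) ** 2
                + (py - centroids[j][1]) ** 2,
            )
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

On data this clean the result does not depend much on the starting centres, but in general K-means can settle in different local minima of the loss depending on the initialisation, so it is common to run it several times from different starting points.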

By | September 7th, 2019|English|0 Comments