What did you expect? Some notes on the Expectation operator.

Introduction

A significant amount of focus in statistics is on making inference about the averages or means of phenomena. For example, we might be interested in the average number of goals scored per game by a football team, or the average global temperature or the average cost of a house in a particular area.

The two types of averages that we usually focus on are the sample mean from a set of data and the expectation that comes from a probability distribution. For example, if three men weigh 70kg, 80kg and 90kg respectively, then the sample mean of their weights is \bar x = \frac{70+80+90}{3} = 80 (kg). Alternatively, if we say that the arrival times of trains are exponentially distributed with parameter \lambda = 3, we can use the properties of the exponential distribution to find the mean (or expectation). In this case the mean is \mu = \frac{1}{\lambda} = \frac{1}{3}.
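As a quick illustration of the difference between the two kinds of mean, here is a minimal Python sketch. The simulation size and the random seed are arbitrary choices made for this example, not something taken from the post.

```python
import random
from statistics import mean

# Sample mean: an average computed from observed data.
weights_kg = [70, 80, 90]
sample_mean = mean(weights_kg)  # (70 + 80 + 90) / 3

# Expectation: a property of the distribution itself.
# For an exponential distribution with rate lam, E[X] = 1 / lam.
lam = 3
expectation = 1 / lam

# A large simulated sample should have a sample mean close to the expectation.
rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
draws = [rng.expovariate(lam) for _ in range(100_000)]
simulated_mean = mean(draws)

print(sample_mean, expectation, simulated_mean)
```

The first number comes purely from data, the second purely from the distribution, and the third shows the sample mean of simulated data approaching the expectation.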

It is this second kind of mean (which we will call the expectation from now on), along with the generalisation of taking the expectation of functions of random variables, that we will focus on.…

By | October 9th, 2019|English, Level: Simple, Uncategorized|0 Comments

The Gradient Vector

Introduction

In this post we introduce two important concepts in multivariate calculus: the gradient vector and the directional derivative. These both extend the idea of the derivative of a function of one variable, each in a different way. The aim of this post is to clarify what these concepts are, how they differ and show that the directional derivative is maximised in the direction of the gradient vector.

The gradient vector

The gradient vector is simply a vector of partial derivatives. So to find it, we can 1) find the partial derivatives and 2) put them into a vector. So far so good. Let’s start on some familiar territory: a function of 2 variables.

That is, let f: \mathbb{R}^2 \rightarrow \mathbb{R} be a function of 2 variables, x,y. Then the gradient vector can be written as:

\nabla f(x,y) =    \left [ {\begin{array}{c}    \frac{\partial f(x,y)}{\partial x}  \\    \frac{\partial f(x,y)}{\partial y}   \\    \end{array} } \right]

For a more tangible example, let f(x,y) = x^2 + 2xy, then:

\nabla f(x,y) =    \left [ {\begin{array}{c}    2x + 2y  \\    2x \\    \end{array} } \right]
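One way to sanity-check a gradient like this is to compare it with a finite-difference approximation. The sketch below does that for f(x,y) = x^2 + 2xy; the helper names, the step size h and the test point are assumptions made for illustration.

```python
def f(x, y):
    """The example function from the text: f(x, y) = x^2 + 2xy."""
    return x**2 + 2 * x * y

def grad_f(x, y):
    """Analytic gradient from the text: [2x + 2y, 2x]."""
    return (2 * x + 2 * y, 2 * x)

def numerical_grad(func, x, y, h=1e-6):
    """Central-difference approximation of the two partial derivatives."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

point = (1.0, 2.0)  # arbitrary test point
analytic = grad_f(*point)
approx = numerical_grad(f, *point)
print(analytic, approx)
```

The two results should agree to several decimal places, which is a useful check whenever partial derivatives are worked out by hand.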

So far, so good. Now we can generalise this for a function f: \mathbb{R}^n \rightarrow \mathbb{R} taking in a vector \mathbf{x} = (x_1, x_2, x_3, \dots, x_n).…

By | September 24th, 2019|English, Undergraduate|0 Comments

The (Central) Cauchy distribution

The core of this post comes from Mathematical Statistics and Data Analysis by John A. Rice which is a useful resource for subjects such as UCT’s STA2004F.

Introduction

The Cauchy distribution has a number of interesting properties and is considered a pathological (badly behaved) distribution. What is interesting about it is that we can think about it in a number of different ways*, and we can formulate the probability density function in each of these ways. This post will handle the derivation of the Cauchy distribution as a ratio of independent standard normals and as a special case of the Student’s t distribution.

Like the normal and t-distributions, the standard form is centred on, and symmetric about, 0. But unlike these distributions, it is known for its very heavy (fat) tails. Whereas you are unlikely to see values that are significantly larger or smaller than 0 coming from a normal distribution, this is just not the case when it comes to the Cauchy distribution.…
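The heavy tails are easy to see in a small simulation. The sketch below generates Cauchy draws as a ratio of two independent standard normals and compares how often values fall more than 3 units from 0 with the same figure for a standard normal. The threshold of 3, the sample size and the seed are arbitrary choices for this illustration.

```python
import random

rng = random.Random(1)  # arbitrary seed for reproducibility
n = 100_000

def ratio_of_normals(rng):
    """One draw of Z1 / Z2 with Z1, Z2 independent standard normals."""
    z1 = rng.gauss(0, 1)
    z2 = rng.gauss(0, 1)
    while z2 == 0.0:  # guard against division by zero (practically never happens)
        z2 = rng.gauss(0, 1)
    return z1 / z2

cauchy_draws = [ratio_of_normals(rng) for _ in range(n)]
normal_draws = [rng.gauss(0, 1) for _ in range(n)]

def tail_fraction(draws, threshold=3.0):
    """Share of draws lying more than `threshold` away from 0."""
    return sum(abs(x) > threshold for x in draws) / len(draws)

cauchy_tail = tail_fraction(cauchy_draws)
normal_tail = tail_fraction(normal_draws)
print(cauchy_tail, normal_tail)
```

Roughly a fifth of the ratio draws land beyond 3 in absolute value, while for the normal draws the share is well under one percent.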

By | September 17th, 2019|English, Level: intermediate|0 Comments

K-means: Intuitions, Maths and Percy Tau

Much of this content is based on lecture slides from Professor David Barber at University College London; resources relating to this can be found at www.cs.ucl.ac.uk/staff/D.Barber/brml

The K-means algorithm

The K-means algorithm is one of the simplest unsupervised learning* algorithms. The aim of the K-means algorithm is, given a set of observations \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n, to group these observations into K different groups in the best way possible (‘best way’ here refers to minimising a loss/cost/objective function).

This is a clustering algorithm, where we want to assign each observation to a group that has other similar observations in it. This could be useful, for example, to split Facebook users into groups that will each be shown a different advertisement.

* unsupervised learning is performed on data without labels, i.e. we have a group of data points x_1, \dots, x_n (scalar or vector) and we want to find something out about how this data is structured.…
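To make the idea concrete, here is a minimal sketch of the standard alternating scheme usually used for K-means (often called Lloyd's algorithm). It assumes squared Euclidean distance as the loss and, purely to keep the example deterministic, initialises the centres with the first K observations; both are assumptions for illustration rather than choices taken from the post.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(points, k, n_iter=100):
    """Alternate between assigning points to centres and updating centres."""
    centres = list(points[:k])  # simplistic initialisation (an assumption)
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the algorithm has converged
        assignments = new_assignments
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignments

# Two visibly separated groups of made-up 2D points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centres, assignments = k_means(points, k=2)
print(centres, assignments)
```

Even though both starting centres sit in the same group here, the two steps pull one centre over to the second group within a couple of iterations. In practice the result can depend on the initialisation, so several random starts are often tried.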

By | September 7th, 2019|English|0 Comments

p-values (part 3): meta distribution of p-values

Introduction

So far we have discussed what p-values are and how they are calculated, as well as how bad experiments can lead to artificially small p-values. The next thing that we will look at comes from a paper by N.N. Taleb (1), in which he derives the meta-distribution of p-values i.e. what ranges of p-values we might expect if we repeatedly did an experiment where we sampled from the same underlying distribution.

The derivations are pretty in-depth, and both this content and the implications of the results are pretty new to me, so please point out and/or discuss any discrepancies or misinterpretations that you find.

Thankfully, in this video (2) there is an explanation that covers some of what the paper says as well as some Monte-Carlo simulations. My discussion will focus on some simulations of my own that are based on those that are done in the video.
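To give a flavour of what such a simulation looks like, here is a minimal sketch. It is not the code from the paper or the video: the test (a one-sided z-test with known standard deviation 1), the true mean of 0.5, the sample size of 10 and the number of repeats are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

def one_sided_p_value(sample):
    """p-value for H0: mu = 0 against H1: mu > 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 1 - std_normal.cdf(z)

def simulate_p_values(true_mean, n, repeats, seed=0):
    """Repeat the same experiment many times and collect the p-values."""
    rng = random.Random(seed)
    return [
        one_sided_p_value([rng.gauss(true_mean, 1) for _ in range(n)])
        for _ in range(repeats)
    ]

p_values = sorted(simulate_p_values(true_mean=0.5, n=10, repeats=5_000))
summary = {
    "min": p_values[0],
    "median": p_values[len(p_values) // 2],
    "max": p_values[-1],
    "share below 0.05": sum(p < 0.05 for p in p_values) / len(p_values),
}
print(summary)
```

Every repeat samples from exactly the same distribution, yet the p-values range from tiny to well above 0.5, which is the kind of spread the meta-distribution describes.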

What we are talking about

We have already discussed what p-values mean and how they can go wrong.…

By | September 5th, 2019|English, Level: intermediate|1 Comment

p-values (part 2): p-hacking. Why drinking red wine is not the same as exercising

What is p-hacking?

You might have heard about a reproducibility problem with scientific studies. Or you might have heard that drinking a glass of red wine every evening is equivalent to an hour’s worth of exercise.

Part of the reason that you might have heard about these things is p-hacking: ‘torturing the data until it confesses’. The reason for doing this is mostly pressure on researchers to find positive results (as these are more likely to be published), but it may also arise from misapplication of statistical procedures or bad experimental design.

Some of the content here is based on a more serious video from Veritasium: https://www.youtube.com/watch?v=42QuXLucH3Q. John Oliver has also spoken about this on Last Week Tonight, for those who are interested in some more examples of science that makes its way onto morning talk shows.

p-hacking can be done in a number of ways: basically anything that is done, either consciously or unconsciously, to produce statistically significant results where there aren’t any.…
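One common route to spurious significance is measuring many outcomes and reporting whichever one comes out significant. The sketch below simulates that under stated assumptions: each hypothetical study measures 20 unrelated outcomes with no real effect and applies a two-sided z-test with known standard deviation 1 at the 5% level. All of these numbers are made up for illustration.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

def two_sided_p_value(sample):
    """p-value for H0: mu = 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - std_normal.cdf(abs(z)))

def study_finds_something(rng, n_outcomes=20, n=20, alpha=0.05):
    """One study with n_outcomes unrelated outcomes, none with a real effect."""
    return any(
        two_sided_p_value([rng.gauss(0, 1) for _ in range(n)]) < alpha
        for _ in range(n_outcomes)
    )

rng = random.Random(2)  # arbitrary seed for reproducibility
n_studies = 2_000
false_positive_rate = sum(
    study_finds_something(rng) for _ in range(n_studies)
) / n_studies

# With 20 independent tests at the 5% level, the chance of at least one
# false positive is 1 - 0.95^20, which is about 0.64.
expected_rate = 1 - (1 - 0.05) ** 20
print(false_positive_rate, expected_rate)
```

So even with no effect anywhere, roughly two out of three such studies would have a "significant" result to report.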

By | September 2nd, 2019|English, Undergraduate|1 Comment

A quick argument for why we don’t accept the null hypothesis

Introduction

When doing hypothesis testing, an often-repeated rule is ‘never accept the null hypothesis’. The reason for this is that we aren’t making probability statements about true underlying quantities; rather, we are making statements about the observed data, given a hypothesis.

We reject the null hypothesis if the observed data is unlikely to be observed given the null hypothesis. In a sense we are trying to disprove the null hypothesis and the strongest thing we can say about it is that we fail to reject the null hypothesis.

That is because observing data that is not unlikely given that a hypothesis is true does not make that hypothesis true. That is a bit of a mouthful, but basically what we are saying is that if we make some claim about the world and then we see some data that does not disprove this claim, we cannot conclude that the claim is true.…
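A small simulation can make this point tangible. In the sketch below the null hypothesis (mean 0) is false by construction, since the data are drawn with mean 0.2, yet a two-sided z-test on 10 observations fails to reject it most of the time. The test, the effect size, the sample size and the seed are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

def rejects_null(sample, alpha=0.05):
    """Two-sided z-test of H0: mu = 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return p_value < alpha

rng = random.Random(3)  # arbitrary seed for reproducibility
true_mean = 0.2         # H0: mu = 0 is false by construction
n, repeats = 10, 5_000

fail_to_reject = sum(
    not rejects_null([rng.gauss(true_mean, 1) for _ in range(n)])
    for _ in range(repeats)
) / repeats
print(fail_to_reject)
```

Failing to reject happens in the large majority of runs even though the null is wrong every time, which is why 'fail to reject' cannot be read as 'accept'.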

By | August 28th, 2019|English, Level: Simple, Uncategorized, Undergraduate|0 Comments

p-values: an introduction (Part 1)

The starting point

This is the first of (at least) 3 posts on p-values. p-values are everywhere in statistics, especially in fields that require experimental design.

They are also pretty tricky to get your head around at first. This is because of the nature of classical (frequentist) statistics. So to motivate this I am going to talk about a non-statistical situation that will hopefully give some intuition about how to think when interpreting p-values and doing hypothesis testing.

My New Car

I want to buy a car. So I go down to the second-hand car dealership to get one. I walk around a bit until I find one that I like.

I think to myself: ‘this is a good car’. 

Now because I am at a second-hand car dealership I find it appropriate to gather some data. So I chat to the lady there (looks like a bit of a scammer, but I am here for a deal) about the car.…

By | August 21st, 2019|English, Level: Simple, Undergraduate|0 Comments

R-squared values for linear regression

What we are talking about

Linear regression is a common and useful statistical tool. You will almost certainly have come across it if your studies have presented you with any sort of statistical problems.

The pros of regression are that it is relatively easy to implement and that the relationship between inputs and outputs is linear (it’s in the name, but this simplifies the interpretation of the relationship significantly). On the downside, it relies fairly heavily on frequentist interpretation of probability (which is a little counterintuitive) and it’s very easy to draw erroneous conclusions from different models.

This post will deal with a measure of how good a model is: R^2. First, I will go through what this value means and what it measures. Then, I will discuss an example of how reliance on  R^2  is a dangerous game when it comes to linear models.
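For reference, here is a minimal sketch of how R^2 is usually computed for a simple linear regression, using the common definition R^2 = 1 - SS_res / SS_tot. That definition, the helper names and the data are assumptions made for this example and may differ in detail from the rest of the post.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(xs, ys):
    """Share of the variation in ys explained by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data lying close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(xs, ys)
print(r2)
```

Values near 1 mean the fitted line accounts for most of the variation in the observed outputs; on its own that says nothing about whether a linear model is appropriate.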

What you should know

Firstly, let’s establish a bit of context.…

By | August 18th, 2019|English, Undergraduate|1 Comment

The 2018 South African Mathematics Olympiad — Problem 6

The final round of the South African Mathematics Olympiad will be taking place on Thursday, 28 July 2019. I have been writing about some of the problems from the senior paper from 2018. A list of all of the problems can be found here.

Today we will look at the sixth and final problem from the 2018 South African Mathematics Olympiad:

Let n be a positive integer, and let x_1, x_2, \dots, x_n be distinct positive integers with x_1 = 1. Construct an n \times 3 table where the entries of the k-th row are x_k, 2x_k, 3x_k for k = 1, 2, \dots, n. Now follow a procedure where, in each step, two identical entries are removed from the table. This continues until there are no more identical entries in the table.

  1. Prove that at least three entries remain at the end of the procedure.
  2. Prove that there are infinitely many possible choices for n and x_1, x_2, \dots, x_n such that only three entries remain.

There are some heuristics that are often helpful when solving a problem, such as

  • Looking at small cases:

    This helps us to understand the problem and how the various pieces in the problem relate to each other.
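Small cases for this problem can also be explored by computer. The sketch below is only an exploration aid, not a proof; the function name and the search range are choices made for this example. It uses the observation that each step removes two equal entries, so a value is left over exactly when it appears an odd number of times in the table.

```python
from collections import Counter
from itertools import combinations

def remaining_entries(xs):
    """Entries left once no two identical entries remain in the table."""
    table = [m * x for x in xs for m in (1, 2, 3)]  # rows x_k, 2x_k, 3x_k
    counts = Counter(table)
    # Removing pairs leaves one copy of every value with an odd count.
    return sorted(v for v, c in counts.items() if c % 2 == 1)

print(remaining_entries([1]))     # [1, 2, 3]
print(remaining_entries([1, 2]))  # [1, 3, 4, 6]

# Brute-force search: x_1 = 1 plus up to four other values from 2..10.
smallest = min(
    len(remaining_entries((1,) + extra))
    for size in range(0, 5)
    for extra in combinations(range(2, 11), size)
)
print(smallest)
```

Playing with different choices of x_2, \dots, x_n in this way is a quick route to spotting which values tend to cancel and which always survive.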

By | July 23rd, 2019|Competition, English|1 Comment