Much of this content is based on lecture slides by Professor David Barber at University College London; related resources can be found at: www.cs.ucl.ac.uk/staff/D.Barber/brml

The K-means algorithm

The K-means algorithm is one of the simplest unsupervised learning* algorithms. Its aim is, given a set of observations $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$, to partition these observations into K groups in the best way possible (‘best way’ here means minimising a loss/cost/objective function).

This is a clustering algorithm, where we want to assign each observation to a group that has other similar observations in it. This could be useful, for example, to split Facebook users into groups that will each be shown a different advertisement.

* unsupervised learning is performed on data without labels, i.e. we have a set of data points $\mathbf{x}_1, \dots, \mathbf{x}_n$ (scalar or vector) and we want to find out something about how this data is structured. Compare this to supervised learning, where we have some observation, $\mathbf{x}$, and some value that we want to predict, $y$.

How it works

Our goal is to assign each point, $\mathbf{x}_i$, to some cluster, c(i), such that the total distance from the points to the centres of their respective clusters is minimised. That is, we want to find both the assignments (which cluster a point goes to) and the centres (the mean of the points in each cluster, which we will call $\mathbf{m}_j$) that minimise:

$\mathcal{L} = \displaystyle \sum_{i=1}^{n} ||\mathbf{x}_i - \mathbf{m}_{c(i)}||_2^2$

That is, the total distance from the points to the centres of the clusters to which they are assigned.
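As a quick sanity check, this loss can be computed directly with NumPy. The points, centres and assignments below are made up purely for illustration:

```python
import numpy as np

# Toy data: 5 points in 2D, 2 cluster centres, and an assignment c(i) per point
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centres = np.array([[0.5, 0.5], [5.5, 5.0]])
assignments = np.array([0, 0, 0, 1, 1])  # c(i) for each point

# L = sum over i of the squared Euclidean distance from x_i to m_{c(i)}
loss = np.sum(np.sum((X - centres[assignments]) ** 2, axis=1))
print(loss)  # total squared distance for this toy configuration
```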

To do this we:

• randomly initialise the centre vectors
• assign each point, $\mathbf{x}_i$ to the nearest centre, c(i) (i.e. the centre that has the smallest Euclidean distance to the point)
• recalculate the centre vectors as the average of all of the points assigned to that centre
• repeat until convergence
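The steps above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation; the optional `init` argument and the empty-cluster handling are my own additions for robustness, not part of the algorithm as stated:

```python
import numpy as np

def kmeans(X, k, init=None, n_iters=100, seed=0):
    """Minimal K-means sketch. `init` optionally fixes the starting centres."""
    rng = np.random.default_rng(seed)
    # Randomly initialise the centres as k distinct data points (or use `init`)
    centres = X[rng.choice(len(X), size=k, replace=False)] if init is None else init.astype(float)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of its assigned points
        # (a centre with no assigned points keeps its old position)
        new_centres = centres.copy()
        for j in range(k):
            if np.any(assignments == j):
                new_centres[j] = X[assignments == j].mean(axis=0)
        if np.allclose(new_centres, centres):  # converged: centres stopped moving
            break
        centres = new_centres
    return centres, assignments
```

Note that, as with any K-means implementation, the result depends on the initial centres; the algorithm only finds a local minimum of the loss.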

This is a simple application of the more general Expectation Maximisation (EM) algorithm.

Choosing the number of clusters

We want to minimise the squared distance between the points and the centres of their respective clusters. We could do this trivially by assigning each point to its own cluster: each centre would then be the point itself and the loss would be 0. But that defeats the point a little, so we will look at a heuristic for choosing the best number of clusters.

We use the ‘elbow method’ (bad name). This entails looking at a plot of $\mathcal{L}$ against K. The ‘elbow’ is the point at which we stop seeing significant improvement in the loss function.

In this example, we can see that at K = 3 the loss curve becomes noticeably less steep, so 3 looks like a good number of clusters.
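One way to produce such a plot is to fit K-means for a range of K values and record the loss each time. The sketch below assumes scikit-learn is installed and uses synthetic blobs rather than the post's data (scikit-learn calls the loss `inertia_`):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2D blobs (illustrative, not the post's dataset)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in [(0, 0), (5, 5), (10, 0)]])

# Fit K-means for each K and record the loss at convergence
ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
# Plotting inertias against ks should show an 'elbow' at K = 3 for this data
```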

Some shortcomings

• It is not probabilistic: other methods can assign a point to a cluster with some probability, $p$
• It is sensitive to outliers: squaring the distances between points and centres produces large values when a point is far from a centre, which can lead to bad assignments
• For data with shapes that are less obvious than those shown in our example, this method may perform poorly

A worked example

This worked example is based partially on a post from Jake VanderPlas’s Python Data Science Handbook (1). It aims to be completely transparent about the process and the maths behind the algorithm. The notebook can be found on my Gitlab page (3), with some more details on how this is done.

First we need to generate some data. I have done this by sampling from 4 different 2D Gaussian distributions (each distribution is indicated by a different colour). This is the true underlying clustering of the data, i.e. the points in each group are generated by a different underlying process and we want to cluster them accordingly.
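A sketch of this data-generation step, assuming NumPy; the means and covariance below are made up for illustration and need not match the plots:

```python
import numpy as np

rng = np.random.default_rng(42)
# Four 2D Gaussians; each one plays the role of a 'true' underlying cluster
means = [(0, 0), (6, 0), (0, 6), (6, 6)]
X = np.vstack([rng.multivariate_normal(m, 0.8 * np.eye(2), size=50) for m in means])
true_labels = np.repeat(np.arange(4), 50)  # which Gaussian generated each point
```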

Now we initialise the centres (red stars). These points don’t look like good centres for our clusters, but we have to start somewhere. The points are all blue now to emphasise that we do not really know which cluster they belong to; this is the form in which we will get real data.

Now we assign each point to the nearest centre (star) and colour code it accordingly. This is our first clustering. At this point things look promising: we have split our data into 4 groups as desired. The catch, however, is that those 4 clusters are far from optimal. So the next step is to move each centre to the average of its current cluster and then reassign the points to the nearest of the new centres.

Below is the state of the algorithm after one recalculation of centres and one reassignment (one full iteration of the EM algorithm). The stars in the below plot are the averages of the points in the above plot that have matching colours. The colour assignment is based on which of these stars each point is closest to.

We keep doing this until our algorithm has converged, i.e. until points are always assigned to the same means.

Looking at this plot, the stars are pretty much bang in the middle of the points assigned to them. This is not perfect (some points belong to different clusters in the first plot than they do in the last) but it is the best the algorithm could do with these initial centres and this dataset.

A more interesting example

Now that we have visualised the algorithm, we can look at an example that is slightly more interesting. For this we will use the scikit-learn implementation. The problem will be based on the FIFA19 dataset that can be found on Kaggle (reference 2).

Here we want to group players based on their attributes (strength, pace, shooting, tackling, passing, goalkeeping etc.). We train on the first 2000 players in the dataset (the dataset seems to be roughly sorted by how famous the players are) and split them into 4 clusters (the elbow plot above belongs to this problem, so 3 might be better, but I wanted 4 because of how football works).
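The scikit-learn usage might look like the following. Since the FIFA19 attribute matrix is not reproduced here, a random stand-in matrix of the same shape takes its place:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the FIFA attribute matrix: one row per player, one column per
# numeric attribute (the real post uses columns like pace, shooting, etc.)
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 100, size=(2000, 10))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
labels = kmeans.labels_             # cluster assignment for each training player
centres = kmeans.cluster_centers_   # one centre vector per cluster
```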

The first 6 players in each cluster are:

Cluster 0: D. Godín, G. Chiellini, K. Koulibaly, Piqué, R. Varane, L. Bonucci

Cluster 1: L. Messi, Cristiano Ronaldo, Neymar Jr, E. Hazard, L. Suárez, R. Lewandowski

Cluster 2: De Gea, J. Oblak, M. ter Stegen, T. Courtois, M. Neuer, H. Lloris

Cluster 3: K. De Bruyne, L. Modrić, Sergio Ramos, T. Kroos, N. Kanté, Sergio Busquets

Alright, that looks good! The clusters, in order, are: defenders, attacking players, goalkeepers and midfielders. The only exception here is Sergio Ramos, a defender in the midfielder group, but he is Spanish so maybe FIFA think he is good at passing the ball, making him more midfielder-like.

So it looks like our K-means works!

What about a new player? Percy Tau? His attributes, according to K-means, are closest to cluster 1 (because he’s basically Messi). Darren Keet? Cluster 2, with the goalkeepers.

This shows that we can not only use this algorithm to group players from the training set but we can also predict which cluster a new observation will fit into. The sanity check here is that we already roughly know which players are similar (either because we watch football or because people usually tell you which position they play).
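Predicting the cluster of a new observation uses the fitted model's `predict` method, which assigns it to the nearest learned centre. The data below is synthetic, standing in for the player attribute vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight synthetic 'attribute' clusters; purely illustrative values
X = np.vstack([rng.normal(20, 1.0, size=(30, 5)),   # e.g. goalkeeper-like profiles
               rng.normal(80, 1.0, size=(30, 5))])  # e.g. attacker-like profiles
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new, unseen observation: predict() returns the index of the nearest centre
new_player = np.array([[78.0, 81.0, 79.0, 80.0, 82.0]])
cluster = kmeans.predict(new_player)[0]
```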

Summary

This post served to introduce K-means, work through a simple visual example of the algorithm and apply it to a problem that was a little bit of fun while showing how the algorithm might be used in practice.

1. https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html