In both supervised and unsupervised machine learning, most algorithms are centred around minimising (or, equivalently, maximising) some objective function. This function is supposed to represent what the model knows or can get right. In practice, however, the objective function does not always reflect exactly what we want.

The objective function presents two main problems: (1) how do we minimise it? (The answer is up for debate, and there is a lot of interesting research on the efficient optimisation of non-convex functions.) (2) Assuming we can minimise it perfectly, is it the correct thing to be minimising?

It is point 2 which is the focus of this post.

Let’s take the example of linear regression with a square loss. We train a linear regression model by minimising the square loss $\mathcal{L}(\mathbf{w})=\sum_i (y_i - \mathbf{w}^Tx_i)^2$, where the prediction for each observation is the inner product of the learned weights $\mathbf{w}$ with that observation's feature vector $x_i$.
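As a minimal sketch of what this loss computes, the snippet below evaluates it for two candidate weight vectors. The function names and the toy data are made up for illustration; each feature vector is assumed to carry a leading 1 so that the first weight acts as an intercept.

```python
def predict(w, x):
    """Inner product of the weights with one observation's feature vector."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def square_loss(w, X, y):
    """Sum of squared differences between actual and predicted outcomes."""
    return sum((y_i - predict(w, x_i)) ** 2 for x_i, y_i in zip(X, y))


# Toy data (made up): feature vectors are (1, x), outcomes follow y = 1 + 2x exactly.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y = [1.0, 3.0, 5.0, 7.0]

loss_exact = square_loss((1.0, 2.0), X, y)  # weights that reproduce y exactly
loss_shifted = square_loss((0.0, 2.0), X, y)  # intercept off by 1 on all four points

print(loss_exact)    # 0.0
print(loss_shifted)  # 4.0
```

Training the model amounts to searching for the weights that make this number as small as possible.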

Let’s take stock of the situation:

What we want: For our model to be as correct as possible

What we have asked for: For the total squared difference between the actual and predicted values in our data to be as small as possible

Clearly, these are different. But they are not unrelated: often they lead to materially the same solution. However, there are cases where they don’t. For example, the square loss of a really good solution (in terms of being close to the ‘truth’) can be increased significantly by one or more outliers. The square loss is sensitive to outliers because the distance between the predicted value and the true value is squared, so a large error contributes disproportionately more to the total than a small one.
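To make the effect of squaring concrete, here is a small sketch with hypothetical residuals (three small errors and one outlier, chosen purely for illustration). It compares how much of the total loss the outlier accounts for when errors are squared versus when only their absolute value is taken.

```python
# Hypothetical residuals: three small errors and one outlier.
residuals = [1.0, -1.0, 1.0, 10.0]

squared = [r ** 2 for r in residuals]
absolute = [abs(r) for r in residuals]

# Fraction of each total loss that is due to the outlier alone.
outlier_share_squared = squared[-1] / sum(squared)      # 100 / 103
outlier_share_absolute = absolute[-1] / sum(absolute)   # 10 / 13

print(f"square loss:   {sum(squared):.0f}, outlier share {outlier_share_squared:.0%}")
print(f"absolute loss: {sum(absolute):.0f}, outlier share {outlier_share_absolute:.0%}")
```

In this toy case the single outlier makes up roughly 97% of the square loss but roughly 77% of the absolute loss, so an optimiser working on the square loss has a much stronger incentive to move the fit towards it.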

As an alternative we might minimise the absolute value of the errors i.e. $\mathcal{L} (\mathbf{w}) = \sum_i |y_i - \mathbf{w}^Tx_i|$.

Fig. 1 Best fit lines for Mean Absolute Deviation (MAD) and Ordinary Least Squares (OLS) regression.

Fig. 2 Best fit lines for Mean Absolute Deviation (MAD) and Ordinary Least Squares (OLS) regression in the presence of outliers.

As can be seen from Fig. 2, the presence of the two outliers in the bottom right-hand corner has caused the OLS line to change visibly, whereas the MAD solution is more robust. (Of course, this is a contrived example: the outliers are easy to spot and could be handled in other ways.)

Here, choosing MAD as the objective function improves the model's chances of generalising to points with x-values above 100, or to points not present in the training set.
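The comparison can be reproduced in spirit with a small sketch. The data below are made up (they are not the data behind the figures), and the function names are hypothetical. The absolute-error fit uses a brute-force search that relies on the fact that, for a straight line, some optimal solution passes through two of the data points; this is fine for a toy example but is not how one would fit such a model at scale.

```python
from itertools import combinations


def fit_ols(xs, ys):
    """Closed-form least-squares line y = a + b*x for a single feature."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def absolute_loss(a, b, xs, ys):
    """Sum of absolute differences between actual and predicted outcomes."""
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def fit_mad(xs, ys):
    """Brute-force absolute-error line: try the line through every pair of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b*x
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = absolute_loss(a, b, xs, ys)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]


# Made-up data: ten points on y = 1 + 2x, plus two outliers at the bottom right.
xs = [float(x) for x in range(10)] + [10.0, 11.0]
ys = [1.0 + 2.0 * x for x in range(10)] + [0.0, 0.0]

a_ols, b_ols = fit_ols(xs, ys)
a_mad, b_mad = fit_mad(xs, ys)

print(f"OLS: intercept {a_ols:.3f}, slope {b_ols:.3f}")
print(f"MAD: intercept {a_mad:.3f}, slope {b_mad:.3f}")
```

On this toy data the absolute-error fit stays on the line through the ten clean points, while the least-squares slope is pulled well below 2 by the two outliers. In practice one would typically use a library solver (the same problem can be posed as a linear program, or as quantile regression at the median) rather than enumerating pairs.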

This extends to more complex models, and the choice of objective function can affect the model in different ways:

• When the Wasserstein GAN (Generative Adversarial Network) was introduced, it was claimed that its loss function correlated more closely with sample fidelity than the loss of Goodfellow’s original GAN.
• Using the likelihood function in place of a square loss function when training an autoencoder makes part of the optimisation problem (minimising the loss with respect to the weights and biases in the last layer) convex, and therefore easier to solve.
• The $\beta$-VAE adjusts the objective function of a standard VAE to encourage the learning of disentangled representations.
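
For the $\beta$-VAE, the adjustment is small enough to write down. In the notation commonly used for VAEs (an encoder $q_\phi(z \mid x)$, a decoder $p_\theta(x \mid z)$ and a prior $p(z)$; these symbols are introduced here for illustration and are not defined elsewhere in this post), the objective to be maximised is usually presented as:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta \, D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

Setting $\beta = 1$ recovers the standard VAE objective, while $\beta > 1$ puts more weight on keeping the encoder's distribution close to the prior. The model and the data are unchanged; only the thing being optimised is different.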

As usual, for those in data science/applied machine learning, the consequences and risks of the decision need to be taken into account. If our model is inappropriate due to the loss function being misspecified, there may be unintended consequences that come from model predictions (even if the ‘fit’ of the model is good).
