In this post we are going to create a simple reverse image search on the MNIST handwritten digit dataset. That is to say, given any image, we want to return the images that look most similar to it. To do this, we will use an autoencoder, trained using TensorFlow 2.

The dataset

The MNIST dataset is a commonly used dataset in machine learning, consisting of 28-by-28 images of handwritten digits between 0 and 9. For our purposes we want our image searcher to return images of the same number as the query image, i.e. if we input a 3 we want the images returned to all be 3s. That said, if we got, say, four 3s and one 2 that might not be too bad, considering how similar a 2 and a 3 can look. However, if we got three 3s, a 1 and a 7 we might say that the performance is not up to standard.
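As a rough sketch, the dataset can be loaded with the loader that ships with TensorFlow 2, scaled to the unit interval and flattened into 784-dimensional vectors (the variable names here are our own, not from the original post):

```python
import numpy as np
from tensorflow.keras.datasets import mnist

# Load the standard MNIST train/test split
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel values to [0, 1] and flatten each 28x28 image to a 784-vector
x_train = x_train.astype("float32").reshape(-1, 784) / 255.0
x_test = x_test.astype("float32").reshape(-1, 784) / 255.0
```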

What is an autoencoder?

An autoencoder is a type of neural network that aims to learn a lower-dimensional representation of a dataset. It is made up of two parts: the encoder and the decoder. The former is trained to learn a representation of the input image in a 'latent' space of lower dimension than the input space, whereas the latter is trained to reproduce the original image from its latent representation.

The encoder network can be seen as a dimensionality reduction algorithm, similar to Principal Component Analysis (PCA) – with enough data and representational capacity, a neural network can outperform PCA, as it allows for non-linearity in the encoding and decoding.

The autoencoder is trained to reproduce the input image – the objective function during training is the reconstruction error (i.e. how closely the reconstructed image approximates the input image). This sounds a bit pointless at first, because we already have the image, but it allows us to train the encoder network to learn a meaningful representation of the image, which is what we will use for our downstream application – in this case, finding similar-looking images.
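The post does not spell out the network itself, but a minimal dense autoencoder in TensorFlow 2 might look like the following. The layer sizes and names are illustrative assumptions – only the 32-dimensional latent space is taken from the post:

```python
import tensorflow as tf

LATENT_DIM = 32  # size of the latent space

# Encoder: compress the 784-dim input down to a 32-dim latent vector
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(LATENT_DIM, activation="relu"),
])

# Decoder: expand the latent vector back up to a 784-dim reconstruction
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])

# The autoencoder chains the two; note that the training target is the
# input itself, which is what makes this an autoencoder
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# Training would then look something like:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)
```

The mean squared error between the input and the reconstruction plays the role of the reconstruction error described above.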

We use an autoencoder here for a couple of key reasons:

  • the representations learned will have more 'signal' and less 'noise' than the input images – for example, a number 3 may be written with an extra line coming out of the top (if the handwriting is untidy), but during the compression into a lower-dimensional space this will be discarded by the network
  • numbers that are the same may be very different in the input space. Consider an extreme example: view the image as a binary 28×28 grid, where 1 indicates that there is writing and 0 indicates otherwise. Imagine one image with a number 1 written down the left of the grid, and another with a number 1 written down the right-hand side. The distance between these in Euclidean space will be large, as there is no overlap between the images. However, an encoder might compress the shape of the number 1 – learning that two inputs with a single non-zero column are similar – which would bring the two images closer together in the learned representation
  • finally, finding nearest neighbours in a 32-dimensional space is faster than in a 784-dimensional space (even more so when high-resolution or colour images are used)
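To make the second point above concrete, here is a small NumPy sketch: two binary 28×28 grids, each containing a single vertical stroke, one on the far left and one on the far right. The strokes never overlap, so every non-zero pixel contributes to the pixel-space distance:

```python
import numpy as np

# Two binary 28x28 "images": a vertical stroke on the left vs. on the right
left_one = np.zeros((28, 28))
left_one[:, 0] = 1.0
right_one = np.zeros((28, 28))
right_one[:, 27] = 1.0

# Euclidean distance in pixel space: with no overlap, all 56 non-zero
# pixels contribute, giving sqrt(28 + 28)
pixel_dist = np.linalg.norm(left_one.ravel() - right_one.ravel())
print(pixel_dist)  # sqrt(56) ≈ 7.48
```

A good encoder would map both of these much closer together in the latent space, since they represent the same digit.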


Here is an example of an original image (size 784) and the image reconstructed by the network from its latent representation (size 32). As you can see, it is not exactly the same image, but we can clearly make out the shape of the 8:


[Images: original 8 (left) and its reconstruction (right)]

The Process

The process that we will follow to build this image search is as follows:

  • Split the data into training and test sets
  • Train the autoencoder networks to reproduce the original images
  • Encode the test set using the encoder network – this gives us a low-dimensional representation of all the images that the autoencoder has not 'seen' during training
  • Choose a random query image from the test set
  • Find the 4 nearest* images to it in the latent space
  • Return those

*'nearest' in this case means the images that have the smallest Euclidean distance to the query image
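The search step itself is just a Euclidean nearest-neighbour lookup over the latent vectors. A minimal sketch, assuming a trained encoder and the encoded test set from earlier (the function and variable names are our own):

```python
import numpy as np

def nearest_neighbours(query_code, codes, k=4):
    """Return the indices of the k codes closest (in Euclidean distance)
    to query_code."""
    dists = np.linalg.norm(codes - query_code, axis=1)
    return np.argsort(dists)[:k]

# Usage with the encoder from earlier (hypothetical names):
# codes = encoder.predict(x_test)                  # (10000, 32) latent vectors
# query = codes[np.random.randint(len(codes))]     # random query image
# idx = nearest_neighbours(query, codes, k=5)[1:]  # drop the query itself
```

Note that if the query image is itself one of the encoded images, it will always be its own nearest neighbour (distance zero), so we ask for one extra neighbour and discard the first.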

An example

Here we give an example of a query image, and the nearest images to it.

Query image:

[Image: a handwritten 6]

4 nearest images in latent space:

[Images: four handwritten 6s]

Query image:

[Image: a handwritten 8]

4 nearest images in latent space:

[Images: four handwritten 8s]

As we can see, the autoencoder has learned the latent representations well – the images returned correspond to the same numbers as the query images.

However, this is by no means perfect; consider the following failure case:

Query image:

[Image: a handwritten 5]

Closest 4 images:

[Images: the four nearest neighbours, including some 8s]


Here we see that the bottom of the 5 in the query image is closed, similar to what we would see in an 8. This could explain the 8s showing up among the nearest images. The exact reason for a failure like this is difficult to pinpoint, but it is interesting to note the similarity in features between the query image and the returned nearest neighbours.



In summary: we have trained an autoencoder to compress and reproduce handwritten digit images, and have used these compressed representations to find images similar to a query image by comparing distances in the latent space.

If you enjoyed this post, please rate it and leave any comments or suggestions. 
