This post is based on a (very small) part of the (dense and technical) paper Fooled by Correlation by N.N. Taleb, found at (1)

Notes on the main ideas in this post are available from Universidad de Cantabria, found at (2)

The aims of this post are to 1) introduce mutual information as a measure of similarity and 2) show the nonlinear relationship between correlation and information by means of a relatively simple example.

Introduction

A significant part of statistical analysis is understanding how random variables are related – how much knowledge about the value of one variable tells us about the value of another. This post will consider this issue in the context of Gaussian random variables. More specifically, we will compare, and discuss the relationship between, correlation and mutual information.

Mutual Information

The mutual information between two random variables is the amount of information that one gains about one variable by observing the value of the other. It is linked to entropy – a key concept in Information Theory, which can be thoughtted of as how surprised we are on average when observing a random variable (as discussed in a previous post). Mutual information, like entropy, is measured in bits when base-2 logarithms are used, or in nats when natural logarithms are used (as in the calculations below). It is considered more general than correlation, as it handles nonlinear dependencies and discrete random variables.

The mutual information between random variables $X_1$ and $X_2$ is defined as: $MI(X_1;X_2) = D_{KL} (P(X_1,X_2)||P(X_1) \otimes P(X_2))$

Where $P(X_1,X_2)$ is the joint and $P(X_1)$ and $P(X_2)$ are the marginal distributions of the random variables in question. $D_{KL}$ is the Kullback–Leibler divergence between (in this case bivariate) probability distributions. We won't go into the details here, but in general it is a measure of how much one distribution differs from another.

This essentially means that the mutual information is comparing the distance* between the joint distribution and the product of the marginal distributions. If this value is 0, the joint distribution and the product of the marginal distributions are the same. This checks out intuitively, as it would mean that the random variables are statistically independent, and therefore that knowing $X_1$ gives us no information about $X_2$. At the other extreme, a high value of the mutual information means that the joint and the product of the marginal distributions are very different – a lot of information is lost by treating the random variables as independent, and therefore we can conclude that the values of each random variable are related to one another.
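This definition is easy to check numerically. The sketch below (in Python with numpy, using a small made-up 2×2 joint distribution, not from the original post) computes the KL divergence between a discrete joint distribution and the product of its marginals, which is exactly the mutual information:

```python
import numpy as np

# A hypothetical 2x2 joint distribution P(X1, X2) with clear dependence
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p1 = joint.sum(axis=1)        # marginal P(X1)
p2 = joint.sum(axis=0)        # marginal P(X2)
indep = np.outer(p1, p2)      # product of marginals P(X1) x P(X2)

# D_KL(joint || product of marginals), in nats (natural log)
mi = np.sum(joint * np.log(joint / indep))
print(mi)  # strictly positive, since the variables are dependent
```

For an independent joint distribution (e.g. all four entries equal to 0.25), the same computation returns 0, matching the intuition above.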

Example: Gaussian Random Variables

Let’s have a look at an example of mutual information. This example is chosen 1) because it’s tractable and 2) because we can find $MI(X_1;X_2)$ in terms of the usual measure of similarity, the correlation coefficient $\rho$.

If $\mathbf{x} = [X_{1}, X_{2}]$ follows a bivariate Gaussian distribution, with mean vector $\mathbf{0}$, unit variance for both variables and a correlation coefficient $\rho$, we write the pdf as: $f_{\mathbf{x}} (\mathbf{x}) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp[-\frac{1}{2} (\mathbf{x}^T\Sigma^{-1} \mathbf{x})]$

where $\Sigma = \left [ {\begin{array}{cc} 1 & \rho \\ \rho & 1 \\ \end{array} } \right]$
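As a quick sanity check, the density above can be evaluated directly (a sketch in numpy; the choice $\rho = 0.7$ is arbitrary, not from the original post):

```python
import numpy as np

rho = 0.7  # example correlation (arbitrary choice)
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

def pdf(x):
    # Zero-mean bivariate Gaussian density, as written above
    det = np.linalg.det(Sigma)
    quad = x @ np.linalg.inv(Sigma) @ x
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(det))

print(pdf(np.array([0.0, 0.0])))  # peak density at the mean
```

At the mean the exponent vanishes, so the peak density is $1/(2\pi\sqrt{1-\rho^2})$.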

To calculate the MI we need the following results, shown without proof:

• $I(X_1;X_2) = H(X_1) + H(X_2) - H(X_1,X_2)$, where $H(\cdot)$ is the entropy of a random variable
• If $X$ follows a standard Gaussian distribution, $H(X) = \frac{1}{2} \log(2\pi e)$
• For the bivariate Gaussian considered above, $H(X_1,X_2) = \frac{1}{2} \log((2\pi e)^2(1-\rho^2))$

This means that our mutual information can be written as: $I(X_1;X_2) = \frac{1}{2} \log(2\pi e) + \frac{1}{2} \log(2 \pi e) - \frac{1}{2} \log((2 \pi e)^2(1- \rho^2)) = - \frac{1}{2} \log(1- \rho^2)$
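The cancellation above is easy to verify numerically (a sketch in numpy using natural logarithms, so the result is in nats; $\rho = 0.7$ is an arbitrary example value):

```python
import numpy as np

rho = 0.7  # example correlation (arbitrary choice)

# Closed-form entropies of the standard bivariate Gaussian, in nats
H1 = 0.5 * np.log(2 * np.pi * np.e)                           # H(X1)
H2 = 0.5 * np.log(2 * np.pi * np.e)                           # H(X2)
H12 = 0.5 * np.log((2 * np.pi * np.e) ** 2 * (1 - rho ** 2))  # H(X1, X2)

# I(X1; X2) = H(X1) + H(X2) - H(X1, X2) should equal -0.5 * log(1 - rho^2)
mi = H1 + H2 - H12
print(mi, -0.5 * np.log(1 - rho ** 2))  # the two expressions agree
```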

Relationship with Correlation

We can see from the previous line that mutual information can be thought of as a function of correlation. Importantly, it is a non-linear function of correlation.

Figure 1: information as a function of correlation. As can be clearly seen, the amount of information that we get about one random variable when we observe the other increases slowly at first, then much faster around $\rho = 0.8$, and then even more significantly around $\rho = 0.9$. As $\rho \to 1$ the amount of information tends to infinity – if random variables are perfectly correlated then we know everything about one of them by observing the other one.

The plots below give some further insight into how the strength of the association does not increase linearly with the value of $\rho$. Here the value of $\rho$ has been varied from 0 (no correlation) to 1 (perfect correlation), and the red lines are straight lines through the origin. Clearly, as the value of $\rho$ increases, the values of $X_1$ and $X_2$ are more closely associated. The important thing to note here, however, is how quickly the cloud of points starts resembling a straight line (i.e. the progression from no correlation to perfect correlation). In (1) the author points out that a correlation of 0.5 is closer to 0 than it is to 1. Similarly, a correlation of 0.2 is visually closer to 0 than one of 0.7 is to 0.9.

We can go further than this by comparing the relative levels of information as we increase the value of $\rho$.
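The scatter plots described above can be reproduced along these lines (a sketch in numpy; the sample size and the three example values of $\rho$ are arbitrary choices, and the plotting itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from the bivariate Gaussian above for a few values of rho;
# the sample correlation of each cloud of points tracks the target rho.
for rho in [0.0, 0.5, 0.9]:
    cov = [[1.0, rho], [rho, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=20000)
    print(rho, np.corrcoef(x[:, 0], x[:, 1])[0, 1])
```

Plotting `x[:, 0]` against `x[:, 1]` for each $\rho$ gives the point clouds discussed above.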

Consider MI($\rho$) as defined above (in nats, since we are using natural logarithms) at $\rho = 0.5, 0.6, 0.8, 0.9$.

MI(0.5) = 0.14

MI(0.6) = 0.22

So an increase of 0.1 in correlation leads to 1.55 times as much information.

MI(0.8) = 0.51

MI(0.9) =0.83

So an increase of 0.1 in correlation leads to roughly 1.6 times as much information.

MI(0.99) = 1.96, which means that moving from 0.9 to 0.99 gives 2.36 times as much information.

This is more in line with the intuition behind the above plots – at higher values of the correlation coefficient, the mutual information between the random variables increases significantly.

Conclusion

Mutual information is a fundamental and interesting way of thinking about how two random variables are related. In the case of the bivariate Gaussian distribution it is a highly non-linear function of the usual measure of how random variables are related, i.e. correlation.

We have also seen that very high correlations are much more informative than high correlations, which in turn are much more informative than lower-valued correlations (which tell us very little).

*KL divergence is not a proper distance, as it is not symmetric and does not satisfy the triangle inequality.