A significant amount of focus in statistics is on making inferences about the averages or means of phenomena. For example, we might be interested in the average number of goals scored per game by a football team, the average global temperature, or the average cost of a house in a particular area.

The two types of averages that we usually focus on are the sample mean from a set of data and the expectation that comes from a probability distribution. For example, if three men weigh 70kg, 80kg, and 90kg respectively, then the sample mean of their weights is \bar x = \frac{70+80+90}{3} = 80. Alternatively, if we say that the arrival times of trains are exponentially distributed with parameter \lambda = 3, we can use the properties of the exponential distribution to find the mean (or expectation). In this case the mean is \mu = \frac{1}{\lambda} = \frac{1}{3}.
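To make the contrast concrete, here is a minimal Python sketch using only the standard library. The variable names and the seeded simulation are our own illustrative choices; the simulated average only approximates the expectation 1/3, it is not a derivation of it.

```python
import random
from statistics import mean

# Sample mean: computed directly from observed data.
weights_kg = [70, 80, 90]
sample_mean = mean(weights_kg)

# Expectation: a property of the distribution itself, here Exponential(rate = 3).
rate = 3.0
expectation = 1.0 / rate

# Illustrative check (assumption: 100,000 simulated draws with a fixed seed).
rng = random.Random(0)
draws = [rng.expovariate(rate) for _ in range(100_000)]
simulated_mean = mean(draws)

print(sample_mean, expectation, simulated_mean)
```

The sample mean depends on the particular data we happened to collect, whereas the expectation is fixed once the distribution and its parameter are fixed.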

It is this second kind of mean (which we will call the expectation from now on), along with its generalisation to expectations of functions of random variables, that we will focus on.

Expected values

Alright, it is now a good time to define what the expectation operator does.

Say we have a discrete random variable, X, that follows a distribution with probability mass function p. Then the expectation of X is:

\mathbb{E}[X] = \displaystyle \sum_x x p(x)

For a continuous random variable, we have that the expectation operator is now an integral (limit of a sum)* and the probability mass function is replaced by a probability density function, f:

\mathbb{E}[X] = \displaystyle \int_x x f(x) dx.

From these definitions, it is clear to see that we are taking the probability-weighted average of the random variable. That is, each value of X contributes an amount to the expectation that is determined by both the size of the random variable itself and the probability of that value being observed.
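The two definitions can be sketched in a few lines of Python. The helper names `expectation_discrete` and `expectation_continuous`, the fair die, and the uniform density on [0, 1] are assumptions chosen purely for illustration. The continuous version approximates the integral with a midpoint Riemann sum, which also shows the 'limit of a sum' idea in practice.

```python
def expectation_discrete(pmf):
    """Probability-weighted sum; pmf maps each value x to its probability p(x)."""
    return sum(x * p for x, p in pmf.items())

def expectation_continuous(pdf, lower, upper, steps=100_000):
    """Approximate the integral of x * f(x) over [lower, upper] with a midpoint sum."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th slice
        total += x * pdf(x) * width    # value times probability of the slice
    return total

# Hypothetical discrete example: a fair six-sided die.
die = {face: 1.0 / 6.0 for face in range(1, 7)}
die_mean = expectation_discrete(die)

# Hypothetical continuous example: uniform density f(x) = 1 on [0, 1].
uniform_mean = expectation_continuous(lambda x: 1.0, 0.0, 1.0)

print(die_mean, uniform_mean)
```

In both functions each value is multiplied by how likely it is, and the products are added up; the only difference is whether the weights are probabilities or small slices of density.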

For the sake of interpreting these values, the ‘expected value’ is a somewhat misleading name. Let’s say, for example, you are planning to go out to an event: entrance is R60 before 9pm and R100 after 9pm. You are meeting your friends beforehand and will travel to the event together. You believe that there is a 30% chance of arriving before 9pm and a 70% chance of arriving after 9pm. The expected value of X, the amount that you pay for entrance, is calculated as follows:

\mathbb{E}[X] = 60(0.3) + 100(0.7) = 88

That’s fine and correct, but it’s pretty odd, since you in no way expect to pay R88 for the ticket: R88 is not even one of the possible prices. If you were asked ‘what do you expect to pay?’, you might answer ‘well, we might get there early but most likely I will end up paying R100, so I expect to pay that’. It is important to keep this distinction in mind.
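A short simulation makes the distinction visible. This is only a sketch: the number of simulated nights and the random seed are arbitrary assumptions, and the variable names are ours.

```python
import random

# Entrance price (in rand) and the probability of paying it.
prices = {60: 0.3, 100: 0.7}

# Expected value: the probability-weighted average of the prices.
expected_price = sum(price * prob for price, prob in prices.items())

# The single most likely outcome, which is what we would 'expect' in everyday speech.
most_likely_price = max(prices, key=prices.get)

# Simulate many nights out (assumption: 100,000 independent nights, fixed seed).
rng = random.Random(1)
payments = rng.choices(list(prices), weights=list(prices.values()), k=100_000)
average_payment = sum(payments) / len(payments)

print(expected_price, most_likely_price, average_payment)
print(88 in payments)  # no single night ever costs R88
```

The long-run average payment settles close to R88, even though every individual payment is either R60 or R100.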

*Thinking of the integral as the limit of a sum is a great way to remember the linearity of expectations, \mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]

Expectations of other functions

Now that we can find the mean of a random variable, we might want to extend this to functions of that random variable. The expectation of some function, g(X), is:

\mathbb{E}[g(X)] = \sum_x g(x) p(x)  or \mathbb{E}[g(X)] =  \int_x g(x) f(x) dx.

This generalisation can be really useful if we know (or can estimate) the probability distribution of the random variable. The most common example would be g(X) = (\mathbb{E}[X]-X)^2 which would give us the variance.
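As a small worked example, we can reuse the entrance-fee distribution from earlier (R60 with probability 0.3, R100 with probability 0.7) and plug g(X) = (\mathbb{E}[X]-X)^2 into the discrete formula. The helper name `expect` is our own; this is a sketch of the idea rather than a general-purpose routine.

```python
def expect(g, pmf):
    """E[g(X)] for a discrete X; pmf maps each value x to its probability p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

ticket = {60: 0.3, 100: 0.7}

# First the mean, using g(x) = x.
mean = expect(lambda x: x, ticket)

# Then the variance, using g(x) = (E[X] - x)^2.
variance = expect(lambda x: (mean - x) ** 2, ticket)

print(mean, variance)
```

The same `expect` helper handles both calculations; only the function g changes.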

Really, we could put in any function that keeps the integral convergent into the expectation operator.

Say, for example, that we had a business where we sold interesting and wonderful things. Say we also knew the distribution, f, of the number of things that we expected to sell, and that for a given number of things sold, x, we would make a profit of g(x) = x^2 - 7x - 3. Then we could find our expected profit simply by taking the expectation of g. Again, this looks just like taking a probability-weighted average of the function. We will expect to make a large profit if large values of g(x) occur with high probability, i.e. the values of X that give us a large g also give us a large f.
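Here is a sketch of the expected-profit calculation. The sales distribution below is entirely made up for illustration (the post does not specify one); only the profit function g(x) = x^2 - 7x - 3 comes from the example above.

```python
def profit(x):
    """Profit from selling x things, g(x) = x^2 - 7x - 3."""
    return x ** 2 - 7 * x - 3

# Hypothetical distribution of the number of things sold (assumption, not from the post).
sales_pmf = {0: 0.1, 5: 0.2, 10: 0.4, 15: 0.3}

# Expected profit: probability-weighted average of g(x).
expected_profit = sum(profit(x) * p for x, p in sales_pmf.items())

# For comparison: the profit at the expected number of sales.
expected_sales = sum(x * p for x, p in sales_pmf.items())
profit_at_expected_sales = profit(expected_sales)

print(expected_profit, expected_sales, profit_at_expected_sales)
```

Note that the expected profit is not the same as the profit at the expected number of sales: in general \mathbb{E}[g(X)] and g(\mathbb{E}[X]) differ when g is not linear.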

Some examples

Now we can visualise some examples of expectations of functions, which will hopefully give a little more clarity on the topic.

For each of the plots below, the density/mass function for a random variable is shown in black. The red function, g(x) f(x), is the function over which we will integrate in order to find the appropriate expectation: it is the contribution to the probability-weighted average for each value of X.

The value of the expectation is therefore the area under the red curve.

The first plot we consider is the mean of a standard normal distribution. As we can see, the red line shows an odd function, whose integral over a range symmetric about 0 is 0. This makes sense: the distribution is symmetric about 0, so the negative values of x contribute a negative amount to the expectation and the positive values contribute an equal positive amount. As the values of x get bigger, the density tends to 0 much faster than x grows, so very large values of x contribute almost nothing to the expectation (see the areas where the red line is at 0).


[Plot: standard normal density, g(x) = x]
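The cancellation between the two halves of the red curve can be checked numerically. This sketch uses a simple midpoint sum; the integration range of -10 to 10 and the number of steps are assumptions chosen so that the truncated tails are negligible.

```python
import math

def normal_pdf(x, sigma=1.0):
    """Density of a normal distribution with mean 0 and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, lower, upper, steps=100_000):
    """Midpoint Riemann sum of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width

def red(x):
    """The 'red function' for the mean: g(x) * f(x) with g(x) = x."""
    return x * normal_pdf(x)

negative_area = integrate(red, -10.0, 0.0)  # contribution of negative x
positive_area = integrate(red, 0.0, 10.0)   # contribution of positive x
mean = negative_area + positive_area

print(negative_area, positive_area, mean)
```

The two areas have the same size and opposite signs, so their sum, the mean, is zero up to numerical error.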

Our next two plots contrast the variances of two different normal distributions, both centred on 0. The first is the standard normal distribution and the second has a standard deviation of 2. Since the mean is zero, we can find the variance simply by using g(x) = x^2. As we can see, the area under the red curve in the second plot is much greater (in fact, 4 times greater, since the variance is the square of the standard deviation). This is due to the tails of the distribution getting fatter: higher values of x^2 are associated with higher values of the density, resulting in a larger area under the red function. The probability-weighted values of x^2 have increased, in this case because the probability density at those values of x has increased.


[Plot: standard normal density, g(x) = x^2]


[Plot: normal density with standard deviation 2, g(x) = x^2]
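The factor of 4 can be checked by integrating x^2 f(x) numerically for both distributions. As before, this is a sketch: the midpoint sum, the range of -20 to 20, and the step count are our own assumptions.

```python
import math

def normal_pdf(x, sigma):
    """Density of a normal distribution with mean 0 and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, lower, upper, steps=200_000):
    """Midpoint Riemann sum of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width

# Area under the red curve x^2 * f(x) for each distribution.
variance_sd1 = integrate(lambda x: x ** 2 * normal_pdf(x, 1.0), -20.0, 20.0)
variance_sd2 = integrate(lambda x: x ** 2 * normal_pdf(x, 2.0), -20.0, 20.0)

print(variance_sd1, variance_sd2, variance_sd2 / variance_sd1)
```

Doubling the standard deviation quadruples the area under the red curve, which is exactly the variance.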

Finally we look at the Exponential and Poisson distributions (with parameters \lambda = 3 \text{ and } \lambda = 4 respectively) and their expectations. For the exponential distribution we see that as the value of x increases, the density decreases exponentially (unsurprisingly, given the name). Since the value of x increases only linearly, the red function is highest when x is small. So even though the support of the distribution runs from 0 to infinity, most of the expectation comes from a very small part of this range.


[Plot: exponential density with \lambda = 3, g(x) = x]
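We can put a rough number on 'a very small part of this range'. The sketch below integrates the red function x f(x) for the exponential distribution with \lambda = 3. The upper limit of 20 (standing in for infinity), the cut-off at x = 1, and the step count are assumptions made for illustration.

```python
import math

RATE = 3.0

def exponential_pdf(x, rate=RATE):
    """Density of an exponential distribution: rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x)

def integrate(func, lower, upper, steps=200_000):
    """Midpoint Riemann sum of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width

def red(x):
    """The 'red function' for the mean: x * f(x)."""
    return x * exponential_pdf(x)

mean = integrate(red, 0.0, 20.0)           # approximates the integral from 0 to infinity
area_up_to_one = integrate(red, 0.0, 1.0)  # contribution from x between 0 and 1
share_up_to_one = area_up_to_one / mean

print(mean, share_up_to_one)
```

The mean comes out at approximately 1/3, and roughly 80% of it is contributed by values of x between 0 and 1.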

For a Poisson distribution, which is discrete, the expectation is now found by summing the distances from the red points to the horizontal axis. For a Poisson distribution with \lambda = 4, the expected value of X is equal to 4. We can see where this 4 comes from by looking at the plot below. The mass is highest for values of X near 4, so these values contribute the most to the expectation, i.e. 4*p(X=4) is high, as is 5*p(X=5). Looking back at the definition of the expectation, we know that 15*p(X=15) also contributes to it, but looking at the plot we see this value is very small. Why? Because the probability of X=15 is so small that the contribution to the expectation is negligible.


[Plot: Poisson mass function with \lambda = 4, g(x) = x]
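The heights of the red points can be computed directly. In this sketch the sum is truncated at x = 59, an assumption that is harmless here because the remaining terms are vanishingly small; the names are our own.

```python
import math

LAM = 4

def poisson_pmf(k, lam=LAM):
    """Poisson probability mass function: exp(-lam) * lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Height of each red point: x * p(X = x), for x = 0, 1, ..., 59.
contributions = {k: k * poisson_pmf(k) for k in range(60)}

# The expectation is the sum of all the red-point heights.
expectation = sum(contributions.values())

print(expectation)
print(contributions[4], contributions[5], contributions[15])
```

The contributions at 4 and 5 are the largest, while the contribution at 15 is tiny, so the sum of the red-point heights is dominated by values near 4.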


The expectation operator has a number of uses and is a very important concept in statistics. A good way to think about the expectation of a function is as a probability-weighted average of that function, summed (or integrated) over the values that the random variable can take on. This has been demonstrated by creating the ‘red function’, which shows the probability-weighted values. From this we can interpret the expectation as the area under the red curve (for continuous random variables), which can help us to understand what is going on when we take an expectation.
