Prompted by a passing comment made by my Bayesian statistics professor, this brief post discusses the common basis underlying the three measures of central tendency that we typically use to represent a distribution, sample or population: the mean, median and mode.

What are the Mean, Median and Mode?

The mean, median and mode are all presented as estimators of the central tendency of data. A more meaningful and plain way of saying the same thing is that the mean, median and mode are all measures that we can use to represent our data if we want to do so using only a single number.

What is almost never explained is the unifying optimality criterion that underlies these three measures: each minimises the sum (or equivalently the average) of the distances of the data points from the representative value, where the number of points may be (even uncountably) infinite.

The Optimisation Criterion

In order to represent some data, any representative value worth its salt should capture some notion of being the best, or optimal representative of those data. It follows naturally that we should define some difference between the representative and each of the points, and from there we can come up with a notion of the typical difference. This typical difference from each of the points is formalised to give the optimisation criterion for each of the mean, median and mode as measures of centrality.

Formally, if we have a sequence of (univariate) data points, $(x_1, x_2, \dots, x_n)$, we can define the difference between the representative, $r$, and any given point, $x_i$, very straightforwardly as the Euclidean distance. We take the absolute value because we don’t care about the direction of the difference, which corresponds to its sign.

\[D(x_i, r) \triangleq \vert x_i - r \vert\]

This is where the distinguishing features of the mean, median and mode come in. We modify this distance by raising it to some exponent so as to modulate the severity of the penalty paid with increasing distance from the representative value. We’ll call the value of the exponent $p$.

\[D(x_i, r) \triangleq {\vert x_i - r \vert}^p\]

Note that, once summed over the data, these give the familiar $\ell_p$ norms (raised to the $p$-th power) commonly used in machine learning applications.

When we use $p = 2$, we penalise the squared distance or, put another way, we use a squared loss function, and this gives rise to the mean. When we use $p = 1$, we penalise the absolute distance (usually called the “absolute deviation”) and this gives us the median. Finally, when we allow $p$ to go to zero, $p \rightarrow 0$, we obtain the mode: in the limit, ${\vert x_i - r \vert}^p$ tends to $1$ whenever $x_i \neq r$ and equals $0$ when $x_i = r$, so the criterion becomes a zero-one loss that simply counts the points differing from $r$, and it is minimised by the most frequent value.

So overall we have:

  • $p = 2$ $\rightarrow$ mean
  • $p = 1$ $\rightarrow$ median
  • $p \rightarrow 0$ $\rightarrow$ mode

The final, trivial step is to come up with a notion of the typical difference (or distance). We can do this by summing the individual differences observed for each point. This gives the overall formula for our criterion, now in terms of our vector of data points with elements indexed by $i$.

\[D(\mathbf{x}, r) \triangleq \sum_i{\vert x_i - r \vert}^p\]
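
To see concretely why $p = 2$ produces the mean, we can differentiate the criterion with respect to $r$ and set the derivative to zero:

\[\frac{\partial}{\partial r} \sum_i {(x_i - r)}^2 = -2 \sum_i (x_i - r) = 0 \quad \Longrightarrow \quad r = \frac{1}{n} \sum_i x_i\]

An analogous argument for $p = 1$, where the derivative is a sum of signs, $-\sum_i \operatorname{sign}(x_i - r)$, shows that the minimiser must have as many points above it as below it, which is precisely the median.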

Implementing the Measures

Validating mathematical results with computations is a useful way to build confidence and cement your understanding. In that vein, let’s do some basic computations on simulated data to check the above.

First, we can implement the different loss criteria in a single line of R:

# Loss function: summed p-th-power distance of the data from a candidate representative
# x     - the representative value (measure of "central tendency")
# data  - the sequence of data points
# L_p   - the exponent applied to the absolute distance; what we called "p" above

loss_fn <- function(x, data, L_p) sum(abs(data - x)^L_p)
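
As a quick sanity check with a small hypothetical toy vector (not part of the original script), we can evaluate the squared loss at a few candidate values and see that it is smallest at the sample mean:

toy <- c(2, 4, 6, 8)       # small toy data set; its mean is 5

loss_fn(mean(toy), toy, 2) # 20 -- squared loss is minimised at the mean
loss_fn(4, toy, 2)         # 24 -- larger at other candidate values
loss_fn(6, toy, 2)         # 24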

Then, to reassure ourselves that these distinct criteria do in fact yield the estimates we’d expect, we can use numerical optimisation to rediscover the same values returned by R’s mean and median functions. I additionally used some functions I found to get the (first) mode of a distribution, as this is not built into R (see Note 1 below). Lastly, I also wrote a plain wrapper around R’s optim function, which will do the numerical optimisation.

# Mode functions to get the (first) mode for discrete or continuous data

mode_discrete <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mode_continuous <- function(x) {
  # Kernel density estimate over a slightly extended range (fixed bandwidth of 0.2)
  lim.inf <- min(x) - 1
  lim.sup <- max(x) + 1
  s <- density(x, from = lim.inf, to = lim.sup, bw = 0.2)
  # Return the x value at which the estimated density peaks
  s$x[which(s$y == max(s$y))]
}

# Wrapper for optim (numerical optimisation) to simplify code

numerical_solution <- function(L_p, data, fn = loss_fn) {
  # Brent's method needs finite bounds; here we assume the optimum lies in [0, 100]
  optim(1,
        fn = fn, data = data, L_p = L_p,
        method = "Brent", lower = 0, upper = 100)$par
}

Now we’ll set the size of our simulated data vector to one million points, which we’ll use for all the experiments that follow.

n <- 1e6 # sample size (number of simulated values)

Poisson Simulation

Then in our first run, we’ll simulate some random data from a Poisson distribution.

# Poisson Data ------------------------------------------------------------

lambda <- 6 # rate parameter

Poisson_data <- rpois(n, lambda)

mean(Poisson_data)
numerical_solution(2, Poisson_data)

median(Poisson_data)
numerical_solution(1, Poisson_data)

mode_discrete(Poisson_data)
numerical_solution(1e-6, Poisson_data)

We see that the numerical solution finds the same mean of $5.997775$ as R’s built-in function, and the same median of $6$! That said, Brent’s method does struggle on this set of discrete data: the optimisation procedure for the mode returns $6$ instead of the value $5$ that our user-defined function returns.

Gaussian Simulation

The values returned by the central-tendency functions are entirely in line with the results from numerical optimisation when we instead take our data from a Normal distribution.

# Normal (Gaussian) Data --------------------------------------------------

mu <- 12
std_dev <- 3

Gaussian_data <- rnorm(n, mu, std_dev)

mean(Gaussian_data)
numerical_solution(2, Gaussian_data)

median(Gaussian_data)
numerical_solution(1, Gaussian_data)

mode_continuous(Gaussian_data)
numerical_solution(1e-6, Gaussian_data)

There we have it. In the case of the continuous Gaussian data, the mean, median and mode returned by numerical optimisation are all within a hair of the values returned by the functions.

Both means returned are $12.00687$, both medians are $12.00642$ and the modes only differ by a small tolerance: mode_continuous returned $11.96263$ compared to $11.98303$ from the numerical procedure.

Other Distributions: Negative Binomial and Exponential

We can see similarly reassuring results (and similar optimisation issues when dealing with discrete data or with solutions close to zero) when experimenting on other distributions. Here are two more examples, with data simulated from Negative Binomial and Exponential distributions, if you’d like to run them yourself.

# Negative Binomial -------------------------------------------------------

library(stats)

size <- 30
p <- 0.6

neg_binom_data <- rnbinom(n, size = size, prob = p)

mean(neg_binom_data)
numerical_solution(2, neg_binom_data)

median(neg_binom_data)
numerical_solution(1, neg_binom_data)

mode_discrete(neg_binom_data)
numerical_solution(1e-6, neg_binom_data)

# Exponential Distribution ------------------------------------------------

exp_data <- rexp(n, rate = lambda) # simulate data, reusing lambda = 6 from above as the rate

mean(exp_data)
numerical_solution(2, exp_data)

median(exp_data)
numerical_solution(1, exp_data)

mode_continuous(exp_data)
numerical_solution(1e-6, exp_data)

Shape of Our Simulated Data

Below we can see the shapes of the distributions of the data we simulated. These serve as visual evidence that the figures we saw above are in line with what we would expect: eyeballing the plots, the centres of mass correspond roughly to the means, the central values to the medians and the peaks to the modes.

[Figure: Four sets of simulated data from the Poisson, Gaussian, Negative Binomial and Exponential distributions]
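
For completeness, here is a minimal sketch of how such a panel of histograms could be reproduced from the simulated vectors (illustrative only, not taken from the downloadable script):

# Plot the four simulated data sets in a 2x2 grid of histograms
par(mfrow = c(2, 2))
hist(Poisson_data,   breaks = 30, main = "Poisson",           xlab = "value")
hist(Gaussian_data,  breaks = 30, main = "Gaussian",          xlab = "value")
hist(neg_binom_data, breaks = 30, main = "Negative Binomial", xlab = "value")
hist(exp_data,       breaks = 30, main = "Exponential",       xlab = "value")
par(mfrow = c(1, 1)) # reset the plotting layout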

Closing Remarks

As we saw from the numerical procedures, the mean, median and mode are indeed the values returned by solving specific optimisation problems in which the losses are defined as various norms or, equivalently, summed distances.

This leads us to ask what would happen if we found measures of central tendency which optimised for different norms, for example, when $p = 1/2$ or $p = \infty$?

As a last remark, it is worth knowing that all the concepts outlined above have names. The generalised idea of a centroid defined for any given metric space is called a Fréchet mean, where we recall that a metric space is a set with a distance metric defined between any two members (or “points”) in that set.
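
For concreteness, the sample Fréchet mean of points $x_1, \dots, x_n$ in a metric space with distance $d$ is defined, analogously to the $p = 2$ case above, as

\[\underset{r}{\arg\min} \; \sum_{i=1}^{n} {d(x_i, r)}^2\]

and replacing the squared distance with the plain distance gives the corresponding generalisation of the median, often called the Fréchet (or geometric) median.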

This opens up weirder, more general questions: what if we considered the mean, median or mode in other metric spaces, for example under hyperbolic geometry? Stranger still, what if we moved away from geometric spaces altogether and considered the set of (finite) strings, with distances between them defined by the Hamming or Levenshtein distance?

~

The R script to run the above simulations is available for download here and also as a Gist.

References and Notes

John Myles White (2013-03-22). Modes, Medians and Means: A Unifying Perspective. http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/. In his post, White presents a clear perspective on the same concepts I’ve covered above, arguably better than I could. His blog is generally highly recommended.

  1. I got the mode function for discrete data from here, which lists ways to get the first mode and all the modes of a distribution in R. We’ll go with the former, giving only the first, as all the examples in this post use unimodal simulated distributions of data. The mode function for continuous data is taken from here. My thanks to the authors.