How to calculate the five number summary (correctly)?
Note: This explains the so-called Method 2, which
includes the median in the calculation of the quartiles.
The following summary yields a method compatible with the
function fivenum of the statistical package R.
A note on notation
Instead of giving concrete numerical examples, we will assume that the data, sorted in ascending order, are
$$ x_1,\, x_2,\, \ldots,\, x_n $$
The sample size will always be denoted by \(n\).
The basics
The five number summary of a data sample consists of 5 numbers:
- The minimum
- The first quartile (\( Q_1 \))
- The median (\( M \))
- The third quartile (\( Q_3 \))
- The maximum
The "floor" function
If \(x\) is a real number then \(\floor{x}\) is the largest whole number not greater than \(x\). The floor function is available in Microsoft Excel and in most programming languages and software packages. When entering this function from the keyboard, one typically types floor(x).
Examples
- \( \floor{5}=5 \)
- \( \floor{3.14}=3 \)
- \( \floor{-1}=-1 \)
- \( \floor{-1.3}=-2 \)
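The same four examples can be checked in code. This is a minimal Python sketch (Python is chosen only for illustration; the article itself targets R), using the standard library function math.floor:

```python
import math

# math.floor returns the largest integer not greater than its argument.
examples = [5, 3.14, -1, -1.3]
floors = [math.floor(x) for x in examples]
print(floors)  # [5, 3, -1, -2]

# Truncation toward zero is NOT the same as floor for negative numbers.
print(int(-1.3))  # -1
```

The last line is a reminder that converting to an integer by truncation differs from the floor for negative inputs.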
Sorting of the data
The first step is ordering the data from the smallest to the largest.
Determining the positions of the five numbers
Let \(n\) be the sample size. The positions in the sorted dataset are computed based on \(n\) only. The positions may be fractional, that is, they may fall halfway between two real data positions, which are 1, 2, ..., \(n\) (whole numbers). Here are the positions, in order of increasing difficulty:
- The position of the minimum is 1;
- The position of the maximum is \(n\);
- The position of the median is \(\frac{n+1}{2}\) and it is fractional if \(n\) is even;
- The position of \( Q_1 \) is \(n_4 = \frac{\floor{\frac{n+3}{2}}}{2}\);
- The position of \( Q_3 \) is \( n+1-n_4 \).
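The position formulas above can be sketched as a small Python function. The name five_number_positions is hypothetical and introduced only for this illustration; integer division by 2 plays the role of the floor.

```python
def five_number_positions(n):
    """Positions (1-based, possibly ending in .5) of the minimum, Q1,
    median, Q3 and maximum in a sorted sample of size n."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    n4 = ((n + 3) // 2) / 2  # floor((n + 3) / 2) / 2
    return (1, n4, (n + 1) / 2, n + 1 - n4, n)


print(five_number_positions(50))  # (1, 13.0, 25.5, 38.0, 50)
print(five_number_positions(19))  # (1, 5.5, 10.0, 14.5, 19)
```

Only the sample size enters the computation; the data values are not needed at this stage.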
How to calculate the five numbers from their position?
The general rule is simple:
- If a position is a whole number, the corresponding number equals the data value at that position;
- If a position is a fractional number, the corresponding number is obtained by averaging the data at the nearest two whole number positions.
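A minimal Python sketch of this rule follows. The helper name value_at_position is an assumption made for the example. It averages the values at the floor and the ceiling of the position; for a whole-number position the two coincide, so the average is just the value itself.

```python
import math


def value_at_position(sorted_data, pos):
    """Value at a 1-based position in sorted data; a position ending
    in .5 yields the average of its two neighbours."""
    lo = math.floor(pos)
    hi = math.ceil(pos)
    # Python lists are 0-based, hence the "- 1".
    return 0.5 * (sorted_data[lo - 1] + sorted_data[hi - 1])


data = [2, 4, 7, 11]
print(value_at_position(data, 3))    # 7.0
print(value_at_position(data, 2.5))  # 5.5
```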
An interpretation of \(n_4\)
Let us elaborate on the meaning of the formula
$$ n_4 = \frac{\floor{\frac{n+3}{2}}}{2} $$
It is easy to see that this formula is equivalent to:
$$ n_4 = \frac{\floor{\frac{n+1}{2}}+1}{2} $$
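One way to see the equivalence: adding a whole number to \(x\) adds the same whole number to \(\floor{x}\), and \(\frac{n+3}{2} = \frac{n+1}{2} + 1\), so

$$ \floor{\frac{n+3}{2}} = \floor{\frac{n+1}{2} + 1} = \floor{\frac{n+1}{2}} + 1 $$

Dividing both sides by 2 gives the second formula.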
The number \(m=\floor{\frac{n+1}{2}}\) is the right-most
whole-number position not exceeding the position of the median.
When \(n\) is odd, \(m\) is the position of the median itself;
this is why the method explained here is Method 2.
Thus,
$$ n_4 = \frac{m+1}{2} $$
That is, \(n_4\) is the position of the median of all data to the left of the median (including the
median itself when \(n\) is odd). This gives an alternative method to compute the position of \(Q_1\):
Summary of the calculation of \(Q_1\)
- Compute the position of the median first; this is \(\frac{n+1}{2}\);
- Round it down to the nearest integer; this is \(m\) defined above, the right-most whole-number position not exceeding the position of the median (it coincides with the position of the median when \(n\) is odd);
- Compute the position of the median of the data \(x_1,\,x_2,\,\ldots,\,x_m\); this is \(n_4=\frac{m+1}{2}\);
- Compute \(Q_1\): if \(n_4\) is a whole number, \(Q_1 = x_{n_4}\); if \(n_4\) is fractional, \(Q_1 = \frac{1}{2}(x_{n_4'}+x_{n_4''})\), where \(n_4'\) and \(n_4''\) are the two whole numbers nearest to \(n_4\).
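The four steps can be sketched in Python as follows. The function name first_quartile is hypothetical, and the input is assumed to be already sorted in ascending order.

```python
import math


def first_quartile(sorted_data):
    """Q1 computed by the four steps above (input must be sorted)."""
    n = len(sorted_data)
    if n == 0:
        raise ValueError("empty sample")
    median_pos = (n + 1) / 2     # step 1: position of the median
    m = math.floor(median_pos)   # step 2: round it down
    n4 = (m + 1) / 2             # step 3: median position of x_1..x_m
    lo, hi = math.floor(n4), math.ceil(n4)
    # step 4: lo == hi when n4 is whole, so the average is x_{n4} itself
    return 0.5 * (sorted_data[lo - 1] + sorted_data[hi - 1])


data = list(range(1, 20))  # 1, 2, ..., 19: each value equals its position
print(first_quartile(data))  # 5.5
```

Using data whose values equal their positions makes the result easy to read: the printed value is the position \(n_4\) itself.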
Examples
\(n=50\)
We compute \( n_4 \) (the position of the first quartile) first:
$$ n_4 = \frac{\floor{\frac{50+3}{2}}}{2} = \frac{\floor{26.5}}{2} = \frac{26}{2} = 13. $$
Number | Position |
Minimum | 1 |
Maximum | 50 |
Median | \( \frac{50+1}{2}=25.5\) |
\( Q_1 \) | 13 |
\( Q_3 \) | \( 50+1-13=38 \) |
$$ 1,\, 13,\, 25.5,\, 38,\, 50 $$
The five numbers are:
$$ x_1,\, x_{13},\, \frac{x_{25}+x_{26}}{2},\, x_{38},\, x_{50} $$
\( n=65 \)
We compute \( n_4 \) first:
$$ n_4 = \frac{\floor{\frac{65+3}{2}}}{2} = \frac{\floor{34}}{2} = \frac{34}{2} = 17. $$
Number | Position |
Minimum | 1 |
Maximum | 65 |
Median | \( \frac{65+1}{2}=33\) |
\( Q_1 \) | 17 |
\( Q_3 \) | \( 65+1-17=49 \) |
$$ 1,\, 17,\, 33,\, 49,\, 65 $$
The five numbers are:
$$ x_1,\,x_{17},\,x_{33},\,x_{49},\,x_{65} $$
\( n=19 \)
We compute \( n_4 \) first:
$$ n_4 = \frac{\floor{\frac{19+3}{2}}}{2} = \frac{\floor{11}}{2} = \frac{11}{2} = 5.5. $$
Number | Position |
Minimum | 1 |
Maximum | 19 |
Median | \( \frac{19+1}{2}=10 \) |
\( Q_1 \) | 5.5 |
\( Q_3 \) | \( 19+1-5.5=14.5 \) |
$$ 1,\, 5.5,\, 10,\, 14.5,\, 19 $$
The five numbers are:
$$ x_1,\,\frac{x_5+x_6}{2},\, x_{10},\, \frac{x_{14}+x_{15}}{2},\, x_{19} $$
For the curious - the R code of fivenum
The following R session reveals the code of fivenum. The above explanation concerns the 3 lines of R code in the curly braces after the word "else":
> fivenum
function (x, na.rm = TRUE)
{
    xna <- is.na(x)
    if (na.rm)
        x <- x[!xna]
    else if (any(xna))
        return(rep.int(NA, 5))
    x <- sort(x)
    n <- length(x)
    if (n == 0)
        rep.int(NA, 5)
    else {
        n4 <- floor((n + 3)/2)/2
        d <- c(1, n4, (n + 1)/2, n + 1 - n4, n)
        0.5 * (x[floor(d)] + x[ceiling(d)])
    }
}
<environment: namespace:stats>
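For readers who do not use R, here is a rough Python sketch of the same computation. It is not part of R or of any library; it mirrors only the sorting step and the three lines after "else", and it omits the missing-value handling (na.rm) of the original. The one structural difference is the "- 1" needed because Python lists are 0-based while R vectors are 1-based.

```python
import math


def fivenum(x):
    """Five number summary following the logic of R's fivenum
    (sketch only: no handling of missing values)."""
    x = sorted(x)
    n = len(x)
    if n == 0:
        return [math.nan] * 5
    n4 = math.floor((n + 3) / 2) / 2
    d = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # Average the values at floor(d) and ceiling(d), as the R code does.
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in d]


print(fivenum(range(19, 0, -1)))  # [1.0, 5.5, 10.0, 14.5, 19.0]
```

With the data 1, 2, ..., 19 (given here in reverse order to exercise the sort), the output reproduces the positions found in the \( n=19 \) example.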