Introduction
This guide is an introduction to Spearman's rank correlation coefficient, its mathematical calculation, and its computation via Python's pandas library. We'll construct various examples to gain a basic understanding of this coefficient and demonstrate how to visualize the correlation matrix via heatmaps.
What Is the Spearman Rank Correlation Coefficient?
Spearman rank correlation is closely related to the Pearson correlation; both are bounded values, ranging from -1 to 1, that denote a correlation between two variables.
If you'd like to read more about the alternative correlation coefficient, read our Guide to the Pearson Correlation Coefficient in Python.
The Pearson correlation coefficient is computed using raw data values, whereas the Spearman correlation is calculated from the ranks of individual values. While the Pearson correlation coefficient is a measure of the linear relation between two variables, the Spearman rank correlation coefficient measures the monotonic relation between a pair of variables.
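To make the rank idea concrete, here's a minimal sketch, using a handful of hypothetical observations (the X and Y values below are illustrative assumptions), showing that the Spearman coefficient is simply the Pearson coefficient computed on the rank-transformed values:

import pandas as pd

# Hypothetical observations (illustrative only)
df = pd.DataFrame({"X": [10, 20, 30, 40, 50],
                   "Y": [1, 3, 2, 5, 4]})

# Spearman directly...
spearman = df["X"].corr(df["Y"], method="spearman")

# ...equals Pearson computed on the ranks
pearson_on_ranks = df["X"].rank().corr(df["Y"].rank(), method="pearson")

print(spearman, pearson_on_ranks)  # both print 0.8

Ranking discards everything about the data except its ordering, which is why the two calls agree. To understand the Spearman correlation, we need a basic understanding of monotonic functions.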
Monotonic Functions
There are monotonically increasing, monotonically decreasing, and non-monotonic functions.
For a monotonically increasing function, as X increases, Y also increases (and it doesn't have to be linear). For a monotonically decreasing function, as one variable increases, the other one decreases (also doesn't have to be linear). A non-monotonic function is where the increase in the value of one variable can sometimes lead to an increase and sometimes lead to a decrease in the value of the other variable.
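Formally, a function \(f\) is monotonically increasing if, for any pair of inputs, the condition below holds (with the inequality flipped for a monotonically decreasing function):

$$
x_1 < x_2 \implies f(x_1) \le f(x_2)
$$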
The Spearman rank correlation coefficient measures the monotonic relation between two variables. Its value ranges from -1 to +1 and can be interpreted as:
- +1: Perfectly monotonically increasing relationship
- +0.8: Strong monotonically increasing relationship
- +0.2: Weak monotonically increasing relationship
- 0: No monotonic relationship
- -0.2: Weak monotonically decreasing relationship
- -0.8: Strong monotonically decreasing relationship
- -1: Perfectly monotonically decreasing relationship
Mathematical Expression
Suppose we have \(n\) observations of two random variables, \(X\) and \(Y\). We first rank all values of both variables as \(X_r\) and \(Y_r\) respectively. The Spearman rank correlation coefficient is denoted by \(r_s\) and is calculated by:
$$
r_s = \rho_{X_r,Y_r} = \frac{\text{COV}(X_r,Y_r)}{\text{STD}(X_r)\,\text{STD}(Y_r)} = \frac{n\sum\limits_{i=1}^{n} x_{r,i}\, y_{r,i} - \sum\limits_{i=1}^{n}x_{r,i}\sum\limits_{i=1}^{n}y_{r,i}}{\sqrt{n\sum\limits_{i=1}^{n} x_{r,i}^2 -\Big(\sum\limits_{i=1}^{n}x_{r,i}\Big)^2}\sqrt{n\sum\limits_{i=1}^{n} y_{r,i}^2 - \Big(\sum\limits_{i=1}^{n}y_{r,i}\Big)^2}}
$$
Here, COV() is the covariance, and STD() is the standard deviation, with \(x_{r,i}\) and \(y_{r,i}\) denoting the ranks of the \(i\)-th observations of \(X\) and \(Y\).
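When all \(n\) ranks are distinct (no ties), this expression simplifies to the more commonly quoted shortcut, where \(d_i = x_{r,i} - y_{r,i}\) is the difference between the two ranks of the \(i\)-th observation:

$$
r_s = 1 - \frac{6\sum\limits_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$

Before we see Python's functions for computing this coefficient, let's do an example computation by hand to understand the expression and get to appreciate it.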
Example Computation
Suppose we are given some observations of the random variables \(X\) and \(Y\). The first step is to convert \(X\) and \(Y\) to \(X_r\) and \(Y_r\), which represent their corresponding ranks. A few intermediate values, namely the rank sums, the sums of squared ranks, and the sum of rank products, are also needed. Let's use the formula from before to compute the Spearman correlation.
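Below is a minimal NumPy sketch of the hand computation; the observations of X and Y are hypothetical, illustrative values rather than data from the original example. It converts both variables to ranks and plugs them directly into the formula above:

import numpy as np

# Hypothetical observations (illustrative only)
X = np.array([10, 20, 30, 40, 50])
Y = np.array([1, 3, 2, 5, 4])

# Convert raw values to ranks (1 = smallest); this double-argsort
# trick assumes there are no ties among the observations
x_r = X.argsort().argsort() + 1
y_r = Y.argsort().argsort() + 1

n = len(X)
num = n * np.sum(x_r * y_r) - np.sum(x_r) * np.sum(y_r)
den = np.sqrt(n * np.sum(x_r**2) - np.sum(x_r)**2) * \
      np.sqrt(n * np.sum(y_r**2) - np.sum(y_r)**2)
print(num / den)  # 0.8 for these illustrative values

For these values, the ranks of X are simply 1 through 5, and the formula yields \(r_s = 0.8\).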
Great! Though, stepping through the formula like this is time-consuming, and the best use of computers is to, well, compute things for us. Computing the Spearman correlation is really easy and straightforward with built-in functions in Pandas.
Computing the Spearman Rank Correlation Coefficient Using Pandas
The various correlation coefficients, including Spearman, can be computed via the corr() method of a Pandas DataFrame. As an input argument, corr() accepts the method to be used for computing correlation (spearman in our case). The method is called on a DataFrame, say of size m×n, where each column represents the values of a random variable and m represents the total number of samples of each variable.
For n random variables, it returns an n×n square matrix R, where R(i,j) indicates the Spearman rank correlation coefficient between the random variables i and j. As the correlation coefficient between a variable and itself is 1, all diagonal entries (i,i) are equal to unity. In short:
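$$
R =
\begin{bmatrix}
1 & R(1,2) & \cdots & R(1,n) \\
R(1,2) & 1 & \cdots & R(2,n) \\
\vdots & \vdots & \ddots & \vdots \\
R(1,n) & R(2,n) & \cdots & 1
\end{bmatrix}
$$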
Note that the correlation matrix is symmetric, as correlation itself is symmetric, i.e., R(i,j) = R(j,i). Let's take our simple example from the previous section and see how to use Pandas' corr() method:
import numpy as np
import pandas as pd
import seaborn as sns # For pairplots and heatmaps
import matplotlib.pyplot as plt
We'll be using Pandas for the computation itself, Matplotlib with Seaborn for visualization, and NumPy for additional operations on the data.
The code below computes the Spearman correlation matrix on the DataFrame x_simple. Note the ones on the diagonals, indicating that the correlation coefficient of a variable with itself is, naturally, one:
x_simple = pd.DataFrame([(-2,4), (-1,1), (0,3), (1,2), (2,0)],
                        columns=["X","Y"])
my_r = x_simple.corr(method="spearman")
print(my_r)
X Y
X 1.0 -0.7
Y -0.7 1.0
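As a cross-check, and in case you also need a p-value for the correlation, SciPy's spearmanr() function returns the same coefficient together with a significance test (this assumes SciPy is installed alongside Pandas):

from scipy.stats import spearmanr

# Returns the coefficient and the p-value of the associated hypothesis test
rho, p = spearmanr(x_simple["X"], x_simple["Y"])
print(rho)  # -0.7, matching the Pandas result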
Visualizing the Correlation Coefficient
Given the table-like structure of bounded intensities in [-1, 1], a natural and convenient way of visualizing the correlation coefficient is a heatmap.
If you'd like to read more about heatmaps in Seaborn, read our Ultimate Guide to Heatmaps in Seaborn with Python!
A heatmap is a grid of cells, where each cell is assigned a color according to its value, and this visual way of interpreting correlation matrices is much easier for us than parsing numbers. For small tables like the one previously output, reading the numbers directly is perfectly fine. But with a lot of variables, it's much harder to actually interpret what's going on.
Let's define a display_correlation() function that computes the correlation coefficient and displays it as a heatmap:
def display_correlation(df):
    # Compute the Spearman correlation matrix
    r = df.corr(method="spearman")
    plt.figure(figsize=(10,6))
    # Plot the Spearman matrix we just computed (not the default Pearson one)
    sns.heatmap(r, vmin=-1, vmax=1, annot=True)
    plt.title("Spearman Correlation")
    return r
Let's call display_correlation() on our x_simple DataFrame to visualize the Spearman correlation:
r_simple = display_correlation(x_simple)
Understanding the Spearman Correlation Coefficient on Synthetic Examples
To understand the Spearman correlation coefficient, let's generate a few synthetic examples that accentuate how the coefficient works, before we dive into more natural examples. These examples will help us understand for what types of relationships this coefficient is +1, -1, or close to zero.
Before generating the examples, we'll create a new helper function, plot_data_corr(), that calls display_correlation() and plots the data against the X variable:
def plot_data_corr(df,title,color="green"):
r = display_correlation(df)
fig, ax = plt.subplots(nrows=1, ncols=len(df.columns)-1,figsize=(14,3))
for i in range(1,len(df.columns)):
ax[i-1].scatter(df["X"],df.values[:,i],color=color)
ax[i-1].title.set_text(title[i] +'\n r = ' +
"{:.2f}".format(r.values[0,i]))
ax[i-1].set(xlabel=df.columns[0],ylabel=df.columns[i])
fig.subplots_adjust(wspace=.7)
plt.show()
Monotonically Increasing Functions
Let's generate a few monotonically increasing functions, using NumPy, and take a peek at the DataFrame once filled with the synthetic data:
seed = 11
rand = np.random.RandomState(seed)
# Create a data frame using various monotonically increasing functions
x_incr = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_incr["Line+"] = x_incr.X*2+1
x_incr["Sq+"] = x_incr.X**2
x_incr["Exp+"] = np.exp(x_incr.X)
x_incr["Cube+"] = (x_incr.X-5)**3
print(x_incr.head())
| | X | Line+ | Sq+ | Exp+ | Cube+ |
|---|---|---|---|---|---|
| 0 | 1.802697 | 4.605394 | 3.249716 | 6.065985 | -32.685221 |
| 1 | 0.194752 | 1.389505 | 0.037929 | 1.215010 | -110.955110 |
| 2 | 4.632185 | 10.264371 | 21.457140 | 102.738329 | -0.049761 |
| 3 | 7.249339 | 15.498679 | 52.552920 | 1407.174809 | 11.380593 |
| 4 | 4.202036 | 9.404072 | 17.657107 | 66.822246 | -0.508101 |
Now let's look at the Spearman correlation's heatmap and the plot of the various functions against X:
plot_data_corr(x_incr,["X","2X+1","$X^2$","$e^X$","$(X-5)^3$"])
We can see that for all of these examples, there is a perfectly monotonically increasing relationship between the variables. The Spearman correlation is +1, regardless of whether the variables have a linear or a non-linear relationship.
Pearson would've produced very different results here, since it's computed based on the linear relationship between the variables.
As long as Y increases as X increases, without fail, the Spearman rank correlation coefficient will be +1.
Monotonically Decreasing Functions
Let's repeat the same examples on monotonically decreasing functions. We'll again generate synthetic data and compute the Spearman rank correlation. First, let's look at the first five rows of the DataFrame:
# Create a data matrix
x_decr = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_decr["Line-"] = -x_decr.X*2+1
x_decr["Sq-"] = -x_decr.X**2
x_decr["Exp-"] = np.exp(-x_decr.X)
x_decr["Cube-"] = -(x_decr.X-5)**3
x_decr.head()
| | X | Line- | Sq- | Exp- | Cube- |
|---|---|---|---|---|---|
| 0 | 3.181872 | -5.363744 | -10.124309 | 0.041508 | 6.009985 |
| 1 | 2.180034 | -3.360068 | -4.752547 | 0.113038 | 22.424963 |
| 2 | 8.449385 | -15.898771 | -71.392112 | 0.000214 | -41.041680 |
| 3 | 3.021647 | -5.043294 | -9.130350 | 0.048721 | 7.743039 |
| 4 | 4.382207 | -7.764413 | -19.203736 | 0.012498 | 0.235792 |
The correlation matrix's heatmap and the plots of the variables are given below:
plot_data_corr(x_decr,["X","-2X+1","$-X^2$","$e^{-X}$","$-(X-5)^3$"],"blue")
Non-monotonic Functions
The examples below are for various non-monotonic functions. The last column added to the DataFrame is that of an independent variable, Rand, which has no association with X.
These examples should also clarify that Spearman correlation is a measure of monotonicity of a relationship between two variables. A zero coefficient does not necessarily indicate no relationship, but it does indicate that there is no monotonicity between them.
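As a quick numeric illustration of that point, here is a minimal sketch with hypothetical data: Y is fully determined by X, yet the Spearman coefficient comes out at essentially zero because the relationship is not monotonic:

import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-5, 5, 101))
y = x**2  # fully determined by x, but decreasing then increasing

# Essentially zero: the relationship is strong, but not monotonic
print(x.corr(y, method="spearman"))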
Before generating synthetic data, we'll define yet another helper function, display_corr_pairs(), that calls display_correlation() to display the heatmap of the correlation matrix and then plots all pairs of variables in the DataFrame against each other using the Seaborn library.
On the diagonals, we'll display the histogram of each variable in yellow using map_diag(). Below the diagonals, we'll make a scatter plot of all variable pairs. As the correlation matrix is symmetric, we don't need the plots above the diagonals.
Let's also display the Pearson correlation coefficient for comparison:
def display_corr_pairs(df, color="cyan"):
    # Vectorized helper that writes both coefficients into each subplot title;
    # here r holds the Spearman values and rho the Pearson ones
    set_title = np.vectorize(lambda ax, r, rho: ax.title.set_text("r = " +
                             "{:.2f}".format(r) +
                             '\n $\\rho$ = ' +
                             "{:.2f}".format(rho)) if ax is not None else None)
    r = display_correlation(df)
    rho = df.corr(method="pearson")
    g = sns.PairGrid(df, corner=True)
    g.map_diag(plt.hist, color="yellow")
    g.map_lower(sns.scatterplot, color="magenta")
    set_title(g.axes, r, rho)
    plt.subplots_adjust(hspace=0.6)
    plt.show()
We'll create a non-monotonic DataFrame, x_non, with these functions of X:
- Parabola: \( (X-5)^2 \)
- Sin: \( \sin (\frac{X}{10}2\pi) \)
- Frac: \( \frac{X-5}{(X-5)^2+1} \)
- Rand: Random numbers in the range [-1,1]
Below are the first five rows of x_non:
x_non = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_non["Parabola"] = (x_non.X-5)**2
x_non["Sin"] = np.sin(x_non.X/10*2*np.pi)
x_non["Frac"] = (x_non.X-5)/((x_non.X-5)**2+1)
x_non["Rand"] = rand.uniform(-1,1,100)
print(x_non.head())
| | X | Parabola | Sin | Frac | Rand |
|---|---|---|---|---|---|
| 0 | 0.654466 | 18.883667 | 0.399722 | -0.218548 | 0.072827 |
| 1 | 5.746559 | 0.557351 | -0.452063 | 0.479378 | -0.818150 |
| 2 | 6.879362 | 3.532003 | -0.924925 | 0.414687 | -0.868501 |
| 3 | 5.683058 | 0.466569 | -0.416124 | 0.465753 | 0.337066 |
| 4 | 6.037265 | 1.075920 | -0.606565 | 0.499666 | 0.583229 |
The Spearman correlation coefficient between different data pairs is illustrated below:
display_corr_pairs(x_non)
These examples show for what types of data the Spearman correlation is close to zero and where it takes intermediate values. Another thing to note is that the Spearman and Pearson correlation coefficients don't always agree with each other, so a lack of one doesn't imply a lack of the other. They're used to test correlation for different facets of data and can't be used interchangeably. While they will agree in some cases, they won't always.
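Here's a small sketch of that disagreement, using hypothetical data: for an exponential relationship, which is monotonic but strongly non-linear, Spearman reports a perfect +1 while Pearson reports a noticeably weaker value:

import numpy as np
import pandas as pd

x = pd.Series(np.arange(1.0, 11.0))
y = np.exp(x)  # monotonically increasing, but strongly non-linear

print(x.corr(y, method="spearman"))  # 1.0
print(x.corr(y, method="pearson"))   # well below 1.0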
Spearman Correlation Coefficient on Linnerud Dataset
Let's apply the Spearman correlation coefficient on an actual dataset. We have chosen the simple physical exercise dataset called linnerud from the sklearn.datasets package for demonstration:
from sklearn.datasets import load_linnerud
The code below loads the dataset and joins the target variables and attributes in one DataFrame. Let's look at the first five rows of the linnerud data:
d = load_linnerud()
dat = pd.DataFrame(d.data, columns=d.feature_names)
# Join the three physiological targets onto the exercise attributes
alldat = dat.join(pd.DataFrame(d.target, columns=d.target_names))
alldat.head()
| | Chins | Situps | Jumps | Weight | Waist | Pulse |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 162.0 | 60.0 | 191.0 | 36.0 | 50.0 |
| 1 | 2.0 | 110.0 | 60.0 | 189.0 | 37.0 | 52.0 |
| 2 | 12.0 | 101.0 | 101.0 | 193.0 | 38.0 | 58.0 |
| 3 | 12.0 | 105.0 | 37.0 | 162.0 | 35.0 | 62.0 |
| 4 | 13.0 | 155.0 | 58.0 | 189.0 | 35.0 | 46.0 |
Now, let's display the correlation pairs using our display_corr_pairs() function:
display_corr_pairs(alldat)
Looking at the Spearman correlation values, we can draw interesting conclusions such as:
- Higher waist values are associated with higher weight values (from r = 0.81)
- More situps are associated with lower waist values (from r = -0.72)
- Chins, situps, and jumps don't seem to have a monotonic relationship with pulse, as the corresponding r values are close to zero (these values can also be read straight off the correlation matrix, as shown below)
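To read these numbers off the correlation matrix directly, rather than from the plot titles, you can index the DataFrame returned by corr():

r = alldat.corr(method="spearman")
print(r.loc["Waist", "Weight"])  # ~0.81
print(r.loc["Situps", "Waist"])  # ~-0.72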
Conclusions
In this guide, we discussed the Spearman rank correlation coefficient, its mathematical expression, and its computation via Python's pandas library.
We demonstrated this coefficient on various synthetic examples and also on the Linnerud dataset. The Spearman correlation coefficient is an ideal measure for assessing the monotonicity of the relationship between two variables. However, a value close to zero does not necessarily indicate that the variables have no association between them.