Checking for and quantifying correlation is one of the key steps in exploratory data analysis and in forming hypotheses.
Pandas is one of the most widely used data manipulation libraries, and it makes calculating correlation coefficients between all numerical variables straightforward, with a single method call.
For more detailed and in-depth guides to Spearman and Pearson correlations, read our "Calculating Spearman's Rank Correlation Coefficient in Python with Pandas" and "Calculating Pearson Correlation Coefficient in Python with Numpy"!
Let's load in a dataset from Scikit-Learn and pack it into a DataFrame:

    import pandas as pd
    import numpy as np
    from sklearn.datasets import fetch_california_housing

    # Target column is under ch.target, the rest is under ch.data
    ch = fetch_california_housing(as_frame=True)
    df = pd.DataFrame(data=ch.data, columns=ch.feature_names)
    df['MedHouseVal'] = ch.target
    df.head()

It's loaded in correctly!
Get All Correlation Coefficients
Now, to get the correlations between all of the numerical features, we simply call df.corr(), which defaults to the Pearson correlation.
The method call returns a DataFrame of correlation coefficients, with the feature names as both its rows and its columns.
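One caveat worth knowing: in pandas 2.0 and later, corr() no longer drops non-numeric columns by default, so a DataFrame that contains text columns can raise an error. The California housing data is entirely numeric, so it isn't affected. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical) to show one version-independent way of restricting the calculation to numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'rooms': rng.normal(5, 1, 100),
    'income': rng.normal(3, 1, 100),
    'city': ['A', 'B'] * 50,
})

# Keep only the numeric columns before computing correlations
numeric = toy.select_dtypes(include='number')
corr_matrix = numeric.corr()

# The result is square: one row and one column per numeric feature
print(corr_matrix.shape)  # (2, 2)
```

The diagonal of the result is always 1, since every column correlates perfectly with itself, and the matrix is symmetric around that diagonal.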
A table of numbers isn't very intuitive or readable, though, so let's plot it as a heatmap:

    import seaborn as sns
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    # annot=True writes each coefficient into its cell
    sns.heatmap(df.corr(), ax=ax, annot=True)

Get Correlation to Target Variable
Say we're interested in a single target variable and would like to see which features correlate with it. We'll calculate the correlations with df.corr() and then subset the resulting DataFrame to only include the target column:

    # Double brackets keep the result a DataFrame, which heatmap() expects
    corr = df.corr()[['MedHouseVal']]
    sns.heatmap(corr, annot=True)

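If only the target column is needed, computing the full matrix does more work than necessary. As a possible alternative, the sketch below uses DataFrame.corrwith() to correlate each feature with the target directly. The data here is synthetic and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the target, one opposes it, one is noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'feat_pos': x,
    'feat_neg': -x + rng.normal(scale=0.5, size=200),
    'feat_noise': rng.normal(size=200),
})
toy['target'] = 2 * x + rng.normal(scale=0.5, size=200)

# Correlate every feature column with the target Series
features = toy.drop(columns='target')
target_corr = features.corrwith(toy['target'])

# Same numbers as the target column of the full matrix, minus the target itself
full = toy.corr()['target'].drop('target')
print(target_corr)
```

corrwith() returns a Series rather than a DataFrame, so it would need something like target_corr.to_frame('target') before being passed to sns.heatmap().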
Sort Correlation Coefficients
More often than not, you'll want to sort the values as well:

    corr = df.corr()[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)
    sns.heatmap(corr, annot=True)

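A plain descending sort pushes strong negative correlations to the bottom, even though they can be just as informative as strong positive ones. One option, sketched below with made-up coefficients (not values from the housing dataset), is to order the rows by absolute value while keeping the original signs:

```python
import pandas as pd

# Hypothetical coefficients against a target column
corr = pd.DataFrame(
    {'target': [0.30, -0.80, 0.05, -0.40]},
    index=['feat_a', 'feat_b', 'feat_c', 'feat_d'],
)

# Rank rows by the magnitude of the coefficient, strongest first
order = corr['target'].abs().sort_values(ascending=False).index

# Reindex the original frame so the signs are preserved
by_strength = corr.reindex(order)
print(by_strength)
```

Here feat_b comes first despite being negative, because its magnitude is the largest.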
Pearson, Spearman and Kendall Rank Coefficients with Pandas
The corr() method accepts three coefficient methods: pearson, spearman and kendall. Let's plot all three side by side:

    fig, ax = plt.subplots(1, 3, figsize=(18, 8))

    corr1 = df.corr('pearson')[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)
    corr2 = df.corr('spearman')[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)
    corr3 = df.corr('kendall')[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)

    # plt.subplots(1, 3) returns an array of three Axes, one per heatmap
    sns.heatmap(corr1, ax=ax[0], annot=True)
    sns.heatmap(corr2, ax=ax[1], annot=True)
    sns.heatmap(corr3, ax=ax[2], annot=True)

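To see why the choice of method matters, consider a relationship that is perfectly monotonic but not linear. The sketch below uses synthetic data (not the housing dataset) in which y grows exponentially with x. Pearson measures linear association, so it comes out below 1, while the rank-based Spearman coefficient is 1:

```python
import numpy as np
import pandas as pd

# y increases with x everywhere, but along a curve rather than a line
x = np.linspace(0, 5, 50)
toy = pd.DataFrame({'x': x, 'y': np.exp(x)})

# Pull the x-y coefficient out of each 2x2 correlation matrix
pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print(f'Pearson:  {pearson:.3f}')
print(f'Spearman: {spearman:.3f}')
```

When the two coefficients disagree noticeably for a pair of features, that can hint at a non-linear relationship worth plotting directly.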