Calculate Correlation of DataFrame Features/Columns with Pandas

Checking for and quantifying correlation is one of the key steps in exploratory data analysis and in forming hypotheses.

Pandas is one of the most widely used data manipulation libraries, and it makes calculating correlation coefficients between all numerical variables straightforward, with a single method call.

Let's load in a dataset from Scikit-Learn and pack it into a DataFrame:

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

# With as_frame=True, ch.data is already a DataFrame of features
# and ch.target is a Series holding the target column
ch = fetch_california_housing(as_frame=True)
df = ch.data.copy()
df['MedHouseVal'] = ch.target

df.head()

It's loaded in correctly!

Get All Correlation Coefficients

Now, to get the correlations between all of the numerical features, we simply call df.corr() (which defaults to Pearson Correlation):

df.corr()

The method call returns a square DataFrame of correlation coefficients, with the original column names as both its row index and its columns.
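To make the shape of that result concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data and its column names are hypothetical, not part of the housing dataset). It also selects the numeric columns explicitly first, since correlation is only defined for numeric data and a text column would otherwise get in the way:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the shape of the result
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 5, 6],
    "price": [100, 150, 210, 240, 310],
    "city": ["a", "b", "a", "c", "b"],  # non-numeric column
})

# Keep only the numeric columns before computing the correlations
corr_matrix = toy.select_dtypes(include="number").corr()

print(corr_matrix)
```

The result has one row and one column per numeric feature, is symmetric, and has ones on the diagonal, since every column correlates perfectly with itself.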

Since a table of numbers isn't very intuitive or readable, let's plot it as a heatmap:

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

sns.heatmap(df.corr(), ax=ax, annot=True)
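Because a correlation matrix is symmetric, half of the heatmap repeats the other half. One common way to reduce the clutter is to mask the cells above the diagonal. The sketch below builds such a mask with NumPy on a hypothetical `toy` frame and previews the effect without plotting:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 5],
    "c": [5, 3, 4, 1, 2],
})
corr_matrix = toy.corr()

# Boolean mask: True strictly above the diagonal (k=1 keeps the diagonal visible)
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)

# Replace the masked cells with NaN to preview what a masked heatmap would show
lower_triangle = corr_matrix.mask(mask)
print(lower_triangle)
```

A mask built the same way from the shape of df.corr() can be passed to Seaborn through the mask argument, as in sns.heatmap(df.corr(), mask=mask, annot=True), so that only the lower triangle is drawn.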

Get Correlation to Target Variable

Say we're interested in a single target variable and would like to see which features correlate with it. We'll calculate the correlations with df.corr() and then subset the resulting DataFrame to only include the target column:

corr = df.corr()[['MedHouseVal']]
sns.heatmap(corr, annot=True)
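If only the target column is needed, computing the full matrix is optional. As an alternative sketch, Pandas also provides corrwith(), which correlates each column with a single Series. The `toy` frame and its column names below are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with a "value" column playing the role of the target
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age": [30, 10, 50, 20, 40],
    "value": [1.1, 1.9, 3.2, 3.9, 5.1],
})

# Correlate every feature column with the target Series
features = toy.drop(columns="value")
target_corr = features.corrwith(toy["value"])

# Same numbers as taking the target column of the full matrix
from_matrix = toy.corr()["value"].drop("value")

print(target_corr)
```

The result is a Series indexed by feature name. To feed it to sns.heatmap(), which expects two-dimensional data, it can be converted with target_corr.to_frame('value').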

Sort Correlation Coefficients

More often than not, you'll want to sort the values as well:

corr = df.corr()[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)
sns.heatmap(corr, annot=True)
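Sorting by the raw value puts strong negative correlations at the bottom, even though they can be as informative as strong positive ones. A possible variation, sketched here with hypothetical coefficient values, is to order by absolute value while keeping the original signs:

```python
import pandas as pd

# Hypothetical correlation coefficients, for illustration only
target_corr = pd.Series(
    {"feat_a": 0.69, "feat_b": -0.75, "feat_c": 0.10, "feat_d": -0.05},
    name="target",
)

# Rank by strength (absolute value), then reorder the signed values accordingly
order = target_corr.abs().sort_values(ascending=False).index
by_strength = target_corr.reindex(order)

print(by_strength)
```

Here feat_b comes first because its magnitude is the largest, even though its coefficient is negative.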

Pearson, Spearman and Kendall Rank Coefficients with Pandas

The corr() method accepts three built-in coefficient methods, 'pearson', 'spearman' and 'kendall' (a custom callable can be passed as well):

fig, ax = plt.subplots(1,3, figsize=(18, 8))

corr1 = df.corr('pearson')[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)
corr2 = df.corr('spearman')[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)
corr3 = df.corr('kendall')[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False)


sns.heatmap(corr1, ax=ax[0], annot=True)
sns.heatmap(corr2, ax=ax[1], annot=True)
sns.heatmap(corr3, ax=ax[2], annot=True)
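The coefficients can differ because they measure different things: Pearson captures linear association, while the rank-based coefficients capture monotonic association. As a small sketch on hypothetical data (a column and its cube, a relationship that always increases but is not linear), compare Pearson and Spearman:

```python
import pandas as pd

# Hypothetical data: x_cubed grows monotonically with x, but not linearly
x = list(range(1, 11))
toy = pd.DataFrame({"x": x, "x_cubed": [v ** 3 for v in x]})

pearson = toy.corr(method="pearson").loc["x", "x_cubed"]
spearman = toy.corr(method="spearman").loc["x", "x_cubed"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Spearman comes out at 1 because the ranks of the two columns match exactly, while Pearson stays below 1 because the points do not lie on a straight line.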

Last Updated: July 5th, 2022
Author: David Landup
