Agglomerative Hierarchical Clustering in Python with Scikit-Learn

Agglomerative Hierarchical Clustering is an unsupervised learning algorithm that links data points based on distance to form a cluster, and then links those already clustered points into another cluster, creating a structure of clusters with sub-clusters.

It is easily implemented using Scikit-Learn which already has single, average, complete and ward linking methods available.

If you'd like to read an in-depth guide to Hierarchical Clustering, read our Hierarchical Clustering with Python and Scikit-Learn"!

To visualize the hierarchical structure of clusters, you can load the Palmer Penguins dataset, choose the columns that will be clustered, and use SciPy to plot a Dendrogram of the sub-clusters.

Note: You can download the dataset from this link.

Let's import the libraries and load the Penguins dataset, trimming it to the chosen columns and dropping rows with missing data (there were only 2):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

df = pd.read_csv('penguins.csv')
print(df.shape) # (344, 9)
df = df[['bill_length_mm', 'flipper_length_mm']]
df = df.dropna(axis=0)

We can use SciPy's hierarchy.linkage() to form clusters and plot them with hierarchy.dendrogram():

clusters = hierarchy.linkage(df, method="ward")

plt.figure(figsize=(8, 6))
dendrogram = hierarchy.dendrogram(clusters)
# Plotting a horizontal line based on the first biggest distance between clusters 
plt.axhline(150, color='red', linestyle='--'); 
# Plotting a horizontal line based on the second biggest distance between clusters 
plt.axhline(100, color='crimson'); 

This example shows how the Dendrogram is only a reference when used to choose the number of clusters. We already know that we have 3 types of penguins in the dataset, but if we were to determine their number by the Dendrogram, 2 would be our first option, and 3 would be our second option.

Now, let's perform Agglomerative Clustering with Scikit-Learn to find cluster labels for the three types of penguins:

Get free courses, guided projects, and more

No spam ever. Unsubscribe anytime. Read our Privacy Policy.

clustering_model = AgglomerativeClustering(n_clusters=3, linkage="ward")
clustering_model.fit(df)
labels = clustering_model.labels_

And plot the data before and after Agglomerative Clustering with 3 clusters:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_model.labels_).set_title('With clustering');

When using Agglomerative Clustering, you don't need to predetermine the number of clusters. As we have seen in the Dendrogram, if we don't determine how many clusters we aim to have, the algorithm will usually divide points into 2 clusters.

Let's try Agglomerative Clustering without specifying the number of clusters, and plot the data without Agglomerative Clustering, with 3 clusters and with no predefined clusters:

clustering_model_no_clusters = AgglomerativeClustering(linkage="ward")
clustering_model_no_clusters.fit(df)
labels_no_clusters = clustering_model_no_clusters.labels_

And finally, let's plot the data without Agglomerative Clustering, with 3 clusters and with no predefined clusters:

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering.labels_).set_title('With 3 clusters')
sns.scatterplot(ax=axes[2], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_model_no_clusters.labels_).set_title('Without choosing number of clusters');
Last Updated: November 8th, 2023
Was this helpful?
Cássia SampaioAuthor

Data Scientist, Research Software Engineer, and teacher. Cassia is passionate about transformative processes in data, technology and life. She is graduated in Philosophy and Information Systems, with a Strictu Sensu Master's Degree in the field of Foundations Of Mathematics.

Project

Bank Note Fraud Detection with SVMs in Python with Scikit-Learn

# python# machine learning# scikit-learn# data science

Can you tell the difference between a real and a fraud bank note? Probably! Can you do it for 1000 bank notes? Probably! But it...

David Landup
Cássia Sampaio
Details

© 2013-2024 Stack Abuse. All rights reserved.

AboutDisclosurePrivacyTerms