# Agglomerative Hierarchical Clustering in Python with Scikit-Learn

Agglomerative Hierarchical Clustering is an unsupervised learning algorithm that starts with every data point as its own cluster and repeatedly merges the closest clusters based on distance, creating a hierarchy of clusters with subclusters.
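To make the merging process concrete, here is a minimal from-scratch sketch of single-linkage agglomeration on five one-dimensional points. The function name `single_linkage_merges` and the sample points are made up for illustration; this is not how Scikit-Learn or SciPy implement it internally, and it is far too slow for real datasets.

```python
from itertools import combinations

def single_linkage_merges(points):
    """Hypothetical helper: return the merges as (cluster_a, cluster_b, distance)."""
    # Every point starts as its own cluster, identified by its index
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Single linkage: cluster distance = distance between closest members
        for a, b in combinations(clusters, 2):
            d = min(abs(points[i] - points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

points = [1.0, 2.0, 6.0, 7.0, 15.0]
merges = single_linkage_merges(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Each printed line corresponds to one junction in a dendrogram: the later the merge, the larger the distance at which it happens.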

It is easily implemented using Scikit-Learn, which already provides the single, average, complete and ward linkage methods.
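As a rough illustration of switching between these linkage methods, the sketch below fits `AgglomerativeClustering` once per method. The data is an assumption: three synthetic, well-separated blobs generated with NumPy stand in for a real dataset, so all four methods should agree here, which will not generally be the case on messier data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Stand-in data: three well-separated 2D blobs of 30 points each
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [10, 0], [0, 10])
])

cluster_sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    # Count how many points ended up in each cluster
    cluster_sizes[linkage] = sorted(np.bincount(labels).tolist())
    print(linkage, cluster_sizes[linkage])
```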

If you'd like to read an in-depth guide to Hierarchical Clustering, read our "Hierarchical Clustering with Python and Scikit-Learn"!

To visualize the hierarchical structure of clusters, you can load the *Palmer Penguins* dataset, choose the columns that will be clustered, and use SciPy to plot a Dendrogram of the subclusters.

**Note**: You can download the dataset from this link.

Let's import the libraries and load the Penguins dataset, trimming it to the chosen columns and dropping rows with missing data (there were only 2):

```
# Data handling and plotting
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Clustering and dendrogram utilities
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

df = pd.read_csv('penguins.csv')
print(df.shape) # (344, 9)
# Keep only the two features to cluster on and drop rows with missing values
df = df[['bill_length_mm', 'flipper_length_mm']]
df = df.dropna(axis=0)
```

We can use SciPy's `hierarchy.linkage()` to form clusters and plot them with `hierarchy.dendrogram()`:

```
# Build the hierarchy with Ward linkage
clusters = hierarchy.linkage(df, method="ward")
plt.figure(figsize=(8, 6))
dendrogram = hierarchy.dendrogram(clusters)
# Horizontal line through the largest distance between clusters
plt.axhline(150, color='red', linestyle='--');
# Horizontal line through the second-largest distance between clusters
plt.axhline(100, color='crimson');
```

This example shows that the Dendrogram is only a reference when choosing the number of clusters. We already know that the dataset contains 3 species of penguins, but judging by the Dendrogram alone, 2 clusters would be our first choice and 3 our second.
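Reading a cut off the Dendrogram can also be done programmatically. The sketch below uses SciPy's `hierarchy.fcluster()` to cut the tree at a fixed height and to request a maximum number of flat clusters, and then applies a simple "largest gap between merge distances" heuristic. The data is synthetic stand-in data, and the cut height of 30 is an assumption that only makes sense for these blobs, not for the penguin measurements.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
# Stand-in data: three well-separated 2D blobs of 30 points each
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [10, 0], [0, 10])
])

Z = hierarchy.linkage(X, method="ward")

# Cutting at a fixed height is the programmatic version of drawing
# a horizontal line across the dendrogram
labels_at_height = hierarchy.fcluster(Z, t=30, criterion="distance")
print(len(np.unique(labels_at_height)))

# Alternatively, ask for at most 3 flat clusters
labels_k3 = hierarchy.fcluster(Z, t=3, criterion="maxclust")

# Heuristic: cut inside the largest jump between consecutive merge distances
merge_distances = Z[:, 2]
gaps = np.diff(merge_distances)
suggested_k = len(X) - (int(np.argmax(gaps)) + 1)
print(suggested_k)
```

On clean synthetic blobs the heuristic recovers the number of blobs. On the penguin data, as noted above, the largest gap points to 2 clusters rather than 3, which is why the Dendrogram should be treated as a guide and not as a definitive answer.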

Now, let's perform Agglomerative Clustering with Scikit-Learn to find cluster labels for the three types of penguins:

```
clustering_model = AgglomerativeClustering(n_clusters=3, linkage="ward")
clustering_model.fit(df)
labels = clustering_model.labels_
```

And plot the data before and after Agglomerative Clustering with 3 clusters:

```
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_model.labels_).set_title('With clustering');
```

With Scikit-Learn's `AgglomerativeClustering`, specifying the number of clusters is optional. If you leave `n_clusters` out, the model falls back to its default value of 2, which happens to match the first option suggested by the Dendrogram.

Let's try Agglomerative Clustering without specifying the number of clusters:

```
clustering_model_no_clusters = AgglomerativeClustering(linkage="ward")
clustering_model_no_clusters.fit(df)
labels_no_clusters = clustering_model_no_clusters.labels_
```
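If the goal is to let the tree itself decide how many clusters there are, rather than rely on the default, one option is to pass `n_clusters=None` together with a `distance_threshold`. The sketch below assumes a reasonably recent Scikit-Learn version, uses synthetic stand-in data, and picks a threshold of 30 that only suits these blobs; a threshold for the penguin data would have to be read off its own Dendrogram.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Without arguments, the model is configured for 2 clusters
default_n_clusters = AgglomerativeClustering().n_clusters
print(default_n_clusters)

rng = np.random.default_rng(0)
# Stand-in data: three well-separated 2D blobs of 30 points each
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [10, 0], [0, 10])
])

# Stop merging once clusters are at least `distance_threshold` apart
threshold_model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=30, linkage="ward"
)
threshold_labels = threshold_model.fit_predict(X)
# The number of clusters found is exposed after fitting
print(threshold_model.n_clusters_)
```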

And finally, let's plot the data without Agglomerative Clustering, with 3 clusters and with no predefined clusters:

```
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_model.labels_).set_title('With 3 clusters')
sns.scatterplot(ax=axes[2], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_model_no_clusters.labels_).set_title('Without choosing number of clusters');
```

Data Scientist, Research Software Engineer, and teacher. Cassia is passionate about transformative processes in data, technology and life. She graduated in Philosophy and Information Systems, and holds a Stricto Sensu Master's Degree in the field of Foundations of Mathematics.
