Minimum Spanning Trees - Kruskal's Algorithm

Kruskal's algorithm is one of the three most famous algorithms for finding a minimum spanning tree (MST) in a graph.

Kruskal's algorithm is a greedy algorithm that finds a globally optimal solution by finding small, local optimums and combining them. Besides that, it is still pretty useful and widely spread.

In this lesson, we'll illustrate and explain how Kruskal's algorithm works on a practical example, and then give you its detailed implementation in Python.

We've already covered a lot of topics regarding graphs in Python. If you are interested to learn more about specific topics or algorithms, you should definitely read some of them:

Note: In this lesson, we'll explain only concepts that are necessary to understand Kruskal's algorithm for finding the MST in a graph.
All the other concepts are explained in the lesson "Graph Theory and Graph-Related Algorithms", so you should definitely read it beforehand.

What is a Minimum Spanning Tree?

In simple terms, a minimum spanning tree is a tree constructed from a weighted, undirected graph, so it:

Connects all nodes (also referred to as vertices)
Has no cycles
Has the smallest possible sum of edge weights

Let's examine this informal definition to clarify all possible misconceptions about defined conditions.

The first important fact is that you can construct MST only on a weighted, undirected graph. That means that each edge of the graph has a numeric weight assigned to it and that you can traverse each edge from both directions (unlike directed graphs).

MST connects all nodes from the graph, meaning that you must be able to access any single node from every other node in the graph. Obviously, you can exclude some edges from the graph. In fact, excluding unnecessary edges from the graph reduces the sum of all edge weights, which increases the chance that the produced tree becomes the MST.

The MST must have the smallest possible sum of edge weights. Sometimes there are multiple trees with the same sum of edge weights. If that sum is the smallest possible, any one of them can be the MST.

The last fact is almost self-explanatory. As you probably know, trees are a special kind of graph. They are, in simple terms, connected graphs without cycles, and an MST is a type of tree, which implies that it must not have any cycles.

Important property: If a graph has n nodes, each spanning tree (including MST) has n-1 edges.

Kruskal's Algorithm Explained

Alongside Prim and Borůvka, Kruskal's algorithm is one of the most famous algorithms for finding the minimum spanning tree of weighted, undirected graphs. In this lesson, we'll use the same graph as in the lesson which describes Borůvka's algorithm:

Note: Only in this section we've marked each node with a letter instead of a number for better readability.

The idea behind Kruskal's algorithm is pretty simple, there are essentially two steps that you repeat until you get the MST:

sort all edges by their weight in the ascending order
pick the edge with the smallest weight and try to add it to the tree
- if it forms a cycle, skip that edge
repeat these steps until you have a connected tree that covers all nodes

For this particular graph, the sorted list of all edges will look like this:

Edge	Weight
EF	1
CF	1
DE	2
FH	3
AB	4
EH	5
GI	5
DG	6
AC	7
BD	9
EG	10
BC	11
HI	12
EI	15
BF	20

Edges EF and CF have the same weight, so you can choose either one of them as the starting edge. We'll assume that you picked the EF. At this moment, the tree that you will use to construct the MST is empty. Now, you try to add EF to the tree and check if it forms a cycle.

As you can see, adding EF didn't form any cycles, so you can keep it in the tree. Then you repeat these steps for every edge in the sorted list. Of course, you firstly check the edge with the minimum weight, and so on.

The interesting situation occurs when you try to add the EH to the tree:

Adding EH to the tree will create a cycle - EFH. Therefore, you must exclude EH from the tree, and check if you can include the next edge with the minimal weight - GI. After you add GI to the tree, all nodes are visited:

Although you visited all nodes, the tree is not connected yet. You can notice three distinct subtrees - {A, B}, {C, D, E, F, H}, and {G, I}. Those subtrees form a minimum spanning forest that covers all nodes of the initial graph but isn't an MST. MST must connect all nodes of the initial graph, so you must continue adding edges until the forest becomes one connected tree. Kruskal's algorithm ensures that those added edges are of minimum possible weight.

Note: Essentially, you can think of Kruskal's algorithm as an algorithm that forms subtrees and connects them with the edges of the minimal possible weight. That way, it creates the MST.

A few steps down the line, after adding edges DG and AC to the forest illustrated in the previous illustration, that forest becomes a connected tree. In fact, it becomes the miniumum spanning tree of the initial graph.

This process can really be appreciated through an animation:

How to Implement Kruskal's Algorithm in Python

In this section, we'll use the same graph as in the previous section. The only difference is that we'll mark each node with a number (starting from zero) instead of a letter for sake of simpler implementation.

Implementing a `Graph` Class

The first thing you need to implement is a Graph class. It represents a graph and defines all methods you might need when working with graphs.

class Graph:
    def __init__(self, num_of_nodes):
        self.m_num_of_nodes = num_of_nodes
        self.m_graph = []

    def add_edge(self, node1, node2, weight):
        self.m_graph.append([node1, node2, weight])

This is a fairly generic implementation of a Graph class but will have a few tweaks to make it compatible with Kruskal's algorithm. We'll introduce those tweaks in the following subsections.

As you probably already know, __init__ is effectively a constructor of any given class in Python. In this case, you construct a graph by calling Graph(num_of_nodes). The only two values that are stored in a Graph class are the nuber of nodes in the graph (m_num_of_nodes) and an array of edges (m_graph). That array of edges is the data structure you will use to store the graph in this example.

Every edge of any weighted graph connects exactly two nodes and has a certain weight assigned to it. The add_edge(node1, node2, weight) method illustrates that property. It adds the edge to a Graph object in the following form - [node1, node2, weight]. This is the simplest and very effective way to represent a graph in Python.

Besides constructor and addEgde(), there are a couple more methods that you will have to implement. To understand how and why we'll first explain some key areas of the implementation so you have a base knowledge before we introduce those new methods. So, let's begin.

Auxiliary Arrays - `parent` and `subtree_sizes`

Implementing Kruskal's algorithm itself should be a fairly straightforward thing to do. There are only a couple of key points that you should keep in mind. First of all, assuming that you already have a list of edges (represented in a previously described way), you should sort it from the smallest edge weight to the largest.

After that, you should initialize and maintain two auxiliary arrays - parent and subtree_sizes. The way they are constructed is of great importance, so follow along carefully. Both of them have the size that corresponds to the number of nodes in the initial graph (i.e. if the initial graph has n nodes, those two arrays have n elements).

That way, each index of those arrays directly corresponds to exactly one node from the graph. For example, you can access every single piece of information you might need about node 3 by calling parents[3] and subtree_sizes[3].

Now that you understand how to construct those two arrays, let's explain why do we need them at all. As we've stated before, Kruskal's algorithm effectively creates a set of subtrees (or a forest) and connects them with the edges of the minimal possible weight. In the beginning, you consider each individual node to be a separate tree and start connecting them. You connect subtrees until you get one tree that connects all nodes of the initial graph.

That's where the parent array comes into place. As you already know, the indices of that array represent nodes (e.g. index 3 refers to node 3 from the graph). In the parent array, you keep information about which node belongs to which subtree. Since you consider every node to be a separate subtree in the beginning, the parent array will look like the following:

As you can see, each node is the parent to itself, and the resulting tree is empty. When you chose the edge with the minimum weight and add it to the resulting tree, you actually connect two of the starting subtrees.

In the example graph, the edge with the minimal weight is the one that connects nodes 5 and 2, which are both subtrees of the initial graph. Connecting them forms a new, larger subtree, that contains nodes 2 and 5. The parent array reflects that by assigning node 2 as the parent node of node 5.

This continues until you have one single tree instead of multiple smaller subtrees.

To help you maintain parent and subtree_sizes arrays in a described way, you should implement following methods in the Graph class:

# Finds the root node of a subtree containing node `i`
def find_subtree(self, parent, i):
    if parent[i] == i:
        return i
    return self.find_subtree(parent, parent[i])

# Connects subtrees containing nodes `x` and `y`
def connect_subtrees(self, parent, subtree_sizes, x, y):
    xroot = self.find_subtree(parent, x)
    yroot = self.find_subtree(parent, y)
    if subtree_sizes[xroot] < subtree_sizes[yroot]:
        parent[xroot] = yroot
    elif subtree_sizes[xroot] > subtree_sizes[yroot]:
        parent[yroot] = xroot
    else:
        parent[yroot] = xroot
        subtree_sizes[xroot] += 1

The find_subtree() method recursively searches the parentarray and finds a subtree that contains node i.

The connect_subtrees() method connects two subtrees. One subtree contains node x and the other contains node y. Firstly, it finds those two subtrees, then compares their sizes, and connects the smaller subtree to a larger one.

Now that we've coded and explained the implementation of the Graph class alongside all of the additional methods, we can take a look at how to implement Kruskal's algorithm itself.

Kruskal's MST Algorithm

With all other methods from a Graph class explained, the implementation of Kruskal's algorithm should be fairly easy to understand. We've opted to implement the algorithm as the method in a Graph class:

def kruskals_mst(self):
    # Resulting tree
    result = []
    
    # Iterator
    i = 0
    # Number of edges in MST
    e = 0

    # Sort edges by their weight
    self.m_graph = sorted(self.m_graph, key=lambda item: item[2])
    
    # Auxiliary arrays
    parent = []
    subtree_sizes = []

    # Initialize `parent` and `subtree_sizes` arrays
    for node in range(self.m_num_of_nodes):
        parent.append(node)
        subtree_sizes.append(0)

    # Important property of MSTs
    # number of egdes in a MST is 
    # equal to (m_num_of_nodes - 1)
    while e < (self.m_num_of_nodes - 1):
        # Pick an edge with the minimal weight
        node1, node2, weight = self.m_graph[i]
        i = i + 1

        x = self.find_subtree(parent, node1)
        y = self.find_subtree(parent, node2)

        if x != y:
            e = e + 1
            result.append([node1, node2, weight])
            self.connect_subtrees(parent, subtree_sizes, x, y)
    
    # Print the resulting MST
    for node1, node2, weight in result:
        print("%d - %d: %d" % (node1, node2, weight))

The algorithm itself combines all previously described methods. First of all, it sorts a list of edges by their weight and initializes parent and subtree_sizes arrays. Then it iterates over the sorted list of edges, picks them one by one, and adds them to the resulting tree if possible. The algorithm stops when the number of edges of the resulting tree is equal to (num_of_nodes - 1). The resulting tree is the minimum spanning tree that we've been trying to construct.

Let's test this algorithm on the previously used example graph.

Testing Kruskal's Algorithm on the Example Graph

First of all, you need to create the Graph object that represents your example graph:

# Example graph has 9 nodes
example_graph = Graph(9)

Then you add all nodes from the example grah to the example_graph using add_edge() method:

example_graph.add_edge(0, 1, 4)
example_graph.add_edge(0, 2, 7)
example_graph.add_edge(1, 2, 11)
example_graph.add_edge(1, 3, 9)
example_graph.add_edge(1, 5, 20)
example_graph.add_edge(2, 5, 1)
example_graph.add_edge(3, 6, 6)
example_graph.add_edge(3, 4, 2)
example_graph.add_edge(4, 6, 10)
example_graph.add_edge(4, 8, 15)
example_graph.add_edge(4, 7, 5)
example_graph.add_edge(4, 5, 1)
example_graph.add_edge(5, 7, 3)
example_graph.add_edge(6, 8, 5)
example_graph.add_edge(7, 8, 12)

And, finally, run the algorithm:

example_graph.kruskals_mst()

That will yield the following output:

2 - 5: 1
4 - 5: 1
3 - 4: 2
5 - 7: 3
0 - 1: 4
6 - 8: 5
3 - 6: 6
0 - 2: 7

As you can see, this output describes the MST which is the same as the one illustrated in the section Kruskal's Algorithm Explained. Each row of the output represents one edge in the following manner: node1 - node2: weight. You can see the constructed MST in the following illustration.

What is the Complexity of Kruskal's Algorithm

Let's assume that you have a graph with E edges and N nodes. You will find the MST using Kruskal's algorithm in time of O(E log E), which is equivelant to O(E log N).

Conclusion

Kruskal's algorithm is one of the most used algorithms for finding a minimum spanning tree in a graph alongside Prim's and Borůvka's algorithm. Each of them has pretty much the same complexity, so it's up to you to decide which one to use.

In this article, we've illustrated the Kruskal's algorithm on a practical example and gave you a real implementation, so you can use it in your projects and understand how it works.

Lessson 9/10

You must first start the course before tracking progress.

Mark completed

Previous Next