Heap Sort in Python

Introduction

Heap Sort is another example of an efficient sorting algorithm. Its main advantage is that it has a great worst-case runtime of O(n*logn) regardless of the input data.

As the name suggests, Heap Sort relies heavily on the heap data structure - a common implementation of a Priority Queue.

Without a doubt, Heap Sort is one of the simplest sorting algorithms to implement and coupled with the fact that it's a fairly efficient algorithm compared to other simple implementations, it's a common one to encounter.

Heap Sort

Heap Sort works by "removing" elements from the heap part of the array one-by-one and adding them to the sorted part of the array. Before we get further into the explanation and revisit the heap data structure, we should mention a few attributes of Heap Sort itself.

It is an in-place algorithm, meaning that it requires a constant amount of additional memory, i.e. the memory needed doesn't depend on the size of the initial array itself, other than the memory needed to store that array.

For example, no copies of the original array are necessary, and there is no recursion and recursive call stacks. The simplest implementation of Heap Sort usually uses a second array to store the sorted values. We will be using this approach since it's a lot more intuitive and easy to follow in code, but it can be implemented completely in-place.

Heap Sort is unstable, meaning that it does not maintain the relative order of elements with equal values. This isn't an issue with primitive types (like integers and characters...) but it can be a problem when we sort complex types, like objects.

For example, imagine we have a custom class Person with the age and name fields, and several objects of that class in an array, including a person called "Mike" aged 19 and "David", also aged 19 - appearing in that order.

If we decided to sort that array of people by age, there would be no guarantee that "Mike" would appear before "David" in the sorted array, even though they appeared in that order in the initial array. It can happen, but it's not guaranteed.

Fun fact: Heap Sort is the sorting algorithm of choice in the Linux Kernel

The Heap Data Structure

Heaps are one of the most popular and heavily used data structures in computer science - not to mention very popular during Software Engineering interviews.

We'll talk of heaps keeping track of the smallest element (min-heap), but they can just as easily be implemented to keep track of the largest element (max-heap).

Simply put, a min-heap is a tree-based data structure in which every node is smaller that all of its children. Most often a binary tree is used. Heaps have three supported operations - delete_minimum(), get_minimum(), and add().

You can only delete the first element in the heap, after which it's "re-sorted". Heaps "re-sort" themselves after an element is added or removed, so that the the smallest element is always in the first position.

Note: This in no way means that heaps are sorted arrays. The fact that every node is smaller than its children isn't enough to guarantee that the whole heap is in ascending order.

Let's look at an example of a heap:

As we can see, the above example does fit the description of a heap but is not sorted. We won't go into details of the heap implementation since that is not the focus of this article. The crucial advantage of the heap data structure we leverage when using it in Heap Sort is that the next smallest element is always the first element in the heap.

Note: Thanks to the way heaps sort elements after an element is removed, the complexity of the next smallest element moving to the first position, while keeping the array a heap still, takes O(logn) time, which is a highly efficient operation.

Implementation

Sorting Arrays

Python provides methods for creating and using heaps so we don't have to implement them ourselves:

  • heappush(list, item): Adds an element to the heap, and re-sorts it afterward so that it remains a heap. Can be used on an empty list.
  • heappop(list): Pops (removes) the first (smallest) element and returns that element. The heap remains a heap after this operation, so we don't have to call heapify().
  • heapify(list): Turns the given list into a heap. It is worth noting that this method exists even though we won't be using this since we don't want to change our original array.

Now that we know this, the implementation for Heap Sort is fairly straight-forward:

from heapq import heappop, heappush

def heap_sort(array):
    heap = []
    for element in array:
        heappush(heap, element)

    ordered = []

    # While we have elements left in the heap
    while heap:
        ordered.append(heappop(heap))

    return ordered

array = [13, 21, 15, 5, 26, 4, 17, 18, 24, 2]
print(heap_sort(array))

Output:

[2, 4, 5, 13, 15, 17, 18, 21, 24, 26]

As we can see, the heavy lifting is done with the heap data structure, all we have to do is add all the elements we need and remove them one by one. It's almost like a coin counting machine that sorts the inputted coins by their value and we can take them out afterwards.

Sorting Custom Objects

Things get a little more complicated when using custom classes. Usually, we advise against overriding comparison operators in classes for the purpose of using our sorting algorithms for them, and instead suggest rewriting the algorithm so that it takes a lambda function comparator instead.

However, since our implementation relies on the built-in heap methods, we can't do that here.

Python does provide the following methods:

  • heapq.nlargest(*n*, *iterable*, *key=None*): Returns a list with the n largest elements from the dataset defined by iterable.
  • heapq.nsmallest(*n*, *iterable*, *key=None*): Returns a list with the n smallest elements from the dataset defined by iterable.

Which we could use to simply get n = len(array) largest/smallest elements but the methods themselves do not use Heap Sort and are essentially equivalent to just calling the sorted() method.

The only solution we have left for custom classes is to actually override the comparison operators. This sadly limits us to only one type of comparison per class. In our example it limits us to sorting Movie objects by year.

However, it does let us demonstrate using Heap Sort on custom classes. Let's go ahead and define the Movie class:

from heapq import heappop, heappush

class Movie:
    def __init__(self, title, year):
        self.title = title
        self.year = year

    def __str__(self):
        return str.format("Title: {}, Year: {}", self.title, self.year)

    def __lt__(self, other):
        return self.year < other.year

    def __gt__(self, other):
        return other.__lt__(self)

    def __eq__(self, other):
        return self.year == other.year

    def __ne__(self, other):
        return not self.__eq__(other)

And now, let's slightly modify our heap_sort() function:

def heap_sort(array):
    heap = []
    for element in array:
        heappush(heap, element)

    ordered = []

    while heap:
        ordered.append(heappop(heap))

    return ordered

And finally, let's instantiate a few movies, put them in an array, and then sort them:

movie1 = Movie("Citizen Kane", 1941)
movie2 = Movie("Back to the Future", 1985)
movie3 = Movie("Forrest Gump", 1994)
movie4 = Movie("The Silence of the Lambs", 1991);
movie5 = Movie("Gia", 1998)

array = [movie1, movie2, movie3, movie4, movie5]

for movie in heap_sort(array):
    print(movie)

Output:

Title: Citizen Kane, Year: 1941
Title: Back to the Future, Year: 1985
Title: The Silence of the Lambs, Year: 1991
Title: Forrest Gump, Year: 1994
Title: Gia, Year: 1998

Comparison to Other Sorting Algorithms

One of the main reasons Heap Sort is still used fairly often, even though it's often outperformed by a well-implemented Quick Sort, is its reliability.

Heap Sort's main advantage here are the O(n*logn) upper bound as far as time complexity is concerned, and security concerns. Linux kernel developers give the following reasoning to using Heap Sort over Quick Sort:

Sorting time of Heap Sort is O(n*logn) both on average and worst-case. While qsort is about 20% faster on average, it suffers from an exploitable O(n*n) worst-case behavior and extra memory requirements that make it less suitable for kernel use.

Furthermore, Quick Sort behaves poorly in predictable situations, and given enough knowledge of the internal implementation, it could create a security risk (mainly DDoS attacks) since the bad O(n2) behavior could easily be triggered.

Another algorithm that Heap Sort is often compared to is Merge Sort, which has the same time complexity.

Merge Sort has the advantage of being stable and intuitively parallelizable, while Heap Sort is neither.

Another note is that Heap Sort is slower than Merge Sort in most cases, even though they have the same complexity, since Heap Sort has larger constant factors.

Heap Sort can, however, be implemented much more easily in-place than Merge Sort can, so it's preferred when memory is a more important factor than speed.

Conclusion

As we saw, Heap Sort isn't as popular as other efficient, general-purpose algorithms but its predictable behavior (other than being unstable) make it a great algorithm to use where memory and security are more important than slightly faster run-time.

It's really intuitive to implement and leveraging the built-in functionality provided with Python, all we essentially have to do is put the items in a heap and take them out - similar to a coin counter.

Author image
Hey guys, I want to point out that I don't have any social media - since there have been some messages to colleague of mine with the same name. If you need any help - post it in the comments :)