Python's itertools - compress(), dropwhile(), takewhile() and groupby()

Introduction

In this guide, we'll take a look at how to harness the power of iterators using Python's itertools module.

The itertools module provides us with an interface for creating fast and memory-efficient iterators. These iterators can be infinite, combinatorial, or terminating.

Iterator vs Iterable

An Iterator is an intelligent pointer which can guide (iterate) us through the items of an Iterable (container) in a certain order. Consider a list of colors, as well as a list of integers:

colors = ['red', 'blue', 'pink']
ints = [1, 3, 5, 4, 2]

Even though we defined these lists in a particular order, they don't have to be stored in the same order when placed in memory:

iterators:  it1                 it2 
             V                   V
memory:     red   4   2   blue   1    3    pink   5

If we went through memory in order, we'd get that the second element of the colors list is 4, which is why we need iterators.

The iterator's job is to find the next element of the list in memory, no matter where it is. This is done with the built-in next() function, which calls the iterator's __next__() method and returns the next element. Here, it1 would scour through the memory it has access to and return blue, while it2 would return 3.

A great feature of iterators is that we can define how they search for elements in their respective iterables. We can, for instance, ask an iterator to skip all odd numbers and return a subset. This is achieved by implementing a custom __next__() method, or by using the built-in itertools functions, which generate specific iterators for going through objects in various ways.
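As a rough illustration of both options, the sketch below first steps through a list with the built-in iter() and next() functions, and then defines a small custom iterator. The class name EvenOnly is made up for this example and is not part of itertools.

```python
ints = [1, 3, 5, 4, 2]

# iter() asks the list for an iterator; next() advances it by one element
it = iter(ints)
first = next(it)
second = next(it)


class EvenOnly:
    """Hypothetical iterator that skips the odd numbers of a sequence."""

    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        # Keep pulling values until an even one shows up. When the inner
        # iterator is exhausted, its StopIteration ends this iterator too.
        while True:
            value = next(self._it)
            if value % 2 == 0:
                return value


evens = list(EvenOnly(ints))
print(first, second, evens)  # 1 3 [4, 2]
```

Writing the class by hand is rarely needed in practice, which is where the itertools functions come in.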

The iteration tools we'll be going over are:

  • compress()
  • dropwhile()
  • takewhile()
  • groupby()

Each of these iterator-building functions (they generate iterators) can be used on their own, or combined.

The compress() Function

The compress(data, selector) function creates an iterator that picks values from data according to a parallel sequence of booleans, selector. If a value from data lines up with a True value in selector, it's picked; otherwise, it's skipped.

If data and selector are not of the same size, compress() stops when either the data or selector lists have been exhausted:

# Importing the compress tool
from itertools import compress

cars = ['Audi', 'Volvo', 'Benz', 
        'BMW', 'Nissan', 'Mazda',
        'Ford']
        
selector = [True, True, False, False, 
            False, True, False]

# This makes an iterator that filters elements, 
# from data, for which selector values amount to True
my_cars = compress(cars, selector)

for each in my_cars:
    print(each)

This results in:

Audi
Volvo
Mazda
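The stopping behavior for inputs of different lengths can be checked with a short sketch. The shortened car list and both selectors are made up for this illustration:

```python
from itertools import compress

cars = ['Audi', 'Volvo', 'Benz', 'BMW']

# Selector shorter than data: the cars past the second one are never considered
picked_with_short = list(compress(cars, [True, False]))

# Selector longer than data: the surplus selector values are simply ignored
picked_with_long = list(compress(cars, [False, True, True, True, True, True]))

print(picked_with_short)  # ['Audi']
print(picked_with_long)   # ['Volvo', 'Benz', 'BMW']
```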

The selector can also be a list of 1's and 0's, or any truthy/falsy values.

You typically acquire these boolean lists through some sort of condition, such as:

int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
boolean_list = [x % 2 == 0 for x in int_list]

# OR

boolean_list = [1 if x % 2 == 0 else 0 for x in int_list]

print(boolean_list)

Here, we've generated a boolean_list with a True for each even number:

[False, True, False, True, False, True, False, True, False, True]

# OR

[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
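A condition-based selector like this can be fed straight into compress(). The sketch below is one possible way to do it; the names and scores lists are hypothetical sample data, and a generator expression stands in for the list so that no intermediate selector list is built:

```python
from itertools import compress

int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The selector is computed lazily, one value at a time
evens = list(compress(int_list, (x % 2 == 0 for x in int_list)))
print(evens)  # [2, 4, 6, 8, 10]

# The selector can also be derived from a second, parallel list
names = ['Ann', 'Bob', 'Cid']   # hypothetical sample data
scores = [72, 38, 55]           # hypothetical sample data
passed = list(compress(names, (score >= 50 for score in scores)))
print(passed)  # ['Ann', 'Cid']
```

The second case is arguably where compress() is most useful, since filter() on its own only sees the values being filtered.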

Typically, to keep things short, you'll use the compress() tool, as well as the other tools, without assigning the result to a new variable:

import itertools

word = 'STACKABUSE'
selector = [1, 0, 1, 0, 0, 0, 0, 1, 1, 1]

for each in itertools.compress(word, selector):
    print(each)

The result is:

S
A
U
S
E

Technically, we can also mix and match any truthy/falsy values in the selector:

from itertools import compress

cars = ['Audi', 'Volvo', 'Benz',
        'BMW', 'Nissan', 'Mazda', 'Ford']

# Empty string is falsy, non empty is truthy
selector = [True, 1, 0, 0, '', 1, 'string']

for each in compress(cars, selector):
    print(each)

Output is:

Audi
Volvo
Mazda
Ford

That said, mixing apples and pears like this is not advised, as it makes the selector harder to read.

The dropwhile() Function

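The dropwhile(criteria, sequence) function creates an iterator that drops (skips over) elements of the sequence for as long as the criteria function returns True for them. As soon as the criteria function returns False for an element, that element and everything after it are returned.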

The criteria function is typically a lambda function but doesn't have to be. Usually, if it's a simple function, it's shortened to a lambda, while complex functions aren't:

from itertools import dropwhile

int_list = [0, 1, 2, 3, 4, 5, 6]
result = list(dropwhile(lambda x: x < 3, int_list))

print(result)

Given this lambda function, the criteria is True for every element with a value less than 3, so the leading elements less than 3 are skipped over. They're dropped while the criteria is true:

[3, 4, 5, 6]

Instead of a lambda function, we can define a more complicated one and pass in a reference to it instead:

from itertools import dropwhile

def doesnt_contain_character(text):
    # True is returned while the word has no 'a' in it
    substring = 'a'
    return substring not in text
        
string_list = ['lorem', 'ipsum', 'dolor', 'sit', 'amet']
print(list(dropwhile(doesnt_contain_character, string_list)))

For instance, this function checks that a string doesn't contain a substring, in this case just a. If the given string contains a, False is returned, and if it doesn't contain it, True is returned. Thus, the function returns True for every word in the sequence before amet, and those words are dropped from the result:

['amet']

However, all elements after the criteria fails will be included. In our case, everything after the 'amet' element will be included, regardless of the criteria:

from itertools import dropwhile

def doesnt_contain_character(text):
    # True is returned while the word has no 'a' in it
    substring = 'a'
    return substring not in text
        
string_list = ['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'a', 'b']
print(list(dropwhile(doesnt_contain_character, string_list)))

This drops the elements until 'amet' and stops dropping them after that:

['amet', 'a', 'b']
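To make the difference from ordinary filtering concrete, here is a minimal sketch that runs the same criteria through dropwhile() and through a list comprehension. The readings list is invented for this comparison:

```python
from itertools import dropwhile

readings = [1, 2, 5, 1, 2]

# dropwhile() stops testing after the first element that fails the criteria,
# so the later 1 and 2 are kept
after_drop = list(dropwhile(lambda x: x < 3, readings))

# A list comprehension tests every element, so they are removed
filtered = [x for x in readings if not x < 3]

print(after_drop)  # [5, 1, 2]
print(filtered)    # [5]
```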

The takewhile() Function

The takewhile(criteria, sequence) function is the opposite of dropwhile(). It returns elements for as long as the criteria function returns True for them, and stops at the first element for which it returns False. Let's adapt the previous example to check whether a word contains a certain character:

from itertools import takewhile

def contains_character(text):
    # True is returned as long as the word contains an 'o'
    substring = 'o'
    return substring in text

string_list = ['lorem', 'ipsum', 'dolor', 'sit', 'amet']
print(list(takewhile(contains_character, string_list)))

This results in:

['lorem']

Since the criteria fails on the second element, iteration stops there. Even though 'dolor' also contains the character o, it's never taken into consideration.
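Because takewhile() stops at the first failure, it can also cut off an infinite iterator. The sketch below shows that, and then pairs takewhile() with dropwhile() to split one list at the first failing element. The has_o helper and the shortened word list are assumptions made for this example:

```python
from itertools import count, dropwhile, takewhile

# count(1) yields 1, 2, 3, ... without end; takewhile() stops consuming it
# at the first square that is not below 50
squares = list(takewhile(lambda sq: sq < 50, (n * n for n in count(1))))
print(squares)  # [1, 4, 9, 16, 25, 36, 49]


def has_o(word):
    # Hypothetical criteria function: True if the word contains an 'o'
    return 'o' in word


words = ['lorem', 'dolor', 'sit', 'amet']

# The same criteria splits the list into a leading part and the remainder
head = list(takewhile(has_o, words))
tail = list(dropwhile(has_o, words))
print(head, tail)  # ['lorem', 'dolor'] ['sit', 'amet']
```

Note that the split runs over a list here. With a one-shot iterator, both calls would need their own copy of the data.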

The groupby() Function

The groupby(iterable, key_function) function generates an iterator that bundles together consecutive elements which belong to the same group. Whether an element belongs to a group or not depends on the key_function. It computes a key value for each element, the key value in this case being a specific group's id.

A cluster ends, and a new one is created, whenever the key_function returns an id different from the previous element's id, even if that id has been seen earlier in the sequence.

If the key_function is not specified, it defaults to the identity function. However, it's worth noting that even duplicate values won't be clustered together if they're separated by another cluster:

from itertools import groupby

word = "aaabbbccaabbbbb"

for key, group in groupby(word):
    print(key, list(group))

Intuitively, you might expect all instances of a and b to be clustered together, but since there are other clusters between them, they're separated into clusters of their own:

a ['a', 'a', 'a']
b ['b', 'b', 'b']
c ['c', 'c']
a ['a', 'a']
b ['b', 'b', 'b', 'b', 'b']

Note: The usual way to avoid this is to presort the iterable based on the keys, so that equal keys end up next to each other.
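A minimal sketch of the presorting approach, reusing the word from the example above and counting how often each character occurs:

```python
from itertools import groupby

word = "aaabbbccaabbbbb"

# sorted() places equal characters next to each other,
# so every key now shows up in exactly one group
counts = {key: len(list(group)) for key, group in groupby(sorted(word))}
print(counts)  # {'a': 5, 'b': 8, 'c': 2}
```

Keep in mind that sorted() builds a full list in memory, so this trades away some of the laziness of groupby(). When a key_function is used, the data should be sorted with that same function.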

Now, let's define a custom key_function, which can be a lambda or dedicated function:

from itertools import groupby

some_list = [("Animal", "cat"),
             ("Animal", "dog"),
             ("Animal", "lion"),
             ("Plant", "dandelion"),
             ("Plant", "blumen")]

for key, group in groupby(some_list, lambda x: x[0]):
    key_and_group = {key: list(group)}
    print(key_and_group)

We've made a list of tuples, where the first element denotes a general categorization, whether an entry is an Animal or a Plant, and the second element denotes either an animal or plant name.

Then, we've grouped these based on the first element, and printed each key together with the list of entries in its group:

{'Animal': [('Animal', 'cat'), ('Animal', 'dog'), ('Animal', 'lion')]}
{'Plant': [('Plant', 'dandelion'), ('Plant', 'blumen')]}
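When the entries are not already ordered by category, sorting and grouping with the same key function gives one group per category. The sketch below is one way to collect just the names into a dictionary. The entries list is made-up sample data, and operator.itemgetter(0) is used here in place of the lambda:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical, unordered sample data
entries = [("Animal", "cat"),
           ("Plant", "fern"),
           ("Animal", "dog"),
           ("Plant", "moss")]

category = itemgetter(0)  # same effect as lambda x: x[0]

# Sort and group with the same key, then keep only the names.
# sorted() is stable, so names keep their original relative order.
by_category = {
    key: [name for _, name in group]
    for key, group in groupby(sorted(entries, key=category), key=category)
}
print(by_category)  # {'Animal': ['cat', 'dog'], 'Plant': ['fern', 'moss']}
```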

Conclusion

In this guide, we've taken a look at the compress(), dropwhile(), takewhile() and groupby() iteration tools in Python's built-in itertools module.

If you want to learn more about the itertools module and iterators in general, feel free to check out our other guides.

Last Updated: March 7th, 2023

© 2013-2024 Stack Abuse. All rights reserved.
