Introduction
In this guide, we'll take a look at how to harness the power of iterators using Python's itertools module.
The itertools module provides us with an interface for creating fast and memory-efficient iterators. These iterators can be infinite, combinatorial, or terminating.
Iterator vs Iterable
An Iterator is an intelligent pointer which can guide (iterate) us through the items of an Iterable (container) in a certain order. Consider a list of colors, as well as a list of integers:
colors = ['red', 'blue', 'pink']
ints = [1, 3, 5, 4, 2]
Even though we defined these lists in a particular order, they don't have to be stored in the same order when placed in memory:
iterators:  it1                it2
             V                  V
memory:     red   4   2   blue  1   3   pink   5
If we went through memory in order, we'd conclude that the second element of the colors list is 4, which is why we need iterators.
The iterator's job is to find the next element of the list in memory, no matter where it is. This is done by passing the iterator to the built-in next() function, which calls the iterator's __next__() method and returns the next element. it1 would scour through the memory it has access to and return blue, while it2 would return 3.
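As a minimal sketch of this, the built-in iter() and next() functions can be applied to the two lists from above. The names it1 and it2 mirror the illustration, and the other variable names are only for demonstration:

```python
colors = ['red', 'blue', 'pink']
ints = [1, 3, 5, 4, 2]

# iter() asks an iterable for a fresh iterator over its items
it1 = iter(colors)
it2 = iter(ints)

# Each call to next() returns the following item and advances the iterator
first_color = next(it1)   # 'red'
second_color = next(it1)  # 'blue'
first_int = next(it2)     # 1
second_int = next(it2)    # 3

print(second_color, second_int)

# Once an iterator is exhausted, next() raises StopIteration,
# unless a default value is passed as the second argument
remaining = list(it1)         # consumes what is left: ['pink']
sentinel = next(it1, 'done')  # 'done'
```

Note that an iterator only moves forward: after list(it1) has consumed the remaining items, it1 has nothing left to return.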
A great feature of iterators is that we can define how they search for elements in their respective iterables. We can, for instance, ask an iterator to skip all odd numbers and return a subset. This is achieved by implementing a custom __next__() method, or by using the functions in the built-in itertools module, which generate specific iterators for iterating through objects in various ways.
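To make the idea of a custom iterator concrete, here is a rough sketch of a class that skips odd numbers. The class name EvenOnly is hypothetical and exists only for this illustration:

```python
class EvenOnly:
    """Hypothetical iterator that yields only the even numbers of an iterable."""

    def __init__(self, numbers):
        self._source = iter(numbers)

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):
        # Keep pulling from the source until an even number turns up.
        # When the source is exhausted, its StopIteration propagates
        # and ends our iteration as well.
        while True:
            value = next(self._source)
            if value % 2 == 0:
                return value


ints = [1, 3, 5, 4, 2]
evens = list(EvenOnly(ints))
print(evens)  # [4, 2]
```

In practice, a generator function or one of the itertools functions is usually shorter than writing such a class by hand.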
The iteration tools we'll be going over are:
compress()
dropwhile()
takewhile()
groupby()
Each of these iterator-building functions (they generate iterators) can be used on their own, or combined.
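As a rough sketch of combining two of these tools (each one is covered in its own section below), the following chains dropwhile() and takewhile(). The readings list and the thresholds are made up for illustration:

```python
from itertools import dropwhile, takewhile

# Made-up sample data for this sketch
readings = [1, 2, 4, 7, 9, 12, 3, 15]

# Skip the leading values below 4, then keep values while they stay below 10
after_start = dropwhile(lambda x: x < 4, readings)
window = list(takewhile(lambda x: x < 10, after_start))

print(window)  # [4, 7, 9]
```

Because both functions return iterators, the output of one can be passed directly as the input of the other without building an intermediate list.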
The compress() Function
The compress(data, selector) function creates an iterator that selectively picks values from data according to the boolean list selector. If a value from data corresponds to a True value in the selector list, it is picked; otherwise, it is skipped.
If data and selector are not the same size, compress() stops as soon as either of them is exhausted:
# Importing the compress tool
from itertools import compress
cars = ['Audi', 'Volvo', 'Benz',
        'BMW', 'Nissan', 'Mazda',
        'Ford']
selector = [True, True, False, False,
            False, True, False]
# This makes an iterator that picks the elements
# from data for which the selector values are truthy
my_cars = compress(cars, selector)
for each in my_cars:
    print(each)
This results in:
Audi
Volvo
Mazda
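The example above uses lists of equal length. To illustrate what happens when the sizes differ, here is a small sketch with a shortened cars list and two made-up selectors:

```python
from itertools import compress

cars = ['Audi', 'Volvo', 'Benz', 'BMW']

# Selector shorter than data: 'Benz' and 'BMW' are never reached
short_selector = [True, True]
from_short = list(compress(cars, short_selector))

# Selector longer than data: the extra selector values are ignored
long_selector = [False, True, False, True, True, True]
from_long = list(compress(cars, long_selector))

print(from_short)  # ['Audi', 'Volvo']
print(from_long)   # ['Volvo', 'BMW']
```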
The selector can also be a list of 1's and 0's, or any truthy/falsy values.
You typically acquire these boolean lists through some sort of condition, such as:
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
boolean_list = [x % 2 == 0 for x in int_list]
# OR
boolean_list = [1 if x % 2 == 0 else 0 for x in int_list]
print(boolean_list)
Here, we've generated a boolean_list with a True for each even number:
[False, True, False, True, False, True, False, True, False, True]
# OR
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
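A selector built this way can be handed straight to compress(). As a small sketch, the selector here is a generator expression rather than a list, which works because compress() accepts any iterable. The name is_even is only illustrative:

```python
from itertools import compress

int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The selector can be any iterable, including a generator expression
is_even = (x % 2 == 0 for x in int_list)
evens = list(compress(int_list, is_even))

print(evens)  # [2, 4, 6, 8, 10]
```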
Typically, to keep things short, you'll use compress(), as well as the other tools, without assigning the result to a new variable:
import itertools
word = 'STACKABUSE'
selector = [1, 0, 1, 0, 0, 0, 0, 1, 1, 1]
for each in itertools.compress(word, selector):
    print(each)
The result is:
S
A
U
S
E
Technically, we can also mix and match any truthy and falsy values in the selector:
from itertools import compress
cars = ['Audi', 'Volvo', 'Benz',
'BMW', 'Nissan', 'Mazda', 'Ford']
# Empty string is falsy, non empty is truthy
selector = [True, 1, 0, 0, '', 1, 'string']
for each in compress(cars, selector):
    print(each)
Output is:
Audi
Volvo
Mazda
Ford
Though, it's worth noting that mixing apples and pears like this is not advised.
The dropwhile() Function
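The dropwhile(criteria, sequence) function creates an iterator that drops (skips over) the elements of the sequence for as long as the criteria function returns True for them. As soon as the criteria function returns False for an element, that element and all the elements after it are returned.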
The criteria function is typically a lambda function but doesn't have to be. Usually, if it's a simple function, it's shortened to a lambda, while complex functions aren't:
from itertools import dropwhile
int_list = [0, 1, 2, 3, 4, 5, 6]
result = list(dropwhile(lambda x : x < 3, int_list))
print(result)
Given this lambda function, every element with a value less than 3 returns True, so the leading elements less than 3 are skipped over. They're dropped while the criteria is true:
[3, 4, 5, 6]
Instead of a lambda function, we can define a more complicated one and pass in a reference to it instead:
from itertools import dropwhile
def doesnt_contain_character(text):
    substring = 'a'
    if substring in text:
        return False
    else:
        return True
string_list = ['lorem', 'ipsum', 'dolor', 'sit', 'amet']
print(list(dropwhile(doesnt_contain_character, string_list)))
This function checks whether a string doesn't contain a given substring, in this case just a. If the given string contains a, False is returned, and if it doesn't, True is returned. Thus, all words in the sequence before amet return True and are dropped from the result:
['amet']
However, once the criteria fails, all the remaining elements are included. In our case, everything after the 'amet' element will be included, regardless of the criteria:
from itertools import dropwhile
def doesnt_contain_character(text):
    substring = 'a'
    if substring in text:
        return False
    else:
        return True
string_list = ['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'a', 'b']
print(list(dropwhile(doesnt_contain_character, string_list)))
This drops the elements until 'amet' and stops dropping them after that:
['amet', 'a', 'b']
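Since dropwhile() only tests the leading elements, it is not a general-purpose filter. As a hedged comparison, the sketch below contrasts it with itertools.filterfalse(), which tests every element. The list and the is_small helper are made up for this illustration:

```python
from itertools import dropwhile, filterfalse

int_list = [0, 1, 5, 2, 6, 1]

def is_small(x):
    # Hypothetical criteria function for this sketch
    return x < 3

# dropwhile() stops testing after the first element that fails the criteria
dropped = list(dropwhile(is_small, int_list))

# filterfalse() tests every element and keeps those that fail the criteria
filtered = list(filterfalse(is_small, int_list))

print(dropped)   # [5, 2, 6, 1]
print(filtered)  # [5, 6]
```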
The takewhile() Function
The takewhile(criteria, sequence) function is the polar opposite of dropwhile(). It returns elements from the start of the sequence for as long as the criteria function returns True, and stops at the first element for which it returns False. Let's rewrite the previous example to check whether a word contains a certain character:
from itertools import takewhile
def contains_character(text):
    substring = 'o'
    if substring in text:
        return True
    else:
        return False
string_list = ['lorem', 'ipsum', 'dolor', 'sit', 'amet']
print(list(takewhile(contains_character, string_list)))
This results in:
['lorem']
Since the criteria fails on the second element, 'ipsum', iteration stops there. Even though 'dolor' also contains the character o, it's not taken into consideration.
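Because takewhile() stops at the first failing element, it can also be used to cut off an infinite iterator. A minimal sketch, assuming we want the multiples of five below 20 and using itertools.count() as the infinite source:

```python
from itertools import count, takewhile

# count(0, 5) yields 0, 5, 10, 15, ... without ever stopping
multiples_of_five = count(0, 5)

# takewhile() cuts the infinite stream off at the first value
# that is not below 20
below_twenty = list(takewhile(lambda x: x < 20, multiples_of_five))

print(below_twenty)  # [0, 5, 10, 15]
```

Without takewhile(), calling list() on the counter would never finish.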
The groupby() Function
The groupby(iterable, key_function) function generates an iterator that bundles together consecutive elements which belong to the same group. Whether an element belongs to a group or not depends on the key_function. It computes the key value for each element, the key value in this case being a specific group's id.
A cluster ends, and a new one is created, whenever the key_function returns a different id from the previous one, even if that id has been seen before.
If the key_function is not specified, it defaults to the identity function. However, it's worth noting that even with duplicate values, they won't be clustered together if they're separated by another cluster:
from itertools import groupby
word = "aaabbbccaabbbbb"
for key, group in groupby(word):
    print(key, list(group))
Intuitively, you might expect all instances of a and b to be clustered together, but since there are other clusters between them, they're separated into clusters of their own:
a ['a', 'a', 'a']
b ['b', 'b', 'b']
c ['c', 'c']
a ['a', 'a']
b ['b', 'b', 'b', 'b', 'b']
Note: To avoid this, presort the iterable using the same key that is used for grouping.
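One way to see the effect of presorting, as a sketch reusing the word from the example above and counting the size of each group:

```python
from itertools import groupby

word = "aaabbbccaabbbbb"

# sorted() places equal characters next to each other,
# so groupby() sees each key exactly once
grouped = [(key, len(list(group))) for key, group in groupby(sorted(word))]

print(grouped)  # [('a', 5), ('b', 8), ('c', 2)]
```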
Now, let's define a custom key_function, which can be a lambda or a dedicated function:
from itertools import groupby
some_list = [("Animal", "cat"),
             ("Animal", "dog"),
             ("Animal", "lion"),
             ("Plant", "dandelion"),
             ("Plant", "blumen")]
for key, group in groupby(some_list, lambda x: x[0]):
    key_and_group = {key: list(group)}
    print(key_and_group)
We've made a list of tuples, where the first element denotes a general categorization (whether an entry is an Animal or a Plant), and the second element is the name of an animal or plant.
Then, we've grouped the tuples based on their first element and printed each group as a dictionary:
{'Animal': [('Animal', 'cat'), ('Animal', 'dog'), ('Animal', 'lion')]}
{'Plant': [('Plant', 'dandelion'), ('Plant', 'blumen')]}
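If a single dictionary of names per category is more convenient, the groups can be collected with a dictionary comprehension. This is only a sketch: the input is deliberately unsorted, and the category helper is a hypothetical name introduced here:

```python
from itertools import groupby

some_list = [("Animal", "cat"),
             ("Animal", "dog"),
             ("Plant", "dandelion"),
             ("Animal", "lion")]

def category(entry):
    # Hypothetical key function: the first element of each tuple
    return entry[0]

# Sort by the same key used for grouping so each category forms one group
ordered = sorted(some_list, key=category)

names_by_category = {
    key: [name for _, name in group]
    for key, group in groupby(ordered, key=category)
}

print(names_by_category)
# {'Animal': ['cat', 'dog', 'lion'], 'Plant': ['dandelion']}
```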
Conclusion
In this guide, we've taken a look at the compress(), dropwhile(), takewhile() and groupby() iteration tools in Python's built-in itertools module.