Check if Elements in a List Match a Regex in Python
Introduction
Let's say you have a list of home addresses and want to see which ones reside on a "Street", "Ave", "Lane", etc. Given the variability of physical addresses, you'd probably want to use a regular expression to do the matching. But how do you apply a regex to a list? That's exactly what we'll be looking at in this Byte.
Why Match Lists with Regular Expressions?
Regular expressions are one of the best, if not the best, ways to do pattern matching on strings. In short, they can be used to check if a string contains a specific pattern, replace parts of a string, and even split a string based on a pattern.
Another reason you may want to use a regex on a list of strings: you have a list of email addresses and you want to filter out all the invalid ones. You can use a regular expression to define the pattern of a valid email address and apply it to every element of the list. There are countless cases like this where applying a regex across a list of strings is useful.
Python's Regex Module
Python's re module provides built-in support for regular expressions. You can import it as follows:
import re
The re module has several functions to work with regular expressions, such as match(), search(), and findall(). We'll be using these functions to check if any element in a list matches a regular expression.
Link: For more information on using regex in Python, check out our article, Introduction to Regular Expressions in Python
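Before applying these functions to a list, it may help to see how the three differ on a single string. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat hat cat"  # hypothetical sample string

# match() only succeeds if the pattern matches at the start of the string
at_start = re.match(r"hat", text)

# search() returns the first match found anywhere in the string
anywhere = re.search(r"hat", text)

# findall() returns every non-overlapping match as a list of strings
every = re.findall(r"cat", text)

print(at_start)         # None
print(anywhere.span())  # (4, 7)
print(every)            # ['cat', 'cat']
```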
Using the match() Function
To check if any element in a list matches a regular expression, you can use a loop to iterate over the list and the re module's match() function to check each element. Here's an example:
import re
# List of strings
list_of_strings = ['apple', 'banana', 'cherry', 'date']
# Regular expression pattern for strings starting with 'a'
pattern = '^a'
for string in list_of_strings:
    if re.match(pattern, string):
        print(string, "matches the pattern")
In this example, the match() function checks if each string in the list starts with the letter 'a'. The output will be:
apple matches the pattern
Note: The ^ character in the regular expression pattern indicates the start of the string. So, ^a matches any string that starts with 'a'.
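If you only need a yes/no answer, or want the matching elements collected into a new list, the same check can be written more compactly. This is one possible sketch rather than the only approach: it pre-compiles the pattern with re.compile(), uses any() for the yes/no question, and uses a list comprehension to collect matches. The extra 'avocado' entry is added here purely for illustration.

```python
import re

# Same list as above, plus a hypothetical extra entry
list_of_strings = ['apple', 'banana', 'cherry', 'date', 'avocado']

# Compile once so the pattern can be reused for every element
pattern = re.compile(r'^a')

# True as soon as one element matches; stops at the first hit
has_match = any(pattern.match(s) for s in list_of_strings)

# All elements that match, in their original order
matching = [s for s in list_of_strings if pattern.match(s)]

print(has_match)  # True
print(matching)   # ['apple', 'avocado']
```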
This is a basic example, but you can use more complex regular expression patterns to match more specific conditions. For example, here is a regex for matching an email address:
([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+
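As a rough sketch of how such a pattern could be applied to a list, the example below filters a handful of made-up addresses. The email pattern is written with the hyphen last in its character class so it is read literally rather than as a range. It also assumes you want the whole string to be a valid address, so it swaps in re.fullmatch() instead of match(), which would accept trailing characters after a valid prefix.

```python
import re

# Email pattern with the hyphen placed last in the class ([._-])
email_pattern = re.compile(
    r'([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+'
)

# Hypothetical sample data
emails = ['jane.doe@example.com', 'not-an-email', 'bob@site', 'a_b@mail.co.uk']

# fullmatch() requires the entire string to match the pattern
valid = [e for e in emails if email_pattern.fullmatch(e)]
invalid = [e for e in emails if not email_pattern.fullmatch(e)]

print(valid)    # ['jane.doe@example.com', 'a_b@mail.co.uk']
print(invalid)  # ['not-an-email', 'bob@site']
```

Keep in mind that a pattern like this is only an approximation of what counts as a valid email address.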
Using the search() Function
While re.match() is great for checking the start of a string, re.search() scans through the string and returns a match object if it finds a match anywhere in the string. Let's tweak our previous example to find any string that contains "Hello".
import re
my_list = ['Hello World', 'Python Hello', 'Goodbye World', 'Say Hello']
pattern = "Hello"
for element in my_list:
    if re.search(pattern, element):
        print(f"'{element}' matches the pattern.")
The output will be:
'Hello World' matches the pattern.
'Python Hello' matches the pattern.
'Say Hello' matches the pattern.
As you can see, re.search() found the strings that contain "Hello" anywhere, not just at the start.
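To make the difference between the two functions concrete, here is a small sketch that runs both over the same list. The variable names starts_with and contains are just illustrative.

```python
import re

my_list = ['Hello World', 'Python Hello', 'Goodbye World', 'Say Hello']
pattern = re.compile("Hello")

# match() only accepts strings where "Hello" is at the very start
starts_with = [s for s in my_list if pattern.match(s)]

# search() accepts strings where "Hello" appears at any position
contains = [s for s in my_list if pattern.search(s)]

print(starts_with)  # ['Hello World']
print(contains)     # ['Hello World', 'Python Hello', 'Say Hello']
```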
Using the findall() Function
The re.findall() function returns all non-overlapping matches of a pattern in a string, as a list of strings. This can be useful when you want to extract all occurrences of a pattern from a string. Let's use this function to count the occurrences of "Hello" in each string in our list.
import re
my_list = ['Hello Hello', 'Python Hello', 'Goodbye World', 'Say Hello Hello']
pattern = "Hello"
for element in my_list:
    matches = re.findall(pattern, element)
    if matches:
        print(f"'{element}' contains {len(matches)} occurrence(s) of 'Hello'.")
The output will be:
'Hello Hello' contains 2 occurrence(s) of 'Hello'.
'Python Hello' contains 1 occurrence(s) of 'Hello'.
'Say Hello Hello' contains 2 occurrence(s) of 'Hello'.
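findall() is also handy when you want to know what matched, not just how often. Returning to the address scenario from the introduction, the sketch below records which street suffix appears in each address. The addresses and the short suffix list are invented for illustration. The group is non-capturing, written as (?:...), because findall() returns the group contents instead of the whole match when a pattern contains capturing groups.

```python
import re

# Hypothetical sample addresses
addresses = ['12 Oak Street', '9 Elm Ave', '77 Sunset Blvd', '3 Pine Lane']

# \b keeps the suffix from matching inside a longer word
suffix = re.compile(r'\b(?:Street|Ave|Lane)\b')

# Map each address to the list of suffixes found in it
found = {addr: suffix.findall(addr) for addr in addresses}

# Keep only the addresses with at least one match
with_suffix = [addr for addr, hits in found.items() if hits]

print(found['9 Elm Ave'])       # ['Ave']
print(found['77 Sunset Blvd'])  # []
print(with_suffix)              # ['12 Oak Street', '9 Elm Ave', '3 Pine Lane']
```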
Working with Nested Lists
What happens if our list contains other lists? The re module's functions expect a string, so they won't work directly on a nested list, just as they wouldn't work if passed the top-level list in the previous examples. We need to flatten the list or iterate through each sub-list.
Let's consider a list of lists, where each sub-list contains strings. We want to find out which strings contain "Hello".
import re
my_list = [['Hello World', 'Python Hello'], ['Goodbye World'], ['Say Hello']]
pattern = "Hello"
for sub_list in my_list:
    for element in sub_list:
        if re.search(pattern, element):
            print(f"'{element}' matches the pattern.")
The output will be:
'Hello World' matches the pattern.
'Python Hello' matches the pattern.
'Say Hello' matches the pattern.
We first loop through each sub-list in the main list. Then for each sub-list, we loop through its elements and apply re.search() to find the matching strings.
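The other option mentioned above, flattening the list first, might look like the following sketch. It uses itertools.chain.from_iterable() from the standard library, which handles exactly one level of nesting; deeper nesting would need a different approach.

```python
import re
from itertools import chain

my_list = [['Hello World', 'Python Hello'], ['Goodbye World'], ['Say Hello']]
pattern = re.compile("Hello")

# Join the sub-lists into one flat list of strings (one level deep only)
flat = list(chain.from_iterable(my_list))

# The flat list can now be filtered like any other list of strings
matches = [s for s in flat if pattern.search(s)]

print(flat)     # ['Hello World', 'Python Hello', 'Goodbye World', 'Say Hello']
print(matches)  # ['Hello World', 'Python Hello', 'Say Hello']
```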
Working with Mixed Data Type Lists
Python lists are versatile and can hold a variety of data types. This means you can have a list with integers, strings, and even other lists. This is great for a lot of reasons, but it also means you have to deal with potential issues when the data types matter for your operation. When working with regular expressions, we only deal with strings. So, what happens when we have a list with mixed data types?
import re
mixed_list = [1, 'apple', 3.14, 'banana', '123', 'abc123', '123abc']
regex = r'\d+' # matches any sequence of digits
for element in mixed_list:
    if isinstance(element, str) and re.match(regex, element):
        print(f"{element} matches the regex")
    else:
        print(f"{element} does not match the regex or is not a string")
In this case, the output will be:
1 does not match the regex or is not a string
apple does not match the regex or is not a string
3.14 does not match the regex or is not a string
banana does not match the regex or is not a string
123 matches the regex
abc123 does not match the regex or is not a string
123abc matches the regex
We first check if the current element is a string. Only then do we check if it matches the regular expression. This is because the re.match() function expects a string (or bytes-like object) as input. If you pass it an integer or a float, Python raises a TypeError.
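To see why the type check matters, the short sketch below passes an integer to re.match() and catches the resulting exception. It then shows one possible alternative, converting the value with str() before matching. Whether that conversion makes sense depends on your data, so treat it as an option rather than a recommendation.

```python
import re

regex = r'\d+'

# Passing a non-string to re.match() raises a TypeError
try:
    re.match(regex, 123)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(f"re.match() rejected a non-string: {exc}")

# One alternative: convert the value to a string before matching
as_text = re.match(regex, str(123))
print(as_text.group())  # 123
```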
Conclusion
Python's re module provides several functions to match regex patterns in strings. In this Byte, we learned how to use these functions to check if any element in a list matches a regular expression. We also saw how to handle lists with mixed data types. Regular expressions can be complex, so take your time to understand them. With a bit of practice, you'll find that they can be used to solve many problems when working with strings.