Python: Validate Email Address with Regular Expressions (RegEx)

Python: Validate Email Address with Regular Expressions (RegEx)

Introduction

Regular Expressions, or RegEx for short, are expressions of patterns that can be used for text search and replace actions, validations, string splitting, and much more. These patterns consist of characters, digits and special characters, in such a form that the pattern matches certain segments of text we're searching through.

Regular Expressions are widely used for pattern-matching, and various programming languages have interfaces for representing them, as well as interacting with the matches results.

In this article, we will take a look at how to validate email addresses in Python, using Regular Expressions.

If you would like to learn more about Python's interface with Regular Expressions, read our Guide to Regular Expressions in Python!

General-Purpose Email Regular Expression

It's worth noting that there is no such regular expression that matches every possible valid email address. Although, there are expressions that can match most valid email addresses.

We need to define what kind of email address format are we looking for. The most common email format is:

(username)@(domainname).(top-leveldomain)

Thus, we can boil it down to a pattern of the @ symbol dividing the prefix from the domain segment.

The prefix is the recipient;s name - a string that may contain uppercase and lowercase letters, numbers, and some special characters like the . (dot), -(hyphen), and _ (underscore).

The domain consists of its name and a top-level domain divided by a . (dot) symbol. The domain name can have uppercase and lowercase letters, numbers, and - (hyphen) symbols. Additionally, the top-level domain name must be at least 2 characters long (either all uppercase or lowercase letters), but can be longer.

Note: There are a lot more detailed rules regarding valid emails, such as character count, more specific characters that can be used, etc. We'll be taking a look at an extended, highly fail-proof Regular Expression as defined by RFC5322 after the general-purpose approach.

In simple terms, our email Regular Expression could look like this:

(string1)@(string2).(2+characters)

This would match correctly for email addresses such as:

[email protected]
[email protected]
[email protected]

Again, using the same expression, these email addresses would fail:

[email protected]
[email protected]
[email protected]

It's worth noting that the strings shouldn't contain certain special characters, lest they break the form again. Additionally, the top-level domain can't be ... Accounting for those cases as well, we can put these rules down into a concrete expression that takes in a few more cases into account than the first representation:

([A-Za-z0-9]+[.-_])*[A-Za-z0-9][email protected][A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+

A special character in the prefix cannot be just before the @ symbol, nor can the prefix start with it, so we made sure that there is at least one alphanumeric character before and after every special character.

As for the domain, an email can contain a few top-level domains divided with a dot.

Obviously, this regex is more complicated than the first one, but it covers all of the rules we have defined for the email format. Yet again, it can probably fail to properly validate some edge case that we haven't thought of.

Validate Email Address with Python

The re module contains classes and methods to represent and work with Regular Expressions in Python, so we'll import it into our script. The method that we will be using is re.fullmatch(pattern, string, flags). This method returns a match object only if the whole string matches the pattern, in any other case it returns None.

Note: re.fullmatch() was introduced in Python 3.4, before that, re.match() was used instead. On newer versions, fullmatch() is prefered.

Let's compile() the Regular Expression from before, and define a simple function that accepts an email address and uses the expression to validate it:

import re

regex = re.compile(r'([A-Za-z0-9]+[.-_])*[A-Za-z0-9][email protected][A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+')

def isValid(email):
    if re.fullmatch(regex, email):
      print("Valid email")
    else:
      print("Invalid email")

The re.compile() method compiles a regex pattern into a regex object. It's mostly used for efficiency reasons, when we plan on matching the pattern more than once.

Now, let's test the code on some of the examples we took a look at earlier:

Free eBook: Git Essentials

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

isValid("[email protected]")
isValid("[email protected]")
isValid("[email protected]")
isValid("[email protected]")

This results in:

Valid email
Valid email
Invalid email
Invalid email

Awesome, we've got a functioning system!

Robust Email Regular Expression

The expression we've used above works well for the majority of cases and will work well for any reasonable application. However, if security is of higher concern, or if you enjoy writing Regular Expressions, you may opt to tighten the scope of possibilities while still allowing valid email addresses to pass.

Long expressions tend to get a bit convoluted and hard to read, and this expression is no exception:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=^_`{|}~-]+)*
|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@
(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
|\[(?:(?:(2(5[0-5]|[0-4][0-9])
|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])
|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This is the RFC5322-compliant Regular Expression that covers 99.99% of input email addresses.* Explaining it with words is typically off the table, but visualizing it helps a lot:

*Image and claim are courtesy of EmailRegex.com.

This actually isn't the only expression that satisfies RFC5322. Many of them do, with varying degrees of success. A shorter version which still complies with the specification can be easily imported into Python's re.compile() method to represent an expression:

import re

regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")

def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")

isValid("[email protected]")
isValid("[email protected]")
isValid("[email protected]")
isValid("[email protected]")

This also results in:

Valid email
Valid email
Invalid email
Invalid email

Conclusion

To wrap up this guide, let's revise what we've learned. There are many ways to validate emails using Regular Expressions, mostly depending on what certain format we are looking for. In relation to that, there is no one unique pattern that works for all email formats, we simply need to define the rules that we want the format to follow and construct a pattern accordingly.

Each new rule reduces the degree of freedom on the accepted addresses.

Last Updated: November 11th, 2021
Was this article helpful?

Improve your dev skills!

Get tutorials, guides, and dev jobs in your inbox.

No spam ever. Unsubscribe at any time. Read our Privacy Policy.

Want a remote job?

    Prepping for an interview?

    • Improve your skills by solving one coding problem every day
    • Get the solutions the next morning via email
    • Practice on actual problems asked by top companies, like:
     
     
     

    Make Clarity from Data - Quickly Learn Data Visualization with Python

    Learn the landscape of Data Visualization tools in Python - work with Seaborn, Plotly, and Bokeh, and excel in Matplotlib!

    From simple plot types to ridge plots, surface plots and spectrograms - understand your data and learn to draw conclusions from it.

    © 2013-2021 Stack Abuse. All rights reserved.