Introduction
Regular Expressions (RegEx) are one of the most powerful tools in programming, yet they're also commonly misunderstood. They help you match patterns in a flexible, dynamic and efficient way, as well as allow you to perform operations based on the results.
This can include validating that certain patterns exist in a text, finding these matches, extracting and replacing them, etc. For example, have you ever tried to sign up to a website and found out they rejected your password for not including numbers or capital letters? There is a good chance this website used Regular Expressions to make sure you put the right characters.
In this guide, we're going to take a deep dive into Regular Expressions, how they work and how to use them in Java. We'll mainly be taking a look at the Pattern and Matcher classes of the regex package, followed by some practical examples and common tasks.
If you'd like to read more about the built-in support for Regular Expressions with Java Strings - read our Java: Guide to Built-in String RegEx Support!
What are Regular Expressions?
Regular Expressions (RegEx) are patterns used to match characters in some text. These patterns are called search patterns and allow us to find a given pattern in a certain string or sets of strings. We can validate the presence of this pattern, count its instances, and then extract it or replace it easily, when found.
Java Regular Expression Classes
Java's standard API provides us with several classes to work with Regular Expressions, straight out of the box:
MatchResult interface
Matcher class
Pattern class
PatternSyntaxException
All of these fit snugly into the java.util.regex package, which can easily be imported as:
// Importing all of the classes/interfaces from the regex package
import java.util.regex.*;
// You can alternatively import only the classes you need,
// which keeps the imports explicit
import java.util.regex.Pattern;
import java.util.regex.Matcher;
The Pattern class
A Pattern instance is the compiled representation of a certain Regular Expression. The Pattern class doesn't have any public constructors, but rather uses the static .compile() method to create and return a Pattern instance.
The .compile() method has two overloads. The first takes only the Regular Expression in string format, while the second also takes an int of match flags. The match flags can include CASE_INSENSITIVE, LITERAL, MULTILINE, or several other options, and can be combined with the bitwise OR operator (|).
Let's create a Pattern instance with a string-represented Regular Expression:
Pattern p = Pattern.compile("Stack|Abuse");
System.out.println(p);
This outputs the following:
Stack|Abuse
This isn't an output that's too surprising - it's pretty much the same as the string we passed into the compile() method. The class itself won't help us much on its own, though - we have to use a Matcher to actually match the compiled RegEx against some string.
The Matcher instance for a Pattern can easily be created via the matcher() method of the Pattern instance:
Pattern p = Pattern.compile("Stack|Abuse", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("If you keep calling the method many times, you'll perform abuse on the stack.");
This Matcher can then be used to put the compiled pattern to use.
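As a minimal sketch of how the flags argument and PatternSyntaxException fit together, the class below combines flags with the bitwise OR operator and checks whether a regex compiles at all. The class name, helper names and sample text are illustrative assumptions, not part of the API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, names chosen for illustration only
class PatternFlagsDemo {

    // Returns true if the regex compiles, false if its syntax is invalid
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Counts how many times "^stack" is found in the text with the given flags
    static int countLineStarts(String text, int flags) {
        Matcher m = Pattern.compile("^stack", flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Stack\nstack\nabuse";
        // No flags: ^ only matches at the start of the input, and case matters
        System.out.println(countLineStarts(text, 0)); // 0
        // Ignoring case, the very first line matches
        System.out.println(countLineStarts(text, Pattern.CASE_INSENSITIVE)); // 1
        // MULTILINE lets ^ match at the start of every line
        System.out.println(countLineStarts(text, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)); // 2
        // An unclosed group is a syntax error
        System.out.println(isValidRegex("(Stack")); // false
    }
}
```

Since PatternSyntaxException is unchecked, you only need to catch it when the expression comes from an untrusted source, such as user input.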
The Matcher Class
The Matcher class has several methods that allow us to actually put a compiled pattern to use:
Method | Description | Returns |
.matches() | It checks whether the Regex matches the entire input. | boolean |
.group() | It extracts the matched subsequence. | String |
.start() | It gets the starting index of the matched subsequence. | int |
.end() | It gets the ending index (exclusive) of the matched subsequence. | int |
.find() | It finds the next subsequence that matches the Regex pattern. | boolean |
.find(int start) | It resets the matcher and finds the next subsequence that matches the Regex pattern, starting at a given index. | boolean |
.groupCount() | It returns the number of capturing groups in the pattern, not the number of matches. | int |
With these, you can get pretty creative in terms of logic - finding the starting indices of sequences, the total number of matches, the sequences themselves and even extracting and returning them. However, these methods might not be as intuitive as they seem to be.
Note: matches() checks the entire string, not a certain section. find() iterates through the string, and returns true on each occurrence.
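The difference between the two is easiest to see side by side. The following is a small sketch, with hypothetical helper names, that runs the same expression through both methods:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class MatchesVsFindDemo {

    // True only if the whole input matches the regex
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if any part of the input matches the regex
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("Stack", "StackAbuse"));   // false
        System.out.println(containsMatch("Stack", "StackAbuse"));  // true
        System.out.println(matchesWhole("Stack.*", "StackAbuse")); // true
    }
}
```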
Typically, the find() method is used with a while loop:
while (m.find()) {
System.out.println(String.format("Matched sequence: %s", m.group()));
System.out.println(String.format("Start and end of sequence: %s %s \n", m.start(), m.end()));
}
This results in:
Matched sequence: abuse
Start and end of sequence: 58 63
Matched sequence: stack
Start and end of sequence: 71 76
Additionally, each group is a parentheses-delimited value within the Pattern. In our case, there are no groups as there are no parentheses encompassing Stack|Abuse. The groupCount() call will thus always return 0 on our Pattern. The group() method depends on this distinction too, and you can even get given groups by passing in their indices in the compiled pattern.
Let's turn this RegEx into two groups:
Pattern p = Pattern.compile("(Stack)|(Abuse)", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("If you keep calling the method many times, you'll perform abuse on the stack.");
System.out.println("Number of groups: " + m.groupCount());
while (m.find()) {
System.out.println(String.format("Matched sequence: %s", m.group()));
System.out.println(String.format("Start and end of sequence: %s %s\n", m.start(), m.end()));
}
Number of groups: 2
Matched sequence: abuse
Start and end of sequence: 58 63
Matched sequence: stack
Start and end of sequence: 71 76
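Note that groupCount() reports how many groups the pattern defines, regardless of the input. If you need the number of matches instead, one option is to count the successful find() calls yourself. The helper below is a sketch with an assumed name and sample text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class MatchCountDemo {

    // Counts the non-overlapping matches of the pattern in the input
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(Stack)|(Abuse)", Pattern.CASE_INSENSITIVE);
        String input = "Stack, stack, STACK and abuse";
        // Two groups are defined in the pattern itself
        System.out.println(p.matcher(input).groupCount()); // 2
        // Four separate matches are found in this input
        System.out.println(countMatches(p, input)); // 4
    }
}
```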
The group() method allows you to extract groups, even based on their indices or names, from a given string, after it's been matched. But be wary about iteration - lest you end up running into null matches or an IllegalStateException.
A Matcher is stateful: each call to find() advances its position in the input, and group(), start() and end() always refer to the most recent match.
Thus, if you want to get different groups, such as to say, extract groups in string date-time representations or the host of an email address, you should either iterate through the string via find() and get the next available group via m.group() or run matches() and get the groups manually:
Pattern p = Pattern.compile("(Stack)(Abuse)", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("StackAbuse");
System.out.println("Number of groups: " + m.groupCount());
if(m.matches()) {
System.out.println(String.format("Group 1: '%s' \nGroup 2: '%s'", m.group(1), m.group(2)));
}
Number of groups: 2
Group 1: 'Stack'
Group 2: 'Abuse'
The matches() method will only ever return true if the entire sequence matches the RegEx, and in our case, ignoring letter case, this is the only input it'll fire for.
More on groups in a later section.
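To make the null and IllegalStateException pitfalls concrete, here is a short sketch. The class and method names are assumptions made for this example; the Matcher calls themselves (group(), find(), reset()) are standard:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class MatcherStateDemo {

    // group() is only valid after a successful match operation
    static boolean groupThrows(Matcher m) {
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Rewinds the matcher and collects every match from the beginning
    static List<String> allMatches(Matcher m) {
        m.reset();
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("(Stack)|(Abuse)").matcher("Stack Abuse");
        System.out.println(groupThrows(m)); // true, nothing has been matched yet
        m.find();
        System.out.println(m.group(1));     // Stack
        System.out.println(m.group(2));     // null, the second group didn't take part
        m.find();
        m.find();                           // third call fails, the input is exhausted
        System.out.println(groupThrows(m)); // true, the last match attempt failed
        System.out.println(allMatches(m));  // [Stack, Abuse]
    }
}
```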
Anatomy of Regular Expressions
Once acquainted with the classes Java uses to represent Regular Expressions and the classes it uses to actually match the sequences in strings - let's get into Regular Expressions themselves.
Regular Expressions don't only consist of string literals, like we've used them so far. They consist of metacharacters, quantifiers, escape characters and groups. Let's take a look at these individually.
Metacharacters
Metacharacters, as the name implies, provide meta information about the RegEx, and allow us to create dynamic expressions, rather than just literal static ones. A metacharacter has a special meaning within a Regular Expression and won't be matched as a literal string, and they're used as wildcards or stand-ins for various patterns of sequences.
Some of the most commonly used metacharacters are:
Metacharacter | Meaning |
. | Find a match of any one character (except line terminators, by default) |
^ | Find a match at the beginning of a string |
$ | Find a match at the end of a string |
\d | Find a digit |
\D | Find a non-digit |
\s | Find a whitespace character |
\S | Find a non-whitespace character |
\w | Find a word character [a-zA-Z_0-9] |
\W | Find a non-word character |
\b | Find a word boundary |
\B | Find a non-word boundary |
You can use any number of these metacharacters, though for longer expressions - they can get a bit messy.
For instance, let's replace our previous Regular Expression pattern with one that searches for a sequence that starts with the letter "H", contains any four characters after that, and ends with "Stack":
Pattern p = Pattern.compile("^(H)(....)(Stack)$", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("HelloStack");
while (m.find()) {
System.out.println(String.format("Matched sequence: %s", m.group()));
System.out.println(String.format("Start and end of sequence: %s %s\n", m.start(), m.end()));
}
Matched sequence: HelloStack
Start and end of sequence: 0 10
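The other metacharacters from the table work the same way. As a rough sketch, with a hypothetical findAll() helper and made-up sample strings, here's how \b, \d and \s behave:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class MetacharacterDemo {

    // Collects every match of the regex in the input
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \b only lets "cat" match as a whole word, so "concat" and "cats" are skipped
        System.out.println(findAll("\\bcat\\b", "cat concat cats cat.")); // [cat, cat]
        // \d matches one digit at a time
        System.out.println(findAll("\\d", "Room 42")); // [4, 2]
        // \s matches spaces and tabs alike
        System.out.println(findAll("\\s", "a b\tc").size()); // 2
    }
}
```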
Though, using just metacharacters limits us to a degree. What if we wanted to check for any sequence of characters, instead of 4?
Quantifiers
Quantifiers are a set of characters that allow us to define how many times the preceding element has to occur for a match:
Quantifier | Meaning |
n+ | Find a match of at least one or more of n |
n* | Find a match of 0 or more of n |
n? | Find a match of zero or one occurrence of n |
n{x} | Find a match that contains the sequence of n for x times |
n{x,y} | Find a match that contains the sequence of n between x and y times |
n{x,} | Find a match that contains the sequence of n for at least x times |
So, we could easily tweak our previous RegEx with these. For instance, let's try to match a string within another string that starts with "Hello", followed by any sequence of characters, and finishes with three exclamation marks:
Pattern p = Pattern.compile("(Hello)(.*)(!{3})$", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("I wake up and think to myself: Hello Wonderful World!!!");
while (m.find()) {
System.out.println(String.format("Matched sequence: %s", m.group()));
System.out.println(String.format("Start and end of sequence: %s %s\n", m.start(), m.end()));
}
This results in:
Matched sequence: Hello Wonderful World!!!
Start and end of sequence: 31 55
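For a quick feel of each quantifier from the table, the sketch below checks a handful of made-up inputs with the static Pattern.matches() convenience method, which compiles the expression and matches the whole input in one call. The wrapper class and method are illustrative assumptions:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class QuantifierDemo {

    // Thin wrapper around Pattern.matches(), which requires a full-input match
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));    // true, u occurs zero times
        System.out.println(matches("colou?r", "colouur"));  // false, u occurs twice
        System.out.println(matches("a+", ""));              // false, + needs at least one
        System.out.println(matches("a*", ""));              // true, * accepts none
        System.out.println(matches("\\d{3}", "1234"));      // false, exactly three digits
        System.out.println(matches("\\d{2,4}", "123"));     // true, between two and four
        System.out.println(matches("\\d{2,}", "12345"));    // true, two or more
    }
}
```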
Escape Characters
If you'd like to escape the effects of any special character, such as a metacharacter or a quantifier - you can escape them by prefixing them with a \. However, since we're defining a RegEx within a Java string literal, you'll have to escape the escape character as well. For instance, if you want to match for a dollar sign, which would typically mean matching if a given sequence is found at the end of a string - you'd escape its effects, and escape the escape character itself:
Pattern p = Pattern.compile("$", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("It costs $2.50");
Pattern p2 = Pattern.compile("\\$", Pattern.CASE_INSENSITIVE);
Matcher m2 = p2.matcher("It costs $2.50");
The first matcher checks whether the string ends with the sequence that precedes the $ character, which is empty in this case. This is true, since every string ends with, well, nothing - the pattern is found at the very end, at index 14. In the second matcher, we're matching for the actual dollar sign, which is found at index 9 of our input.
Neither of these two code snippets would result in an exception, so be careful to check whether your Regular Expressions fail silently, like in the first case.
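When a whole chunk of text should be treated literally, escaping every character by hand gets tedious. The standard Pattern.quote() method wraps a string so that none of its characters carry special meaning. The sketch below, with an assumed helper name, compares the approaches:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class EscapeDemo {

    // Returns the start index of the first match, or -1 if there is none
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "It costs $2.50";
        // Unescaped $ is an anchor and matches the empty end of the input
        System.out.println(firstMatchStart("$", text));                    // 14
        // Escaped $ matches the literal dollar sign
        System.out.println(firstMatchStart("\\$", text));                  // 9
        // Pattern.quote() escapes the whole literal, including the dot
        System.out.println(firstMatchStart(Pattern.quote("$2.50"), text)); // 9
        // An unescaped dot matches any character, so "2750" slips through
        System.out.println(firstMatchStart("2.50", "It costs 2750"));      // 9
        System.out.println(firstMatchStart(Pattern.quote("2.50"), "It costs 2750")); // -1
    }
}
```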
Groups
We've used groups a bit until now - they allow us to find matches for multiple sets. You can group any number of sets together or as separate sets. Oftentimes, groups are used to allow you to segregate some input into known sections, and then extract them, such as dissecting an email address into the name, symbol and host.
Group 0 denotes the entire pattern, while all other groups are numbered from left to right as Group 1, Group 2, Group n...
Pattern → (A)(B)(C)
Here, Group 1 is A, Group 2 is B and Group 3 is C.
// The entire expression is group 0 -> Trying to match an email value
// The first group is trying to match any character sequence
// The second group is trying to match the @ symbol
// The third group is trying to match the host name as any sequence of characters
// The escaped dot (\\.) sits outside the groups, so it isn't captured
// The final group is trying to check whether the organization type consists of 3 a-z characters
String email = "someone@gmail.com";
Pattern pattern = Pattern.compile("(.*)(@)(.*)\\.([a-z]{3})");
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
System.out.println("Full email: " + matcher.group(0));
System.out.println("Username: " + matcher.group(1));
System.out.println("Hosting Service: " + matcher.group(3));
System.out.println("TLD: " + matcher.group(4));
}
Note: The \w metacharacter denotes a word character and is shorthand for [a-zA-Z_0-9]. It matches any lowercase or uppercase letter, as well as digits and the underscore.
This code results in:
Full email: someone@gmail.com
Username: someone
Hosting Service: gmail
TLD: com
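Group indices get hard to track as expressions grow, which is where named groups help. The sketch below is one possible way to write the same dissection with the (?<name>...) syntax and group(String name). The group names, the stricter character classes and the helper method are choices made for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class NamedGroupDemo {

    // user: anything but @, host: anything but @ or a dot, tld: three lowercase letters
    private static final Pattern EMAIL =
            Pattern.compile("(?<user>[^@]+)@(?<host>[^@.]+)\\.(?<tld>[a-z]{3})");

    // Returns the requested named group, or null if the input doesn't match
    static String part(String email, String groupName) {
        Matcher m = EMAIL.matcher(email);
        return m.matches() ? m.group(groupName) : null;
    }

    public static void main(String[] args) {
        String email = "someone@gmail.com";
        System.out.println(part(email, "user")); // someone
        System.out.println(part(email, "host")); // gmail
        System.out.println(part(email, "tld"));  // com
    }
}
```

Named groups still have indices, so group(1) and group("user") return the same value here.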
Regular Expression Uses and Java Examples
Some of the most common use cases of Regular Expressions are validation, searching and extraction, and replacement. In this section, let's use the rules we've laid out so far to validate, search and extract, as well as replace certain patterns of text. After that, we'll go through some common tasks, such as matching digits, single or multiple characters, etc.
Validate String in Java with Regular Expressions
You can validate whether a certain pattern is present in text, which can be as simple as a single word, or one of the various combinations you can produce with different metacharacters, characters and quantifiers. A simple example could be finding whether a word is present in some text.
We're going to look for the word "validate" in a sample text:
Pattern pattern = Pattern.compile("validate");
String longText = "Some sort of long text that we're looking for something in. " +
"We want to validate that what we're looking for is here!";
Matcher matcher = pattern.matcher(longText);
boolean found = matcher.find();
System.out.println(found);
This results in:
true
A more realistic example would be validating an email address, to check whether someone has really input a valid address or just used some spam value. For the purposes of this simplified example, a valid email contains some character sequence, followed by a @ symbol, a hostname (another character sequence) and an organization signifier, which contains three letters, and could be any combination - edu, com, org, etc. Real-world addresses allow more than this, such as dots in the username or longer top-level domains.
Knowing this, to validate an email address using RegEx in Java, we'll compile the expression and use the matches() method to check whether it's valid:
Pattern pattern = Pattern.compile("\\w*[@]\\w*[.][a-z]{3}");
Matcher matcher = pattern.matcher("someone@gmail.com");
boolean match = matcher.matches();
System.out.println(match);
This results in:
true
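If the same check runs often, it's common to compile the Pattern once and reuse it, since Pattern instances are immutable and safe to share. The sketch below is one way to wrap that up. The class name, method name and the switch from * to + are assumptions of this example; with *, the expression above would also accept an input such as "@.com":

```java
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class EmailValidatorDemo {

    // Compiled once and reused; + requires at least one character on each side of the @
    private static final Pattern SIMPLE_EMAIL = Pattern.compile("\\w+@\\w+\\.[a-z]{3}");

    // Deliberately simplified check, not a full email address validator
    static boolean isSimpleEmail(String input) {
        return input != null && SIMPLE_EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimpleEmail("someone@gmail.com"));    // true
        System.out.println(isSimpleEmail("@.com"));                // false
        System.out.println(isSimpleEmail("someone@gmail"));        // false
        // A limitation of this simplified pattern: dots in the username are rejected
        System.out.println(isSimpleEmail("first.last@gmail.com")); // false
    }
}
```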
Find and Extract Pattern in Java with Regular Expressions
Oftentimes, other than just validation - you want to find the starting and ending points of a given sequence. With this, you could create performant Find features for text editor applications, automating the searching process. Additionally, you could shorten the search of keywords on a page, applicant letter or any sort of text by finding the sequences you're interested in, and say, highlighting them for a human operator.
To find the start and end of a sequence using Regular Expressions, as we've seen before, we can use the start() and end() methods of the Matcher instance:
Pattern pattern = Pattern.compile("(search|match)");
String searchText = "You can easily search for a keyword in text using RegEx. " +
"A keyword is just a sequence of characters, that are easy to match.";
Matcher matcher = pattern.matcher(searchText);
while (matcher.find()) {
System.out.println("Found keyword: " + matcher.group());
System.out.println("Start index is: " + matcher.start());
System.out.println("End index is: " + matcher.end() + "\n");
}
The output will be as follows:
Found keyword: search
Start index is: 15
End index is: 21
Found keyword: match
Start index is: 118
End index is: 123
Here, we've also extracted the keywords - you can log them for analytical purposes, output them to a terminal, such as this, or otherwise manipulate them or act upon them. You could treat certain keywords in text as gateways to running other methods or commands.
For instance, when creating chat rooms or other applications where a user may communicate with other users - certain words may be censored to preserve a positive experience. In other cases, certain words may raise a red flag for human operators, where it may appear that a given user is inciting behavior that shouldn't be incited:
Pattern pattern = Pattern.compile("(fudge|attack)");
String message = "We're launching an attack at the pudding palace." +
"Make way through all the fudge, the King lies beyond the chocolate!";
Matcher matcher = pattern.matcher(message);
while (matcher.find()) {
System.out.println("Found keyword: " + matcher.group());
System.out.println("Start index is: " + matcher.start());
System.out.println("End index is: " + matcher.end());
if(matcher.group().equals("fudge")) {
System.out.println("This word might be inappropriate!");
} else if(matcher.group().equals("attack")) {
System.out.println("911? There's an attack going on!");
}
}
Though, things might not be as grim as you imagine them to be:
Found keyword: attack
Start index is: 19
End index is: 25
911? There's an attack going on!
Found keyword: fudge
Start index is: 73
End index is: 78
This word might be inappropriate!
Censorship isn't cool.
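Replacement, the remaining use case from the start of this section, follows the same shape: instead of reading each match, you hand the Matcher a replacement string via replaceAll(). The sketch below masks the flagged keywords. The class name, the mask and the added word boundaries are assumptions of this example:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class CensorDemo {

    // \b keeps the keywords from matching inside longer words
    private static final Pattern FLAGGED =
            Pattern.compile("\\b(fudge|attack)\\b", Pattern.CASE_INSENSITIVE);

    // Replaces every flagged keyword with a fixed mask
    static String censor(String message) {
        return FLAGGED.matcher(message).replaceAll("*****");
    }

    public static void main(String[] args) {
        System.out.println(censor("Make way through all the Fudge!"));
        // Make way through all the *****!
        System.out.println(censor("We're launching an attack."));
        // We're launching an *****.
    }
}
```

Keep in mind that $ and \ have a special meaning inside the replacement string. If the replacement text comes from elsewhere, Matcher.quoteReplacement() makes it literal.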
Extracting Email Addresses from Text
What if you just got a bunch of text containing email addresses and you'd like to extract them, if they're valid addresses? This isn't uncommon when scraping web pages for, say, contact info.
Note: Web scraping, when done, should be done ethically, and only if a website's robots.txt file allows you to. Make sure you're ToS-compliant, and that you don't spam a website's traffic and connections, causing damage to other users and the owners of the website.
Let's go ahead and parse some "scraped" text to extract email addresses from it:
Pattern pattern = Pattern.compile("\\w*[@]\\w*[.][a-z]{3}");
String text = "We want to extract all email in this text. " +
"Yadda yadda, some more text." +
"april@treutel.com\n" +
"arvid@larkin.net\n" +
"wrowe@quigley.org\n";
Matcher matcher = pattern.matcher(text);
// Requires java.util.List and java.util.ArrayList
List<String> emailList = new ArrayList<>();
while(matcher.find()) {
emailList.add(matcher.group());
}
System.out.println(emailList);
The output will be all emails found in the text:
[april@treutel.com, arvid@larkin.net, wrowe@quigley.org]
Matching Single Characters
To match a single character, as we've seen before, we simply denote it as a dot (.):
Pattern pattern = Pattern.compile(".tack");
Matcher matcher = pattern.matcher("Stack");
boolean match = matcher.matches();
System.out.println(match);
This results in:
true
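One detail worth knowing is that, by default, the dot doesn't match line terminators. The DOTALL flag changes that. Here's a brief sketch, with an assumed helper name and sample inputs:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class DotDemo {

    // Full-input match with the given flags
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(".tack", 0, "Stack"));                // true
        // The dot needs exactly one character, so a missing one fails
        System.out.println(matches(".tack", 0, "tack"));                 // false
        // A newline isn't matched by the dot by default
        System.out.println(matches(".tack", 0, "\ntack"));               // false
        // With DOTALL, the dot matches line terminators too
        System.out.println(matches(".tack", Pattern.DOTALL, "\ntack"));  // true
    }
}
```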
Matching Multiple Characters
Matching for multiple characters can be boiled down to a quantified dot (.), but much more commonly - you'll use a range of characters instead. For instance, let's check if a given string has any number of characters, belonging in the range of the alphabet:
Pattern pattern = Pattern.compile("[a-z]+");
Matcher matcher = pattern.matcher("stack");
boolean match = matcher.matches();
System.out.println(match);
Pattern pattern2 = Pattern.compile("[a-z]+");
Matcher matcher2 = pattern2.matcher("stack99");
boolean match2 = matcher2.matches();
System.out.println(match2);
This results in:
true
false
The second check returns false as the input string doesn't only contain the characters belonging to the lowercase alphabet - but also numbers.
Matching Word Sequences
Instead of alphabet ranges, you can also match patterns of \w - which is a shorthand for [a-zA-Z_0-9]:
Pattern pattern = Pattern.compile("\\w*");
Matcher matcher = pattern.matcher("stack");
boolean match = matcher.matches();
System.out.println(match);
Pattern pattern2 = Pattern.compile("\\w*");
Matcher matcher2 = pattern2.matcher("stack!");
boolean match2 = matcher2.matches();
System.out.println(match2);
This results in:
true
false
Matching Non-Word Sequences
Similar to \w, \W is another shorthand, this time for non-word sequences. It's essentially the reverse of \w, excluding all characters that fall into the category of [a-zA-Z_0-9]:
Pattern pattern = Pattern.compile("\\W*");
Matcher matcher = pattern.matcher("stack");
boolean match = matcher.matches();
System.out.println(match);
Pattern pattern2 = Pattern.compile("\\W*");
Matcher matcher2 = pattern2.matcher("?????");
boolean match2 = matcher2.matches();
System.out.println(match2);
This results in:
false
true
The word stack consists entirely of word characters, so the first matcher returns false. ? isn't in the [a-zA-Z_0-9] range, so the second matcher returns true.
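A practical use of \W is breaking text into words, since anything that isn't a word character can act as a separator. The sketch below uses the standard Pattern.split() method. The class name, helper and sample sentence are made up for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class SplitDemo {

    // One or more non-word characters act as a single separator
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    // Splits the text into its word-character runs
    static String[] words(String text) {
        return NON_WORD.split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("stack, abuse: regex!")));
        // [stack, abuse, regex]
    }
}
```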
Matching Digits and Non-Digits
To check whether a digit is present, we can use \d, and checking for any number of digits is as easy as applying a quantifier to it. Following the same convention as earlier, \D denotes non-digits instead of digits:
Pattern pattern = Pattern.compile("\\d*");
Matcher matcher = pattern.matcher("999");
boolean match = matcher.matches();
Pattern pattern2 = Pattern.compile("\\D*");
Matcher matcher2 = pattern2.matcher("https://www.youtube.com/watch?v=dQw4w9WgXcQ");
boolean match2 = matcher2.matches();
System.out.println(match);
System.out.println(match2);
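The output will be the following:
true
false
The second check returns false because the video ID in the URL contains the digits 4 and 9, and matches() requires the entire input to consist of non-digits.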
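When digits are mixed in with other text, find() with a quantified \d is usually the better fit, as it pulls out each run of digits. A minimal sketch, with hypothetical helper names and a made-up input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, names chosen for illustration only
class DigitDemo {

    // Collects every run of consecutive digits in the input
    static List<String> digitRuns(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        List<String> runs = new ArrayList<>();
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    // True if the whole input is free of digits
    static boolean hasNoDigits(String input) {
        return Pattern.matches("\\D*", input);
    }

    public static void main(String[] args) {
        System.out.println(digitRuns("Order 66 shipped in 3 days")); // [66, 3]
        System.out.println(hasNoDigits("stack"));                    // true
        System.out.println(hasNoDigits("stack99"));                  // false
    }
}
```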
Conclusion
Regular Expressions (RegEx) are one of the most powerful tools in programming, yet they're also commonly misunderstood. They help you match patterns in a flexible, dynamic and efficient way, as well as allow you to perform operations based on the results.
They can be daunting, as complex sequences tend to get very unreadable. However, they remain one of the most useful tools today. In this guide, we've gone over the basics of Regular Expressions and how to use the regex package to perform pattern matching in Java.