Using Regex for Text Manipulation in Python

Introduction

Text preprocessing is one of the most important tasks in Natural Language Processing (NLP). For instance, you may want to remove all punctuation marks from text documents before they are used for text classification. Similarly, you may want to extract numbers from a text string. Writing manual scripts for such preprocessing tasks takes a lot of effort and is prone to errors. Regular expressions (regex), which are available in most programming languages, make these preprocessing tasks much easier.

A regular expression is a text string that describes a search pattern, which can be used to match or replace patterns inside a string with a minimal amount of code. In this tutorial, we will use different types of regular expressions in Python.

Python provides regular expressions through the re module, which is part of the standard library. Import it with the following statement:

import re  

Searching Patterns in a String

One of the most common NLP tasks is to search if a string contains a certain pattern or not. For instance, you may want to perform an operation on the string based on the condition that the string contains a number.

To search for a pattern within a string, the match, search, and findall functions of the re module are used.

The match Function

Initialize a variable text with a text string as follows:

text = "The film Titanic was released in 1998"  

Let's write a regex that matches a string of any length and any character (the dot matches any character except a newline):

result = re.match(r".*", text)  

The first parameter of the match function is the regex that you want to search for, and the second is the string to search in. The pattern is written as an ordinary string enclosed in single or double quotes. The r prefix is not part of the regex itself; it makes the string a raw string literal, so that backslashes, such as the one in \d, reach the regex engine unchanged.

The above regex will match the text string, since we are trying to match a string of any length and any character. If a match is found, the match function returns a match object. Recent Python 3 versions report its type as re.Match; on the older version used for the output below it appears as _sre.SRE_Match:

type(result)  

Output:

_sre.SRE_Match  

Now to find the matched string, you can use the following command:

result.group(0)  

Output:

'The film Titanic was released in 1998'  

If the match function finds no match, it returns None.
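Because a failed match gives back None, calling group on the result directly would raise an AttributeError. The following is a minimal sketch of guarding against that; the pattern "Avatar" and the variable name matched are chosen here purely for illustration:

```python
import re

text = "The film Titanic was released in 1998"

# "Avatar" does not occur in the text, so re.match returns None.
result = re.match(r"Avatar", text)

# Check for None before calling .group() on the result.
if result is None:
    matched = None
    print("No match")
else:
    matched = result.group(0)
    print(matched)
```

With this text the script prints "No match".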

The previous regex matches a string of any length and any character, so it also matches an empty string. To test this, set the text variable to an empty string:

text = ""  

Now, if you execute the same regex again, a match will still be found:

result = re.match(r".*", text)  

Since we specified to match the string with any length and any character, even an empty string is being matched.

To require a string of at least one character, use the following regex instead:

result = re.match(r".+", text)  

Here the plus sign specifies that the string should have at least one character. Since text is currently an empty string, this call returns None.
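As a quick check of the difference between the two quantifiers, the sketch below runs both patterns against an empty string. The variable names empty, star, and plus are illustrative only:

```python
import re

empty = ""

star = re.match(r".*", empty)  # zero or more characters
plus = re.match(r".+", empty)  # one or more characters

print(star is not None)       # True: .* accepts the empty string
print(repr(star.group(0)))    # ''
print(plus is None)           # True: .+ needs at least one character
```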

Searching for Letters

The match function can be used to find letters in a string. Let's initialize the text variable with the following text:

text = "The film Titanic was released in 1998"

To match letters, both uppercase and lowercase, we can use the following regex:

result = re.match(r"[a-zA-Z]+", text)

This regex matches any letter from lowercase a to lowercase z or from uppercase A to uppercase Z. The plus sign specifies that there must be at least one such letter. Let's print the match found by the above expression:

print(result.group(0))  

Output:

The  

In the output, you can see that the first word, The, is returned. This is because the match function only tries to match at the beginning of the string. The regex asks for a run of uppercase or lowercase letters, and the run that starts the string is The. The space after The is not a letter, so the matching stops there and the expression returns just The.
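A side note on letter ranges: the uppercase range should be written A-Z. A range written as A-z looks similar but, in ASCII order, also covers the characters [, \, ], ^, _ and ` that sit between Z and a. The sketch below compares the two on a made-up sample string (the names sample, wide, and strict are illustrative):

```python
import re

sample = "snake_case[0]"

# A-z spans every ASCII character from 'A' to 'z', punctuation included.
wide = re.match(r"[a-zA-z]+", sample).group(0)

# A-Z restricts the range to uppercase letters only.
strict = re.match(r"[a-zA-Z]+", sample).group(0)

print(wide)    # snake_case[
print(strict)  # snake
```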

However, there is a problem with this. If a string starts with a number instead of a letter, the match function returns None even if there are letters after the number. Let's see this in action:

text = "1998 was the year when the film titanic was released"
result = re.match(r"[a-zA-Z]+", text)
type(result)  

Output:

NoneType  

In the above script, we have updated the text variable so that it now starts with a digit. We then used the match function to search for letters in the string. Though the text string contains letters, None is returned because the match function only looks for a match at the very beginning of the string.

To solve this problem we can use the search function.

The search Function

The search function is similar to the match function in that it tries to match the specified pattern. However, unlike the match function, it scans the whole string and returns the first location where the pattern matches, rather than checking only the beginning. Therefore, the search function returns a match even if the string does not start with a letter but contains letters elsewhere, as shown below:

text = "1998 was the year when the film titanic was released"
result = re.search(r"[a-zA-Z]+", text)
print(result.group(0))  

Output:

was  

The search function returns "was" since this is the first match that is found in the text string.
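To see the two functions side by side, here is a small sketch that applies the same pattern with both and also prints where the match was found. The variable names at_start and anywhere are illustrative; span is a standard method of match objects:

```python
import re

text = "1998 was the year when the film titanic was released"
pattern = r"[a-zA-Z]+"

at_start = re.match(pattern, text)   # only tries position 0
anywhere = re.search(pattern, text)  # scans forward for the first match

print(at_start)            # None
print(anywhere.group(0))   # was
print(anywhere.span())     # (5, 8): start and end index of the match
```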

Matching String from the Start

To check if a string starts with a specific word, you can use the caret character, ^, followed by the word, together with the search function as shown below. Suppose we have the following string:

text = "XYZ 1998 was the year when the film titanic was released"  

If we want to find out whether the string starts with "1998", we can use the search function as follows:

result = re.search(r"^1998", text)  
type(result)  

The output will be NoneType, because search returns None: the text string doesn't have "1998" directly at the start.

Now let's change the content of the text variable so that "1998" is at the beginning, and then check whether "1998" is found at the start. Execute the following script:

text = "1998 was the year when the film titanic was released"  
if re.search(r"^1998", text):  
    print("Match found")
else:  
    print("Match not found")

Output:

Match found  

Matching Strings from the End

To check whether a string ends with a specific word or not, we can use the word in the regular expression, followed by the dollar sign. The dollar sign marks the end of the string. Take a look at the following example:

text = "1998 was the year when the film titanic was released"  
if re.search(r"1998$", text):  
    print("Match found")
else:  
    print("Match not found")

In the above script, we tried to find if the text string ends with "1998", which is not the case.

Output:

Match not found  

Now if we update the string and add "1998" at the end of the text string, the above script will print "Match found", as shown below:

text = "was the year when the film titanic was released 1998"  
if re.search(r"1998$", text):  
    print("Match found")
else:  
    print("Match not found")

Output:

Match found  
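The two anchors can be wrapped in small helper functions. The sketch below is only one possible way to do this; starts_with and ends_with are hypothetical names, and re.escape is used so that any special characters in the word are treated literally. It also shows one detail worth knowing: by default, $ matches at the very end of the string and also just before a single trailing newline.

```python
import re

def starts_with(word, text):
    # ^ anchors the pattern to the beginning of the string.
    return re.search("^" + re.escape(word), text) is not None

def ends_with(word, text):
    # $ anchors the pattern to the end of the string
    # (or just before a newline that ends the string).
    return re.search(re.escape(word) + "$", text) is not None

print(starts_with("1998", "1998 was the year"))    # True
print(starts_with("1998", "XYZ 1998 was the year"))  # False
print(ends_with("1998", "released 1998"))          # True
print(ends_with("1998", "released 1998\n"))        # True
```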

Substituting text in a String

So far we have been using regex to find whether a pattern exists in a string. Let's move on to another regex operation: substituting text in a string. The sub function is used for this purpose.

Let's take a simple example of the sub function. Suppose we have the following string:

text = "The film Pulp Fiction was released in year 1994"  

To replace the string "Pulp Fiction" with "Forrest Gump" (another movie released in 1994) we can use the sub function as follows:

result = re.sub(r"Pulp Fiction", "Forrest Gump", text)  

The first parameter to the sub function is the regular expression that finds the pattern to substitute. The second parameter is the new text that you want as a replacement for the old text and the third parameter is the text string on which the substitute operation will be performed.

If you print the result variable, you will see the new string.
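For reference, here is the complete substitution as a runnable sketch. It also illustrates the optional count parameter of re.sub, which limits how many replacements are made; the second call (replacing the letter e) is an illustrative addition and the variable name limited is made up for it:

```python
import re

text = "The film Pulp Fiction was released in year 1994"

result = re.sub(r"Pulp Fiction", "Forrest Gump", text)
print(result)   # The film Forrest Gump was released in year 1994

# count=2 replaces only the first two occurrences of the pattern.
limited = re.sub(r"e", "E", text, count=2)
print(limited)  # ThE film Pulp Fiction was rEleased in year 1994

# Strings are immutable, so the original text is unchanged.
print(text)
```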

Now let's substitute the letters in our string with the character "X". Execute the following script:

text = "The film Pulp Fiction was released in year 1994"  
result = re.sub(r"[a-z]", "X", text)  
print(result)  

Output:

TXX XXXX PXXX FXXXXXX XXX XXXXXXXX XX XXXX 1994  

It can be seen from the output that all the lowercase letters have been replaced, while the capital letters and the digits are unchanged. This is because we specified a-z only and not A-Z. There are two ways to replace the capital letters as well. You can either specify A-Z in the regular expression along with a-z as follows:

result = re.sub(r"[a-zA-Z]", "X", text)  

Or you can pass the additional flags parameter to the sub function and set its value to re.I, which is short for re.IGNORECASE and makes the matching case-insensitive, as follows:

result = re.sub(r"[a-z]", "X", text, flags=re.I)  

More details about the available flags can be found in the official Python documentation for the re module.
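As a quick sanity check, the sketch below confirms that the two approaches give the same result on this text. The variable names explicit and with_flag are illustrative:

```python
import re

text = "The film Pulp Fiction was released in year 1994"

# Approach 1: list both letter ranges explicitly.
explicit = re.sub(r"[a-zA-Z]", "X", text)

# Approach 2: keep a-z and ask for case-insensitive matching.
with_flag = re.sub(r"[a-z]", "X", text, flags=re.IGNORECASE)

print(explicit == with_flag)  # True
print(with_flag)  # XXX XXXX XXXX XXXXXXX XXX XXXXXXXX XX XXXX 1994
```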

Shorthand Character Classes

Shorthand character classes let you perform a variety of string manipulation tasks without writing complex logic. In this section we will discuss some of them:

Removing Digits from a String

The regex to find a digit in a string is \d. This pattern can be used to remove digits from a string by replacing them with an empty string, as shown below:

text = "The film Pulp Fiction was released in year 1994"  
result = re.sub(r"\d", "", text)  
print(result)  

Output:

The film Pulp Fiction was released in year  

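Removing Letters from a String

There is no shorthand class that matches letters only, so we use the character class [a-z] together with the re.I flag. Only the digits, and the spaces that separated the words, remain:

text = "The film Pulp Fiction was released in year 1994"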
result = re.sub(r"[a-z]", "", text, flags=re.I)  
print(result)  

Output:

1994  

Removing Word Characters

If you want to remove all the word characters (letters, digits, and the underscore) from a string and keep the remaining characters, you can use the \w pattern in your regex and replace it with an empty string, as shown below:

text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."  
result = re.sub(r"\w","", text, flags = re.I)  
print(result)  

Output:

, '@ '  ?   % $  .

The output shows that all the letters and digits have been removed.

Removing Non-Word Characters

To remove all the non-word characters, the \W pattern can be used as follows:

text = "The film, '@Pulp Fiction' was ? released in % $ year 1994."  
result = re.sub(r"\W", "", text, flags=re.I)  
print(result)  

Output:

ThefilmPulpFictionwasreleasedinyear1994  

From the output, you can see that everything has been removed (even spaces), except the letters and digits.
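The three shorthand classes used so far can be compared on a single string. The sample below is made up for illustration; it includes an underscore to show that \w treats it as a word character:

```python
import re

sample = "user_42 paid $9.50!"

print(repr(re.sub(r"\d", "", sample)))  # 'user_ paid $.!'   digits removed
print(repr(re.sub(r"\w", "", sample)))  # '  $.!'            letters, digits, _ removed
print(repr(re.sub(r"\W", "", sample)))  # 'user_42paid950'   everything else removed
```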

Grouping Multiple Patterns

You can match any one of several characters by listing them inside square brackets, which is called a character class. In fact, we did this when we matched capital and small letters. Let's put multiple punctuation marks in a character class and remove them from a string:

text = "The film, '@Pulp Fiction' was ? released _ in % $ year 1994."  
result = re.sub(r"[,@\'?\.$%_]", "", text, flags=re.I)  
print(result)  

Output:

The film Pulp Fiction was  released  in   year 1994  

You can see that the string in the text variable had multiple punctuation marks; we listed all of them in the regex inside square brackets. Note that the backslashes before the dot and the single quote are optional here. Inside square brackets the dot is treated as a literal dot rather than as "any character", and a single quote needs no escaping in a string delimited by double quotes. Outside square brackets, however, a literal dot must be written as \. because an unescaped dot matches any character.
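The sketch below checks that the escaped and unescaped forms of this character class behave the same. As an optional extension that is not part of the example above, it also builds a character class from string.punctuation, using re.escape so that characters such as ] and \ are safe inside the brackets. The names escaped, unescaped, punct_class, and cleaned are illustrative:

```python
import re
import string

text = "The film, '@Pulp Fiction' was ? released _ in % $ year 1994."

escaped = re.sub(r"[,@\'?\.$%_]", "", text)
unescaped = re.sub(r"[,@'?.$%_]", "", text)
print(escaped == unescaped)  # True

# Assumption: we want to strip every ASCII punctuation character.
punct_class = "[" + re.escape(string.punctuation) + "]"
cleaned = re.sub(punct_class, "", text)
print(cleaned)
```

For this particular text, cleaned is identical to the earlier result, because every punctuation mark in it was already listed by hand.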

Removing Multiple Spaces

Sometimes, multiple spaces appear between words as a result of removing words or punctuation. For instance, in the output of the last example, there are multiple spaces between in and year. These extra spaces can be collapsed using the \s pattern, which matches any single whitespace character (a space, tab, or newline).

text = "The film      Pulp Fiction      was released in   year 1994."  
result = re.sub(r"\s+"," ", text, flags = re.I)  
print(result)  

Output:

The film Pulp Fiction was released in year 1994.  

In the script above we used the expression \s+, which matches one or more consecutive whitespace characters.
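Since \s covers tabs and newlines as well as spaces, it behaves differently from a pattern made of a literal space. A small sketch on a made-up string (messy is an illustrative name):

```python
import re

messy = "The film\tPulp  Fiction\nwas   released"

# \s+ collapses any run of whitespace, including the tab and the newline.
print(repr(re.sub(r"\s+", " ", messy)))
# 'The film Pulp Fiction was released'

# A literal space followed by + only collapses runs of spaces.
print(repr(re.sub(r" +", " ", messy)))
# 'The film\tPulp Fiction\nwas released'
```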

Removing Spaces from Start and End

Sometimes we have a sentence that starts or ends with a space, which is often not desirable. The following script removes spaces from the beginning of a sentence:

text = "         The film Pulp Fiction was released in year 1994"  
result = re.sub(r"^\s+", "", text)  
print(result)  

Output:

The film Pulp Fiction was released in year 1994  

Similarly, to remove spaces at the end of the string, the following script can be used:

text = "The film Pulp Fiction was released in year 1994      "  
result = re.sub(r"\s+$", "", text)  
print(result)  
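If both ends need trimming, the two patterns can be combined with the alternation operator |. The following is a sketch of that idea; for plain whitespace, the built-in str.strip method gives the same result and is usually simpler:

```python
import re

text = "     The film Pulp Fiction was released in year 1994      "

# Match whitespace at the start OR whitespace at the end.
trimmed = re.sub(r"^\s+|\s+$", "", text)

print(repr(trimmed))            # 'The film Pulp Fiction was released in year 1994'
print(trimmed == text.strip())  # True
```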

Removing a Single Character

Sometimes removing punctuation marks, such as an apostrophe, leaves behind a single character that has no meaning. For instance, if you replace the apostrophe in the word Jacob's with a space, the resulting string is Jacob s. Here the s makes no sense. Such single characters can be removed using regex as shown below:

text = "The film Pulp Fiction     s was b released in year 1994"  
result = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  
print(result)  

Output:

The film Pulp Fiction was released in year 1994  

The script replaces any single small or capital letter that has whitespace on both sides, together with that whitespace, with a single space.
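One limitation of this pattern is that two stray letters in a row share the whitespace between them, so the second one is missed. A possible alternative, shown here as a sketch rather than as the approach used above, relies on the word boundary \b. The sample string and the names original_style and boundary_style are made up for illustration. Note that both versions also remove genuine one-letter words such as "a" or "I".

```python
import re

text = "Jacob s film x y was good"

# Pattern from the example above: needs whitespace on both sides,
# so "y" survives because " x " already consumed the space before it.
original_style = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
print(original_style)  # Jacob film y was good

# Alternative: a lone letter between word boundaries, plus trailing whitespace.
boundary_style = re.sub(r"\b[a-zA-Z]\b\s*", "", text)
print(boundary_style)  # Jacob film was good
```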

Splitting a String

String splitting is another very important operation. Strings can be split using the split function from the re module, which returns a list of tokens. Let's split a string wherever one or more whitespace characters are found, as shown below:

text = "The film      Pulp   Fiction was released in year 1994      "  
result = re.split(r"\s+", text)  
print(result)  

Output:

['The', 'film', 'Pulp', 'Fiction', 'was', 'released', 'in', 'year', '1994', '']

The last element is an empty string because the text ends with whitespace, so nothing follows the final separator.

You can use other regexes with the split function as well. For instance, the following call splits a string wherever a comma is found:

text = "The film, Pulp Fiction, was released in year 1994"  
result = re.split(r"\,", text)  
print(result)  

Output:

['The film', ' Pulp Fiction', ' was released in year 1994']
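Two small refinements are sketched below, as one possible approach: stripping the text before splitting avoids the empty string at the end of the first result, and letting the pattern absorb the whitespace around each comma avoids the leading spaces in the second. The names tokens and parts are illustrative:

```python
import re

text = "The film      Pulp   Fiction was released in year 1994      "

# Strip first so that no empty token is produced at either end.
tokens = re.split(r"\s+", text.strip())
print(tokens)
# ['The', 'film', 'Pulp', 'Fiction', 'was', 'released', 'in', 'year', '1994']

text = "The film, Pulp Fiction, was released in year 1994"

# \s* on both sides of the comma removes the surrounding whitespace.
parts = re.split(r"\s*,\s*", text)
print(parts)
# ['The film', 'Pulp Fiction', 'was released in year 1994']
```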

Finding All Instances

The match function only matches at the beginning of the string, while the search function scans the whole string but returns only the first match.

For instance, if we have the following string:

text = "I want to buy a mobile between 200 and 400 euros"  

We want to find all the numbers in this string. If we use the search function, only the first run of digits, i.e. 200, will be returned as shown below:

result = re.search(r"\d+", text)  
print(result.group(0))  

Output:

200  

On the other hand, the findall function returns a list that contains all the matches, as shown below:

text = "I want to buy a mobile between 200 and 400 euros"  
result = re.findall(r"\d+", text)  
print(result)  

Output:

['200', '400']

You can see from the output that both "200" and "400" are returned by the findall function.
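Note that findall returns strings, not numbers. The sketch below converts them to integers and, as an optional extra, uses the related finditer function, which yields match objects so that the position of each match is also available. The names prices and positions are illustrative:

```python
import re

text = "I want to buy a mobile between 200 and 400 euros"

# findall gives strings; convert them if numeric values are needed.
prices = [int(number) for number in re.findall(r"\d+", text)]
print(prices)  # [200, 400]

# finditer yields one match object per match, with its location.
positions = [(m.group(0), m.span()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('200', (31, 34)), ('400', (39, 42))]
```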

Conclusion

In this article we studied some of the most commonly used regex functions in Python. Regular expressions are extremely useful for preprocessing text that can then be used in a variety of applications, such as topic modeling, text classification, sentiment analysis, and text summarization.

About Usman Malik