Python for NLP: Vocabulary and Phrase Matching with SpaCy

This is the third article in this series on Python for Natural Language Processing. In the previous article, we saw how Python's NLTK and spaCy libraries can be used to perform simple NLP tasks such as tokenization, stemming, and lemmatization. We also saw how to perform part-of-speech tagging, named entity recognition, and noun phrase parsing. However, all of these operations are performed on individual words.

In this article, we will move a step further and explore vocabulary and phrase matching using the spaCy library. We will define patterns and then see which phrases in a document match them. This is similar to defining regular expressions that involve parts of speech.

Rule-Based Matching

The spaCy library comes with a Matcher tool that can be used to specify custom rules for phrase matching. The process for using the Matcher tool is pretty straightforward. First, you define the patterns that you want to match. Next, you add the patterns to the Matcher tool, and finally, you apply the Matcher to the document that you want to match your rules against. This is best explained with the help of an example.

For rule-based matching, you need to perform the steps described in the following sections.

Creating Matcher Object

The first step is to create the matcher object:

import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)

Defining Patterns

The next step is to define the patterns that will be used to filter similar phrases. Suppose we want to find the phrases "quick-brown-fox", "quick brown fox", "quickbrownfox" or "quick brownfox". To do so, we need to create the following four patterns:

p1 = [{'LOWER': 'quickbrownfox'}]
p2 = [{'LOWER': 'quick'}, {'IS_PUNCT': True}, {'LOWER': 'brown'}, {'IS_PUNCT': True}, {'LOWER': 'fox'}]
p3 = [{'LOWER': 'quick'}, {'LOWER': 'brown'}, {'LOWER': 'fox'}]
p4 = [{'LOWER': 'quick'}, {'LOWER': 'brownfox'}]

In the above script,

  • p1 looks for the phrase "quickbrownfox"
  • p2 looks for the phrase "quick-brown-fox"
  • p3 looks for the phrase "quick brown fox"
  • p4 looks for the phrase "quick brownfox"

The token attribute LOWER defines that the phrase should be converted into lower case before matching.

Once the patterns are defined, we need to add them to the Matcher object that we created earlier.

m_tool.add('QBF', [p1, p2, p3, p4])

Here "QBF" is the name of our matcher. You can give it any name.
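Because the patterns use the LOWER attribute rather than the verbatim token text, matching is case-insensitive. Here is a minimal sketch of that behavior; it uses a blank English pipeline (no trained model is needed for rule-based matching), the spaCy v3 `Matcher.add` signature, and a made-up sample sentence:

```python
import spacy
from spacy.matcher import Matcher

# a blank pipeline is enough for rule-based matching
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
matcher.add('QBF', [[{'LOWER': 'quickbrownfox'}]])

# the capitalization differs from the pattern, but LOWER still matches
doc = nlp('The QuickBrownFox jumps.')
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)  # ['QuickBrownFox']
```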

Applying Matcher to the Document

We have our matcher ready. The next step is to apply the matcher on a text document and see if we can get any match. Let's first create a simple document:

sentence = nlp(u'The quick-brown-fox jumps over the lazy dog. The quick brown fox eats well. \
               the quickbrownfox is dead. the dog misses the quick brownfox')

To apply the matcher to a document, pass the document as a parameter to the matcher object. The result is a list of the ids of the phrases matched in the document, along with their starting and ending positions. Execute the following script:

phrase_matches = m_tool(sentence)
print(phrase_matches)

The output of the script above looks like this:

[(12825528024649263697, 1, 6), (12825528024649263697, 13, 16), (12825528024649263697, 21, 22), (12825528024649263697, 29, 31)]

From the output, you can see that four phrases have been matched. The first long number in each tuple is the id of the matched phrase, and the second and third numbers are its starting and ending token positions.

To view the results in a more readable way, we can iterate through each matched phrase and display its string value. Execute the following script:

for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]  
    span = sentence[start:end]                   
    print(match_id, string_id, start, end, span.text)

Output:

12825528024649263697 QBF 1 6 quick-brown-fox
12825528024649263697 QBF 13 16 quick brown fox
12825528024649263697 QBF 21 22 quickbrownfox
12825528024649263697 QBF 29 31 quick brownfox

From the output, you can see all the matched phrases along with their vocabulary ids and start and end positions.

More Options for Rule-Based Matching

The official spaCy documentation contains details of all the token attributes and wildcards that can be used for phrase matching.

For instance, the 'OP': '*' attribute allows a token to match zero or more times.

Let's write a simple pattern that can identify phrases like "quick--brown--fox" or "quick-brown---fox".

Let's first remove the previous matcher QBF.

m_tool.remove('QBF')

Next, we need to define our new pattern:


p1 = [{'LOWER': 'quick'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'brown'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'fox'}]
m_tool.add('QBF', [p1])

The pattern p1 will match all phrases where the words quick, brown, and fox are separated by zero or more punctuation tokens. Let's now define our document for filtering:

sentence = nlp(u'The quick--brown--fox jumps over the  quick-brown---fox')

You can see that our document contains the two phrases quick--brown--fox and quick-brown---fox, which should match our pattern. Let's apply our matcher to the document and see the results:

phrase_matches = m_tool(sentence)

for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]  
    span = sentence[start:end]                   
    print(match_id, string_id, start, end, span.text)

The output of the script above looks like this:

12825528024649263697 QBF 1 6 quick--brown--fox
12825528024649263697 QBF 10 15 quick-brown---fox

From the output, you can see that our matcher has successfully matched the two phrases.
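Note that because 'OP': '*' means zero or more, the same pattern also matches the plain phrase with no punctuation in it at all. A quick sketch of this, using a blank English pipeline, the spaCy v3 `Matcher.add` signature, and a made-up sentence:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# zero or more punctuation tokens are allowed between the three words
pattern = [{'LOWER': 'quick'}, {'IS_PUNCT': True, 'OP': '*'},
           {'LOWER': 'brown'}, {'IS_PUNCT': True, 'OP': '*'},
           {'LOWER': 'fox'}]
matcher.add('QBF', [pattern])

doc = nlp('A quick brown fox met a quick--brown--fox.')
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)  # ['quick brown fox', 'quick--brown--fox']
```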

Phrase-Based Matching

In the last section, we saw how we can define rules that can be used to identify phrases from the document. In addition to defining rules, we can directly specify the phrases that we are looking for.
This is a more efficient way of phrase matching.
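The whole phrase-matching flow fits in a few lines. Here is a minimal sketch before we scale it up to a full Wikipedia article, using a blank English pipeline and a made-up sentence:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')
matcher = PhraseMatcher(nlp.vocab)

# the phrases we want to find, converted to Doc objects to use as patterns
patterns = [nlp(text) for text in ['machine learning', 'robots']]
matcher.add('AI', patterns)

doc = nlp('machine learning powers many robots today')
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)  # ['machine learning', 'robots']
```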

In this section, we will be doing phrase matching inside a Wikipedia article on Artificial intelligence.

Before we see the steps to perform phrase-matching, let's first parse the Wikipedia article that we will be using to perform phrase matching. Execute the following script:

import bs4 as bs  
import urllib.request  
import re

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')  
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:  
    article_text += p.text
    
    
processed_article = article_text.lower()  
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article)  
processed_article = re.sub(r'\s+', ' ', processed_article)
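To see what the two re.sub calls do, here is a quick check on a made-up sample string: the first call replaces everything except letters with spaces, and the second collapses runs of whitespace into a single space:

```python
import re

sample = "AI (artificial intelligence), b. 1956!"
cleaned = re.sub('[^a-zA-Z]', ' ', sample.lower())
cleaned = re.sub(r'\s+', ' ', cleaned)
print(repr(cleaned))  # 'ai artificial intelligence b '
```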

The script has been explained in detail in my article on Implementing Word2Vec with Gensim Library in Python. You can go and read the article if you want to understand how parsing works in Python.

The processed_article contains the document that we will use for phrase-matching.

The steps to perform phrase matching are quite similar to rule-based matching.

Create Phrase Matcher Object

As a first step, you need to create a PhraseMatcher object. The following script does that:

import spacy
nlp = spacy.load('en_core_web_sm')


from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)

Notice that in the previous section we created a Matcher object. Here, we are creating a PhraseMatcher object.

Create Phrase List

In the second step, you need to create a list of phrases to match and then convert the list to spaCy NLP documents as shown in the following script:

phrases = ['machine learning', 'robots', 'intelligent agents']

patterns = [nlp(text) for text in phrases]

Finally, you need to add your phrase list to the phrase matcher.

phrase_matcher.add('AI', patterns)

Here the name of our matcher is AI.

Applying Matcher to the Document

Like rule-based matching, we again need to apply our phrase matcher to the document. However, our parsed article is not in spaCy document format. Therefore, we will convert the article into spaCy document format and then apply the phrase matcher to it.

sentence = nlp(processed_article)

matched_phrases = phrase_matcher(sentence)

In the output, we will have the ids of all the matched phrases along with their start and end indexes in the document, as shown below:

[(5530044837203964789, 37, 39),
 (5530044837203964789, 402, 404),
 (5530044837203964789, 693, 694),
 (5530044837203964789, 1284, 1286),
 (5530044837203964789, 3059, 3061),
 (5530044837203964789, 3218, 3220),
 (5530044837203964789, 3753, 3754),
 (5530044837203964789, 5212, 5213),
 (5530044837203964789, 5287, 5288),
 (5530044837203964789, 6769, 6771),
 (5530044837203964789, 6781, 6783),
 (5530044837203964789, 7496, 7498),
 (5530044837203964789, 7635, 7637),
 (5530044837203964789, 8002, 8004),
 (5530044837203964789, 9461, 9462),
 (5530044837203964789, 9955, 9957),
 (5530044837203964789, 10784, 10785),
 (5530044837203964789, 11250, 11251),
 (5530044837203964789, 12290, 12291),
 (5530044837203964789, 12411, 12412),
 (5530044837203964789, 12455, 12456)]

To see the string value of the matched phrases, execute the following script:

for match_id, start, end in matched_phrases:
    string_id = nlp.vocab.strings[match_id]  
    span = sentence[start:end]                   
    print(match_id, string_id, start, end, span.text)

In the output, you will see the string values of the matched phrases, as shown below:

5530044837203964789 AI 37 39 intelligent agents
5530044837203964789 AI 402 404 machine learning
5530044837203964789 AI 693 694 robots
5530044837203964789 AI 1284 1286 machine learning
5530044837203964789 AI 3059 3061 intelligent agents
5530044837203964789 AI 3218 3220 machine learning
5530044837203964789 AI 3753 3754 robots
5530044837203964789 AI 5212 5213 robots
5530044837203964789 AI 5287 5288 robots
5530044837203964789 AI 6769 6771 machine learning
5530044837203964789 AI 6781 6783 machine learning
5530044837203964789 AI 7496 7498 machine learning
5530044837203964789 AI 7635 7637 machine learning
5530044837203964789 AI 8002 8004 machine learning
5530044837203964789 AI 9461 9462 robots
5530044837203964789 AI 9955 9957 machine learning
5530044837203964789 AI 10784 10785 robots
5530044837203964789 AI 11250 11251 robots
5530044837203964789 AI 12290 12291 robots
5530044837203964789 AI 12411 12412 robots
5530044837203964789 AI 12455 12456 robots

From the output, you can see all three phrases that we searched for, along with their start and end indexes and string ids.

Stop Words

Before we conclude this article, I wanted to touch on the concept of stop words. Stop words are English words such as "the", "a", and "an" that carry little meaning of their own. Stop words are often not very useful for NLP tasks such as text classification or language modeling, so it is often better to remove them before further processing of the document.

The spaCy library contains 305 default stop words (the exact number can vary between versions). Depending upon our requirements, we can also add stop words to or remove them from this list.

To see the default spaCy stop words, we can use the stop_words attribute of the spaCy model, as shown below:

import spacy
sp = spacy.load('en_core_web_sm')
print(sp.Defaults.stop_words)

In the output, you will see all the spaCy stop words:

{'less', 'except', 'top', 'me', 'three', 'fifteen', 'a', 'is', 'those', 'all', 'then', 'everyone', 'without', 'must', 'has', 'any', 'anyhow', 'keep', 'through', 'bottom', 'get', 'indeed', 'it', 'still', 'ten', 'whatever', 'doing', 'though', 'eight', 'various', 'myself', 'across', 'wherever', 'himself', 'always', 'thus', 'am', 'after', 'should', 'perhaps', 'at', 'down', 'own', 'rather', 'regarding', 'which', 'anywhere', 'whence', 'would', 'been', 'how', 'herself', 'now', 'might', 'please', 'behind', 'every', 'seems', 'alone', 'from', 'via', 'its', 'become', 'hers', 'there', 'front', 'whose', 'before', 'against', 'whereafter', 'up', 'whither', 'two', 'five', 'eleven', 'why', 'below', 'out', 'whereas', 'serious', 'six', 'give', 'also', 'became', 'his', 'anyway', 'none', 'again', 'onto', 'else', 'have', 'few', 'thereby', 'whoever', 'yet', 'part', 'just', 'afterwards', 'mostly', 'see', 'hereby', 'not', 'can', 'once', 'therefore', 'together', 'whom', 'elsewhere', 'beforehand', 'themselves', 'with', 'seem', 'many', 'upon', 'former', 'are', 'who', 'becoming', 'formerly', 'between', 'cannot', 'him', 'that', 'first', 'more', 'although', 'whenever', 'under', 'whereby', 'my', 'whereupon', 'anyone', 'toward', 'by', 'four', 'since', 'amongst', 'move', 'each', 'forty', 'somehow', 'as', 'besides', 'used', 'if', 'name', 'when', 'ever', 'however', 'otherwise', 'hundred', 'moreover', 'your', 'sometimes', 'the', 'empty', 'another', 'where', 'her', 'enough', 'quite', 'throughout', 'anything', 'she', 'and', 'does', 'above', 'within', 'show', 'in', 'this', 'back', 'made', 'nobody', 'off', 're', 'meanwhile', 'than', 'neither', 'twenty', 'call', 'you', 'next', 'thereupon', 'therein', 'go', 'or', 'seemed', 'such', 'latterly', 'already', 'mine', 'yourself', 'an', 'amount', 'hereupon', 'namely', 'same', 'their', 'of', 'yours', 'could', 'be', 'done', 'whole', 'seeming', 'someone', 'these', 'towards', 'among', 'becomes', 'per', 'thru', 'beyond', 'beside', 'both', 'latter', 'ours', 'well', 
'make', 'nowhere', 'about', 'were', 'others', 'due', 'yourselves', 'unless', 'thereafter', 'even', 'too', 'most', 'everything', 'our', 'something', 'did', 'using', 'full', 'while', 'will', 'only', 'nor', 'often', 'side', 'being', 'least', 'over', 'some', 'along', 'was', 'very', 'on', 'into', 'nine', 'noone', 'several', 'i', 'one', 'third', 'herein', 'but', 'further', 'here', 'whether', 'because', 'either', 'hereafter', 'really', 'so', 'somewhere', 'we', 'nevertheless', 'last', 'had', 'they', 'thence', 'almost', 'ca', 'everywhere', 'itself', 'no', 'ourselves', 'may', 'wherein', 'take', 'around', 'never', 'them', 'to', 'until', 'do', 'what', 'say', 'twelve', 'nothing', 'during', 'sixty', 'sometime', 'us', 'fifty', 'much', 'for', 'other', 'hence', 'he', 'put'}

You can also check if a word is a stop word or not. To do so, you can use the is_stop attribute as shown below:

sp.vocab['wonder'].is_stop

Since "wonder" is not a spaCy stop word, you will see False in the output.

To add or remove stop words in spaCy, you can use the sp.Defaults.stop_words.add() and sp.Defaults.stop_words.remove() methods, respectively.

sp.Defaults.stop_words.add('wonder')

Next, we need to set the is_stop attribute for wonder to True, as shown below:

sp.vocab['wonder'].is_stop = True
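Removal works the same way in reverse: remove the word from the stop word set and reset its is_stop flag. A short sketch, using a blank English pipeline, that adds "wonder" as a stop word and then removes it again:

```python
import spacy

# a blank pipeline is enough here; stop word data lives in the vocab
sp = spacy.blank('en')

# add a custom stop word and flag its lexeme
sp.Defaults.stop_words.add('wonder')
sp.vocab['wonder'].is_stop = True
print(sp.vocab['wonder'].is_stop)  # True

# remove it again and reset the flag
sp.Defaults.stop_words.remove('wonder')
sp.vocab['wonder'].is_stop = False
print(sp.vocab['wonder'].is_stop)  # False
```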

Conclusion

Phrase and vocabulary matching is one of the most important natural language processing tasks. In this article, we continued our discussion about how to use Python to perform rule-based and phrase based matching. In addition, we also saw spaCy stop words.

In the next article, we will see parts of speech tagging and named entity recognition in detail.

About Usman Malik
Paris, France
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life