This is the third article in this series of articles on Python for Natural Language Processing. In the previous article, we saw how Python's NLTK and spaCy libraries can be used to perform simple NLP tasks such as tokenization, stemming and lemmatization. We also saw how to perform parts of speech tagging, named entity recognition and noun parsing. However, all of these operations are performed on individual words.
In this article, we will move a step further and explore vocabulary and phrase matching using the spaCy library. We will define patterns and then see which phrases in a document match them. This is similar to defining regular expressions that involve parts of speech.
Rule-Based Matching
The spaCy library comes with a Matcher tool that can be used to specify custom rules for phrase matching. The process to use the Matcher tool is pretty straightforward. The first thing you have to do is define the patterns that you want to match. Next, you have to add the patterns to the Matcher tool and finally, you have to apply the Matcher tool to the document that you want to match your rules against. This is best explained with the help of an example.
For rule-based matching, you need to perform the following steps:
Creating Matcher Object
The first step is to create the matcher object:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)
Defining Patterns
The next step is to define the patterns that will be used to filter similar phrases. Suppose we want to find the phrases "quick-brown-fox", "quick brown fox", "quickbrownfox" or "quick brownfox". To do so, we need to create the following four patterns:
p1 = [{'LOWER': 'quickbrownfox'}]
p2 = [{'LOWER': 'quick'}, {'IS_PUNCT': True}, {'LOWER': 'brown'}, {'IS_PUNCT': True}, {'LOWER': 'fox'}]
p3 = [{'LOWER': 'quick'}, {'LOWER': 'brown'}, {'LOWER': 'fox'}]
p4 = [{'LOWER': 'quick'}, {'LOWER': 'brownfox'}]
In the above script,
- p1 looks for the phrase "quickbrownfox"
- p2 looks for the phrase "quick-brown-fox"
- p3 looks for the phrase "quick brown fox"
- p4 looks for the phrase "quick brownfox"
The token attribute LOWER specifies that the token's text is lowercased before matching, which makes the patterns above case-insensitive.
Once the patterns are defined, we need to add them to the Matcher object that we created earlier.
m_tool.add('QBF', None, p1, p2, p3, p4)
Here "QBF" is the name of our matcher. You can give it any name.
Applying Matcher to the Document
We have our matcher ready. The next step is to apply the matcher to a text document and see if we can get any matches. Let's first create a simple document:
sentence = nlp(u'The quick-brown-fox jumps over the lazy dog. The quick brown fox eats well. \
the quickbrownfox is dead. the dog misses the quick brownfox')
To apply the matcher to a document, the document needs to be passed as a parameter to the matcher object. The result will be a list of tuples containing the ids of the matched phrases, along with their starting and ending positions in the document. Execute the following script:
phrase_matches = m_tool(sentence)
print(phrase_matches)
The output of the script above looks like this:
[(12825528024649263697, 1, 6), (12825528024649263697, 13, 16), (12825528024649263697, 21, 22), (12825528024649263697, 29, 31)]
From the output, you can see that four phrases have been matched. The first long number in each tuple is the id of the matched phrase, while the second and third numbers are its starting and ending token positions.
To view the results in a more readable way, we can iterate through each matched phrase and display its string value. Execute the following script:
for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]
    span = sentence[start:end]
    print(match_id, string_id, start, end, span.text)
Output:
12825528024649263697 QBF 1 6 quick-brown-fox
12825528024649263697 QBF 13 16 quick brown fox
12825528024649263697 QBF 21 22 quickbrownfox
12825528024649263697 QBF 29 31 quick brownfox
From the output, you can see all the matched phrases along with their vocabulary ids and their start and end positions.
More Options for Rule-Based Matching
The official documentation of the spaCy library contains details of all the tokens and wildcards that can be used for phrase matching.
For instance, the 'OP': '*' attribute allows a token pattern to match zero or more times (use '+' if you want to require at least one occurrence).
Let's write a simple pattern that can identify phrases such as "quick--brown--fox" or "quick-brown---fox".
Let's first remove the previous matcher QBF.
m_tool.remove('QBF')
Next, we need to define our new pattern:
p1 = [{'LOWER': 'quick'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'brown'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'fox'}]
m_tool.add('QBF', None, p1)
The pattern p1 will match all the phrases where the tokens quick, brown and fox are separated by zero or more punctuation tokens. Let's now define our document for filtering:
sentence = nlp(u'The quick--brown--fox jumps over the quick-brown---fox')
You can see that our document contains two phrases, "quick--brown--fox" and "quick-brown---fox", that should match our pattern. Let's apply our matcher to the document and see the results:
phrase_matches = m_tool(sentence)
for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]
    span = sentence[start:end]
    print(match_id, string_id, start, end, span.text)
The output of the script above looks like this:
12825528024649263697 QBF 1 6 quick--brown--fox
12825528024649263697 QBF 10 15 quick-brown---fox
From the output, you can see that our matcher has successfully matched the two phrases.
Phrase-Based Matching
In the last section, we saw how we can define rules that can be used to identify phrases from the document. In addition to defining rules, we can directly specify the phrases that we are looking for.
This is a more efficient way of phrase matching.
In this section, we will be doing phrase matching inside a Wikipedia article on Artificial intelligence.
Before we see the steps to perform phrase-matching, let's first parse the Wikipedia article that we will be using to perform phrase matching. Execute the following script:
import bs4 as bs
import urllib.request
import re
import nltk

# Download the Wikipedia article on Artificial intelligence
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

# Parse the HTML and extract the text from all paragraph tags
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text

# Lowercase the text, then remove non-alphabetic characters and extra spaces
processed_article = article_text.lower()
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article)
processed_article = re.sub(r'\s+', ' ', processed_article)
The script has been explained in detail in my article on Implementing Word2Vec with the gensim Library in Python. You can go and read that article if you want to understand how the parsing works.
The processed_article variable contains the text that we will use for phrase matching.
The steps to perform phrase matching are quite similar to rule-based matching.
Create Phrase Matcher Object
As a first step, you need to create a PhraseMatcher object. The following script does that:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)
Notice that in the previous section we created a Matcher object. Here, we are creating a PhraseMatcher object.
Create Phrase List
In the second step, you need to create a list of phrases to match and then convert the list to spaCy NLP documents as shown in the following script:
phrases = ['machine learning', 'robots', 'intelligent agents']
patterns = [nlp(text) for text in phrases]
Finally, you need to add your phrase list to the phrase matcher.
phrase_matcher.add('AI', None, *patterns)
Here the name of our matcher is AI.
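As with the Matcher, the add() call above uses the spaCy 2.x signature. On spaCy 3.x, PhraseMatcher.add expects the pattern documents as a single list, roughly like this:
phrase_matcher.add('AI', patterns)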
Applying Matcher to the Document
Like rule-based matching, we again need to apply our phrase matcher to the document. However, our parsed article is not in spaCy document format. Therefore, we will convert our article into spaCy document format and then apply our phrase matcher to it.
sentence = nlp(processed_article)
matched_phrases = phrase_matcher(sentence)
In the output, we will have the ids of all the matched phrases along with their start and end indexes in the document, as shown below:
[(5530044837203964789, 37, 39),
(5530044837203964789, 402, 404),
(5530044837203964789, 693, 694),
(5530044837203964789, 1284, 1286),
(5530044837203964789, 3059, 3061),
(5530044837203964789, 3218, 3220),
(5530044837203964789, 3753, 3754),
(5530044837203964789, 5212, 5213),
(5530044837203964789, 5287, 5288),
(5530044837203964789, 6769, 6771),
(5530044837203964789, 6781, 6783),
(5530044837203964789, 7496, 7498),
(5530044837203964789, 7635, 7637),
(5530044837203964789, 8002, 8004),
(5530044837203964789, 9461, 9462),
(5530044837203964789, 9955, 9957),
(5530044837203964789, 10784, 10785),
(5530044837203964789, 11250, 11251),
(5530044837203964789, 12290, 12291),
(5530044837203964789, 12411, 12412),
(5530044837203964789, 12455, 12456)]
To see the string value of the matched phrases, execute the following script:
for match_id, start, end in matched_phrases:
    string_id = nlp.vocab.strings[match_id]
    span = sentence[start:end]
    print(match_id, string_id, start, end, span.text)
In the output, you will see the string value of the matched phrases as shown below:
5530044837203964789 AI 37 39 intelligent agents
5530044837203964789 AI 402 404 machine learning
5530044837203964789 AI 693 694 robots
5530044837203964789 AI 1284 1286 machine learning
5530044837203964789 AI 3059 3061 intelligent agents
5530044837203964789 AI 3218 3220 machine learning
5530044837203964789 AI 3753 3754 robots
5530044837203964789 AI 5212 5213 robots
5530044837203964789 AI 5287 5288 robots
5530044837203964789 AI 6769 6771 machine learning
5530044837203964789 AI 6781 6783 machine learning
5530044837203964789 AI 7496 7498 machine learning
5530044837203964789 AI 7635 7637 machine learning
5530044837203964789 AI 8002 8004 machine learning
5530044837203964789 AI 9461 9462 robots
5530044837203964789 AI 9955 9957 machine learning
5530044837203964789 AI 10784 10785 robots
5530044837203964789 AI 11250 11251 robots
5530044837203964789 AI 12290 12291 robots
5530044837203964789 AI 12411 12412 robots
5530044837203964789 AI 12455 12456 robots
From the output, you can see all three phrases that we searched for, along with their start and end indexes and their string ids.
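If you also want to know how often each phrase occurs in the article, a simple tally over the matched spans will do. The following is a minimal sketch that builds on the sentence and matched_phrases variables defined above; it is not part of the original walkthrough:
from collections import Counter

# Count how many times each matched phrase appears in the article
phrase_counts = Counter(sentence[start:end].text for _, start, end in matched_phrases)
print(phrase_counts)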
Stop Words
Before we conclude this article, I just wanted to touch on the concept of stop words. Stop words are English words such as "the", "a" and "an" that carry little meaning of their own. Stop words are often not very useful for NLP tasks such as text classification or language modeling, so it is often better to remove them before further processing of the document.
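As a quick illustration (a minimal sketch, not taken from the original scripts), stop words can be filtered out of a spaCy document using each token's is_stop attribute:
import spacy

sp = spacy.load('en_core_web_sm')
doc = sp(u'The quick brown fox jumps over the lazy dog')

# Keep only the tokens that are not in spaCy's default stop word list
filtered_tokens = [token.text for token in doc if not token.is_stop]
print(filtered_tokens)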
The spaCy library contains 305 default stop words (the exact number can vary between versions). In addition, depending upon our requirements, we can also add words to or remove words from spaCy's stop word list.
To see the default spaCy stop words, we can use the stop_words attribute of the spaCy model as shown below:
import spacy
sp = spacy.load('en_core_web_sm')
print(sp.Defaults.stop_words)
In the output, you will see all the spaCy stop words:
{'less', 'except', 'top', 'me', 'three', 'fifteen', 'a', 'is', 'those', 'all', 'then', 'everyone', 'without', 'must', 'has', 'any', 'anyhow', 'keep', 'through', 'bottom', 'get', 'indeed', 'it', 'still', 'ten', 'whatever', 'doing', 'though', 'eight', 'various', 'myself', 'across', 'wherever', 'himself', 'always', 'thus', 'am', 'after', 'should', 'perhaps', 'at', 'down', 'own', 'rather', 'regarding', 'which', 'anywhere', 'whence', 'would', 'been', 'how', 'herself', 'now', 'might', 'please', 'behind', 'every', 'seems', 'alone', 'from', 'via', 'its', 'become', 'hers', 'there', 'front', 'whose', 'before', 'against', 'whereafter', 'up', 'whither', 'two', 'five', 'eleven', 'why', 'below', 'out', 'whereas', 'serious', 'six', 'give', 'also', 'became', 'his', 'anyway', 'none', 'again', 'onto', 'else', 'have', 'few', 'thereby', 'whoever', 'yet', 'part', 'just', 'afterwards', 'mostly', 'see', 'hereby', 'not', 'can', 'once', 'therefore', 'together', 'whom', 'elsewhere', 'beforehand', 'themselves', 'with', 'seem', 'many', 'upon', 'former', 'are', 'who', 'becoming', 'formerly', 'between', 'cannot', 'him', 'that', 'first', 'more', 'although', 'whenever', 'under', 'whereby', 'my', 'whereupon', 'anyone', 'toward', 'by', 'four', 'since', 'amongst', 'move', 'each', 'forty', 'somehow', 'as', 'besides', 'used', 'if', 'name', 'when', 'ever', 'however', 'otherwise', 'hundred', 'moreover', 'your', 'sometimes', 'the', 'empty', 'another', 'where', 'her', 'enough', 'quite', 'throughout', 'anything', 'she', 'and', 'does', 'above', 'within', 'show', 'in', 'this', 'back', 'made', 'nobody', 'off', 're', 'meanwhile', 'than', 'neither', 'twenty', 'call', 'you', 'next', 'thereupon', 'therein', 'go', 'or', 'seemed', 'such', 'latterly', 'already', 'mine', 'yourself', 'an', 'amount', 'hereupon', 'namely', 'same', 'their', 'of', 'yours', 'could', 'be', 'done', 'whole', 'seeming', 'someone', 'these', 'towards', 'among', 'becomes', 'per', 'thru', 'beyond', 'beside', 'both', 'latter', 'ours', 'well', 'make', 'nowhere', 'about', 'were', 'others', 'due', 'yourselves', 'unless', 'thereafter', 'even', 'too', 'most', 'everything', 'our', 'something', 'did', 'using', 'full', 'while', 'will', 'only', 'nor', 'often', 'side', 'being', 'least', 'over', 'some', 'along', 'was', 'very', 'on', 'into', 'nine', 'noone', 'several', 'i', 'one', 'third', 'herein', 'but', 'further', 'here', 'whether', 'because', 'either', 'hereafter', 'really', 'so', 'somewhere', 'we', 'nevertheless', 'last', 'had', 'they', 'thence', 'almost', 'ca', 'everywhere', 'itself', 'no', 'ourselves', 'may', 'wherein', 'take', 'around', 'never', 'them', 'to', 'until', 'do', 'what', 'say', 'twelve', 'nothing', 'during', 'sixty', 'sometime', 'us', 'fifty', 'much', 'for', 'other', 'hence', 'he', 'put'}
You can also check whether a word is a stop word or not. To do so, you can use the is_stop attribute as shown below:
sp.vocab['wonder'].is_stop
Since "wonder" is not a spaCy stop word, you will see False
in the output.
To add or remove stop words in spaCy, you can use the sp.Defaults.stop_words.add() and sp.Defaults.stop_words.remove() methods, respectively.
sp.Defaults.stop_words.add('wonder')
Next, we need to set the is_stop tag for wonder to True as shown below:
sp.vocab['wonder'].is_stop = True
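The reverse works the same way. As a small sketch (not shown in the original scripts), the word can be removed from the stop word list again and its is_stop flag reset:
# Remove "wonder" from the default stop word list and reset its flag
sp.Defaults.stop_words.remove('wonder')
sp.vocab['wonder'].is_stop = False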
Conclusion
Phrase and vocabulary matching is one of the most important natural language processing tasks. In this article, we continued our discussion of how to use Python to perform rule-based and phrase-based matching. In addition, we also saw spaCy stop words.
In the next article, we will see parts of speech tagging and named entity recognition in detail.