This is the fourth article in my series on Python for NLP. In my previous article, I explained how the spaCy library can be used to perform tasks like vocabulary and phrase matching.
In this article, we will study parts of speech tagging and named entity recognition in detail. We will see how the spaCy library can be used to perform these two tasks.
Parts of Speech (POS) Tagging
Parts of speech tagging refers to assigning a part of speech to each word in a sentence. Unlike phrase matching, which is performed at the sentence or multi-word level, parts of speech tagging is performed at the token level.
Let's take a very simple example of parts of speech tagging.
import spacy
sp = spacy.load('en_core_web_sm')
As usual, in the script above we import spaCy and load its small English model. Next, we need to create a spaCy document that we will use to perform parts of speech tagging.
sen = sp(u"I like to play football. I hated it in my childhood though")
The spaCy document object and its tokens have several attributes that can be used to perform a variety of tasks. For instance, to print the text of the document, the text attribute is used. Similarly, the pos_ attribute of a token returns its coarse-grained POS tag. To obtain the fine-grained POS tag, we can use the tag_ attribute. Finally, to get the explanation of a tag, we can use the spacy.explain() function and pass it the tag name.
Let's see this in action:
print(sen.text)
The above script simply prints the text of the sentence. The output looks like this:
I like to play football. I hated it in my childhood though
Next, let's look at the pos_ attribute. We will print the POS tag of the word "hated", which is the eighth token in the sentence and therefore sits at index 7, since indexing starts at 0.
print(sen[7].pos_)
Output:
VERB
You can see that the POS tag returned for "hated" is VERB, since "hated" is a verb.
Now let's print the fine-grained POS tag for the word "hated".
print(sen[7].tag_)
Output:
VBD
To see what VBD means, we can use the spacy.explain() function as shown below:
print(spacy.explain(sen[7].tag_))
Output:
verb, past tense
The output shows that VBD is a verb in the past tense.
Let's print the text, coarse-grained POS tags, fine-grained POS tags, and the explanation for the tags for all the words in the sentence.
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')
In the script above, the format specifiers improve readability by padding the text to a width of 12 characters, the coarse-grained POS tag to 10 characters, and the fine-grained POS tag to 8 characters, so that the columns line up.
Output:
I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival to
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           ADJ        PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       ADP        IN       conjunction, subordinating or preposition
A complete list of the coarse-grained and fine-grained tags, along with their explanations, is available in the official spaCy documentation.
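Once each token carries a tag, ordinary Python is enough to slice a sentence by part of speech. The sketch below groups words by their coarse-grained tag. To keep it runnable without downloading a model, it uses a hypothetical stand-in Token tuple filled with the values from the output above. The helper only reads the text and pos_ attributes, so passing the sen document itself should work the same way.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy token; it carries only the two
# attributes the helper reads (text and pos_).
Token = namedtuple("Token", ["text", "pos_"])

# Values copied from the tagged output shown above.
tokens = [
    Token("I", "PRON"), Token("like", "VERB"), Token("to", "PART"),
    Token("play", "VERB"), Token("football", "NOUN"), Token(".", "PUNCT"),
    Token("I", "PRON"), Token("hated", "VERB"), Token("it", "PRON"),
    Token("in", "ADP"), Token("my", "ADJ"), Token("childhood", "NOUN"),
    Token("though", "ADP"),
]


def words_by_pos(doc):
    """Map each coarse-grained POS tag to the words that carry it, in order."""
    groups = defaultdict(list)
    for token in doc:
        groups[token.pos_].append(token.text)
    return dict(groups)


groups = words_by_pos(tokens)
print(groups["VERB"])
print(groups["NOUN"])
```

A grouping like this is a convenient starting point for tasks such as extracting only the verbs or nouns from a text.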
Why Is POS Tagging Useful?
POS tagging is particularly useful when words or tokens can have more than one POS tag. For instance, the word "google" can be used as both a noun and a verb, depending upon the context. While processing natural language, it is important to identify this difference. Fortunately, the spaCy library ships with pretrained statistical models that use the context (the surrounding words) to return the most likely POS tag for a word.
Let's see this in action. Execute the following script:
sen = sp(u'Can you google it?')
word = sen[2]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')
In the script above we create a spaCy document with the text "Can you google it?" Here the word "google" is being used as a verb. Next, we print the POS tag for the word "google" along with the explanation of the tag. The output looks like this:
google VERB VB verb, base form
From the output, you can see that the word "google" has been correctly identified as a verb.
Let's now see another example:
sen = sp(u'Can you search it on google?')
word = sen[5]
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')
In the script above, the word "google" is used as a noun, and the output shows that it is tagged as a proper noun:
google PROPN NNP noun, proper singular
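The two results can also be compared programmatically. The following sketch records which coarse-grained tags each word form received and reports the forms that were tagged in more than one way. The Token tuple is a hypothetical stand-in holding only the two "google" results printed above. With real documents you would loop over the tokens of each spaCy document instead.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy token (text and pos_ only).
Token = namedtuple("Token", ["text", "pos_"])

# The two taggings of "google" reported in the outputs above.
tagged = [Token("google", "VERB"), Token("google", "PROPN")]


def ambiguous_words(tokens):
    """Return the word forms that were seen with more than one POS tag."""
    seen = defaultdict(set)
    for token in tokens:
        seen[token.text.lower()].add(token.pos_)
    return {word: sorted(tags) for word, tags in seen.items() if len(tags) > 1}


ambiguous = ambiguous_words(tagged)
print(ambiguous)
```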
Finding the Number of POS Tags
You can find the number of occurrences of each POS tag by calling the count_by method on the spaCy document object. The method takes spacy.attrs.POS as its argument.
sen = sp(u"I like to play football. I hated it in my childhood though")
num_pos = sen.count_by(spacy.attrs.POS)
num_pos
Output:
{96: 1, 99: 3, 84: 2, 83: 1, 91: 2, 93: 1, 94: 3}
In the output, you can see the IDs of the POS tags along with their frequencies of occurrence. The IDs themselves can differ between spaCy versions. The text of a POS tag can be displayed by looking up its ID in the vocabulary of the spaCy document.
for k, v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')
Now in the output, you will see the ID, the text, and the frequency of each tag as shown below:
83. ADJ     : 1
84. ADP     : 2
91. NOUN    : 2
93. PART    : 1
94. PRON    : 3
96. PUNCT   : 1
99. VERB    : 3
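If only the tag names matter, one alternative that skips the ID lookup is to count the pos_ strings directly with collections.Counter. This is a plain-Python substitute for count_by rather than a spaCy feature. The Token tuple below is a hypothetical stand-in populated from the tagged output earlier in the article so that the snippet runs on its own. With spaCy loaded, iterating over sen in its place should give the same kind of result.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token (text and pos_ only).
Token = namedtuple("Token", ["text", "pos_"])

# Coarse-grained tags copied from the article's tagged output.
tokens = [
    Token("I", "PRON"), Token("like", "VERB"), Token("to", "PART"),
    Token("play", "VERB"), Token("football", "NOUN"), Token(".", "PUNCT"),
    Token("I", "PRON"), Token("hated", "VERB"), Token("it", "PRON"),
    Token("in", "ADP"), Token("my", "ADJ"), Token("childhood", "NOUN"),
    Token("though", "ADP"),
]

# Count tag names directly, so no ID-to-text lookup is needed.
pos_counts = Counter(token.pos_ for token in tokens)

for tag, count in sorted(pos_counts.items()):
    print(f"{tag:{8}}: {count}")
```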
Visualizing Parts of Speech Tags
Visualizing POS tags graphically is straightforward. The displacy module from the spacy library is used for this purpose. To visualize the POS tags inside a Jupyter notebook, call the render function from the displacy module, pass it the spaCy document and the style of the visualization, and set the jupyter parameter to True, as shown below:
from spacy import displacy
sen = sp(u"I like to play football. I hated it in my childhood though")
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})
In the output, you should see a dependency tree annotated with POS tags.
You can clearly see how each token depends on another, along with its POS tag.
If you want to visualize the POS tags outside the Jupyter notebook, you need to call the serve function instead. The plot for the POS tags is then served as an HTML page that you can open in your browser. Execute the following script:
displacy.serve(sen, style='dep', options={'distance': 120})
Once you execute the above script, you will see the following message:
Serving on port 5000...
Using the 'dep' visualizer
To view the dependency tree, open the following address in your browser: http://127.0.0.1:5000/. You will see the dependency tree for the sentence.
Named Entity Recognition
Named entity recognition refers to identifying words or phrases in a sentence as entities, e.g. the name of a person, place, or organization. Let's see how the spaCy library performs named entity recognition. Look at the following script:
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million')
In the script above we created a simple spaCy document with some text. To find the named entities we can use the ents attribute, which returns a tuple of all the named entities in the document.
print(sen.ents)
Output:
(Manchester United, Harry Kane, $90 million)
You can see that three named entities were identified. To see the details of each named entity, you can use its text and label_ attributes, along with the spacy.explain function, which takes the entity label as a parameter.
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
In the output, you will see the name of the entity along with the entity type and a small description of the entity as shown below:
Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
You can see that "Manchester United" has been correctly identified as an organization. Similarly, "Harry Kane" has been identified as a person and, finally, "$90 million" has been correctly identified as an entity of type MONEY.
Adding New Entities
You can also add new entities to an existing document. For instance, in the following example, "Nesfruita" is not identified as a company by the spaCy library.
sen = sp(u'Nesfruita is setting up a new company in India')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
Output:
India - GPE - Countries, cities, states
From the output, you can see that only India has been identified as an entity.
Now to add "Nesfruita" as an entity of type "ORG" to our document, we need to execute the following steps:
from spacy.tokens import Span
ORG = sen.vocab.strings[u'ORG']
new_entity = Span(sen, 0, 1, label=ORG)
sen.ents = list(sen.ents) + [new_entity]
First, we need to import the Span class from the spacy.tokens module. Next, we need to get the hash value of the ORG entity type from the vocabulary of our document. After that, we create a span that covers "Nesfruita" and assign it the hash value of ORG as its label. Since "Nesfruita" is the first token in the document, the span starts at index 0 and ends at index 1, where the end index is exclusive. Finally, we need to add the new entity span to the existing entities. Now if you execute the following script, you will see "Nesfruita" in the list of entities.
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
The output of the script above looks like this:
Nesfruita - ORG - Companies, agencies, institutions, etc.
India - GPE - Countries, cities, states
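For a multi-word name, the start and end token indexes have to be worked out before a span can be created. The helper below is a hypothetical convenience, not part of spaCy. It searches a list of token texts for a phrase and returns the (start, end) pair, with the end index exclusive, in the form expected by Span. The token list here comes from a plain whitespace split, which is assumed to match how this particular sentence is tokenized. With spaCy, the token texts of the document would be used instead.

```python
def find_span(token_texts, phrase_texts):
    """Return (start, end) token indexes of the first match, or None.

    The end index is exclusive, mirroring Python slicing.
    """
    n = len(phrase_texts)
    if n == 0:
        return None
    for start in range(len(token_texts) - n + 1):
        if token_texts[start:start + n] == phrase_texts:
            return start, start + n
    return None


# Assumption: a whitespace split matches the tokenization of this sentence.
token_texts = "Nesfruita is setting up a new company in India".split()

single = find_span(token_texts, ["Nesfruita"])
multi = find_span(token_texts, ["new", "company"])
missing = find_span(token_texts, ["Google"])
print(single, multi, missing)
```

The pair returned for "Nesfruita" corresponds to the 0 and 1 passed to Span in the script above.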
Counting Entities
In the case of POS tags, we could count the frequency of each POS tag in a document using the count_by method. However, for named entities, no such method exists, so we have to count the frequency of each entity type manually. Suppose we have the following document along with its entities:
sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
Output:
Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
David - PERSON - People, including fictional
100 Million Dollars - MONEY - Monetary values, including unit
To count the person type entities in the above document, we can use the following script:
len([ent for ent in sen.ents if ent.label_=='PERSON'])
In the output, you will see 2 since there are 2 entities of type PERSON in the document.
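To count every entity type in one pass rather than one type at a time, a collections.Counter over the label_ values is one option. The Entity tuple below is a hypothetical stand-in filled with the entities from the output above so that the snippet runs without a model. Since only label_ is read, iterating over sen.ents should work in the same way.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy entity span (text and label_ only).
Entity = namedtuple("Entity", ["text", "label_"])

# Entities copied from the output shown above.
entities = [
    Entity("Manchester United", "ORG"),
    Entity("Harry Kane", "PERSON"),
    Entity("$90 million", "MONEY"),
    Entity("David", "PERSON"),
    Entity("100 Million Dollars", "MONEY"),
]

# One pass over the entities gives the frequency of every label.
label_counts = Counter(ent.label_ for ent in entities)

for label, count in sorted(label_counts.items()):
    print(f"{label}: {count}")
```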
Visualizing Named Entities
Like the POS tags, we can also view named entities inside the Jupyter notebook as well as in the browser.
To do so, we will again use the displacy module. Look at the following example:
from spacy import displacy
sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
displacy.render(sen, style='ent', jupyter=True)
You can see that the only difference between visualizing named entities and POS tags is that, in the case of named entities, we passed ent as the value for the style parameter.
You can see from the output that the named entities have been highlighted in different colors along with their entity types.
You can also filter which entity types to display. To do so, put the entity types you want to display in a list and use that list as the value of the ents key of a dictionary. The dictionary is then passed to the options parameter of the render function of the displacy module, as shown below:
ent_filter = {'ents': ['ORG']}  # renamed so that Python's built-in filter() is not shadowed
displacy.render(sen, style='ent', jupyter=True, options=ent_filter)
In the script above, we specified that only the entities of type ORG should be displayed in the output, so only "Manchester United" is highlighted.
Finally, you can also display named entities outside the Jupyter notebook. The following script serves the visualization so that you can view it in your browser:
displacy.serve(sen, style='ent')
Now if you go to the address http://127.0.0.1:5000/ in your browser, you should see the named entities.
Conclusion
Parts of speech tagging and named entity recognition are crucial to the success of many NLP tasks. In this article, we saw how Python's spaCy library can be used to perform POS tagging and named entity recognition with the help of different examples.