Python for NLP: Parts of Speech Tagging and Named Entity Recognition

This is the fourth article in my series on Python for NLP. In the previous article, I explained how the spaCy library can be used to perform tasks like vocabulary and phrase matching.

In this article, we will study parts of speech tagging and named entity recognition in detail. We will see how the spaCy library can be used to perform these two tasks.

Parts of Speech (POS) Tagging

Parts of speech tagging refers to assigning a part of speech to each individual word in a sentence. Unlike phrase matching, which operates at the sentence or multi-word level, parts of speech tagging is performed at the token level.

Let's take a very simple example of parts of speech tagging.

import spacy  
sp = spacy.load('en_core_web_sm')  

As usual, in the script above we import spaCy and load its small core English model. Next, we need to create a spaCy document that we will use to perform parts of speech tagging.

sen = sp(u"I like to play football. I hated it in my childhood though")  

The spaCy document object and its tokens have several attributes that can be used to perform a variety of tasks. For instance, to print the text of the document, the text attribute is used. Similarly, the pos_ attribute of a token returns its coarse-grained POS tag. To obtain the fine-grained POS tag, we can use the tag_ attribute. And finally, to get the explanation of a tag, we can use the spacy.explain() function and pass it the tag name.

Let's see this in action:

print(sen.text)  

The above script simply prints the text of the sentence. The output looks like this:

I like to play football. I hated it in my childhood though  

Next, let's see the pos_ attribute. We will print the POS tag of the word "hated", which is at index 7 in the document (the eighth token, since indexing starts at zero and the period counts as a token).

print(sen[7].pos_)  

Output:

VERB  

You can see that the POS tag returned for "hated" is VERB, since "hated" is a verb.

Now let's print the fine-grained POS tag for the word "hated".

print(sen[7].tag_)  

Output:

VBD  

To see what VBD means, we can use the spacy.explain() function as shown below:

print(spacy.explain(sen[7].tag_))  

Output:

verb, past tense  

The output shows that VBD is a verb in the past tense.

Let's print the text, coarse-grained POS tags, fine-grained POS tags, and the explanation for the tags for all the words in the sentence.

for word in sen:  
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

In the script above we improve the readability of the output by padding each field to a minimum width: the text is left-aligned in a 12-character column, the coarse-grained POS tag in a 10-character column, and the fine-grained POS tag in an 8-character column.

Output:

I            PRON       PRP      pronoun, personal  
like         VERB       VBP      verb, non-3rd person singular present  
to           PART       TO       infinitival to  
play         VERB       VB       verb, base form  
football     NOUN       NN       noun, singular or mass  
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal  
hated        VERB       VBD      verb, past tense  
it           PRON       PRP      pronoun, personal  
in           ADP        IN       conjunction, subordinating or preposition  
my           ADJ        PRP$     pronoun, possessive  
childhood    NOUN       NN       noun, singular or mass  
though       ADP        IN       conjunction, subordinating or preposition  
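The column alignment in the output above comes from the width fields in the f-string. As a minimal sketch that runs without a spaCy model, the snippet below applies the same format specifier to values copied by hand from the "hated" row; `format_row` is a hypothetical helper name used only for illustration, not part of spaCy.

```python
# Hypothetical helper (not part of spaCy): formats one row of the tag table.
def format_row(text, pos, tag, explanation):
    # {text:{12}} left-aligns the string in a field at least 12 characters wide.
    # The nested braces let the width be any expression, here a literal number.
    return f'{text:{12}} {pos:{10}} {tag:{8}} {explanation}'

# Values copied by hand from the "hated" row of the output above.
row = format_row('hated', 'VERB', 'VBD', 'verb, past tense')
print(row)

# Each column starts at a fixed offset because the fields are padded.
print(row.index('VERB'), row.index('VBD'))
```

Note that the width is a minimum: a token longer than 12 characters is not truncated, so it would push the remaining columns to the right.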

A complete list of the coarse-grained and fine-grained tags, along with their explanations, is available in the official spaCy documentation.

Why Is POS Tagging Useful?

POS tagging is particularly useful when a word or token can have more than one part of speech. For instance, the word "google" can be used as both a noun and a verb, depending upon the context. While processing natural language, it is important to identify this difference. Fortunately, the spaCy library ships with pre-trained statistical models that use the context (the surrounding words) to predict the correct POS tag for a word.

Let's see this in action. Execute the following script:

sen = sp(u'Can you google it?')  
word = sen[2]

print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')  

In the script above we create a spaCy document with the text "Can you google it?" Here the word "google" is being used as a verb. Next, we print the POS tag for the word "google" along with the explanation of the tag. The output looks like this:

google       VERB       VB       verb, base form  

From the output, you can see that the word "google" has been correctly identified as a verb.

Let's now see another example:

sen = sp(u'Can you search it on google?')  
word = sen[5]

print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')  

In the above script the word "google" is being used as a noun, and the output shows that it is tagged as a proper noun:

google       PROPN      NNP      noun, proper singular  

Finding the Number of POS Tags

You can find the number of occurrences of each POS tag by calling the count_by method on the spaCy document object. The method takes the ID of the attribute to count by, here spacy.attrs.POS.

sen = sp(u"I like to play football. I hated it in my childhood though")

num_pos = sen.count_by(spacy.attrs.POS)  
num_pos  

Output:

{96: 1, 99: 3, 84: 2, 83: 1, 91: 2, 93: 1, 94: 3}

In the output, you can see the IDs of the POS tags along with their frequencies of occurrence. The numeric IDs may differ between spaCy versions. The text of a POS tag can be displayed by looking up its ID in the vocabulary of the spaCy document.

for k,v in sorted(num_pos.items()):  
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')

Now in the output, you will see the ID, the text, and the frequency of each tag as shown below:

83. ADJ     : 1  
84. ADP     : 2  
91. NOUN    : 2  
93. PART    : 1  
94. PRON    : 3  
96. PUNCT   : 1  
99. VERB    : 3  
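As an alternative sketch, the same frequencies can be tallied directly by tag name with `collections.Counter`, which avoids the ID lookup. The helper name `count_pos` is hypothetical, and the stand-in tokens are built by hand from the tag table shown earlier so that the snippet runs without a model. With spaCy loaded, passing the document itself (`count_pos(sen)`) should work the same way, since every token exposes a `pos_` attribute.

```python
from collections import Counter
from types import SimpleNamespace


def count_pos(tokens):
    """Tally coarse-grained POS tags by name for any iterable of
    objects that have a pos_ attribute (hypothetical helper)."""
    return Counter(token.pos_ for token in tokens)


# Stand-in tokens (assumption: tags copied by hand from the earlier table).
tags = ['PRON', 'VERB', 'PART', 'VERB', 'NOUN', 'PUNCT',
        'PRON', 'VERB', 'PRON', 'ADP', 'ADJ', 'NOUN', 'ADP']
tokens = [SimpleNamespace(pos_=tag) for tag in tags]

pos_counts = count_pos(tokens)
for name, count in sorted(pos_counts.items()):
    print(f'{name:{8}}: {count}')
```

The trade-off is that this loops over the tokens in Python, whereas count_by does the counting inside spaCy and returns integer IDs.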

Visualizing Parts of Speech Tags

Visualizing POS tags graphically is straightforward with the displacy module from the spaCy library. To visualize the POS tags inside a Jupyter notebook, call the render function from the displacy module, pass it the spaCy document and the style of the visualization, and set the jupyter parameter to True as shown below:

from spacy import displacy

sen = sp(u"I like to play football. I hated it in my childhood though")  
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})  

The output is a dependency tree in which each token is labeled with its POS tag, and the arcs show the dependency of each token on another.

If you want to visualize the POS tags outside the Jupyter notebook, then you need to call the serve function. It starts a local web server and renders the visualization as an HTML page that you can open in your browser. Execute the following script:

displacy.serve(sen, style='dep', options={'distance': 120})  

Once you execute the above script, you will see the following message:

Serving on port 5000...  
Using the 'dep' visualizer  

To view the dependency tree, open the following address in your browser: http://127.0.0.1:5000/. The dependency tree is displayed on that page.

Named Entity Recognition

Named entity recognition refers to identifying words or phrases in a text as named entities, e.g. the name of a person, place, or organization. Let's see how the spaCy library performs named entity recognition. Look at the following script:

import spacy  
sp = spacy.load('en_core_web_sm')

sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million')  

In the script above we created a simple spaCy document with some text. To find the named entities we can use the ents attribute, which returns a tuple of all the named entities in the document.

print(sen.ents)  

Output:

(Manchester United, Harry Kane, $90 million)

You can see that three named entities were identified. To see the details of each named entity, you can use its text and label_ attributes, along with the spacy.explain function, which takes the label string as a parameter.

for entity in sen.ents:  
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

In the output, you will see the name of the entity along with the entity type and a small description of the entity as shown below:

Manchester United - ORG - Companies, agencies, institutions, etc.  
Harry Kane - PERSON - People, including fictional  
$90 million - MONEY - Monetary values, including unit

You can see that "Manchester United" has been correctly identified as an organization. Similarly, "Harry Kane" has been identified as a person, and "$90 million" has been correctly identified as an entity of type MONEY.

Adding New Entities

You can also add new entities to an existing document. For instance, in the following example, "Nesfruita" is not identified as a company by the spaCy library.

sen = sp(u'Nesfruita is setting up a new company in India')  
for entity in sen.ents:  
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Output:

India - GPE - Countries, cities, states  

From the output, you can see that only India has been identified as an entity.

Now to add "Nesfruita" as an entity of type "ORG" to our document, we need to execute the following steps:

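from spacy.tokens import Span

# Look up the hash value of the 'ORG' label in the vocabulary's string store
ORG = sen.vocab.strings[u'ORG']
# Token 0 up to (but not including) token 1 covers only "Nesfruita"
new_entity = Span(sen, 0, 1, label=ORG)
# sen.ents is a tuple, so build a new list and assign it back
sen.ents = list(sen.ents) + [new_entity]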

First, we import the Span class from the spacy.tokens module. Next, we get the hash value of the ORG entity type from the string store of the document's vocabulary. After that, we create a span labeled with the hash value of ORG. Since "Nesfruita" is the first token in the document, the span starts at index 0 and ends at index 1 (the end index is exclusive). Finally, we add the new entity span to the document's entities. Now if you execute the following script, you will see "Nesfruita" in the list of entities.

for entity in sen.ents:  
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

The output of the script above looks like this:

Nesfruita - ORG - Companies, agencies, institutions, etc.  
India - GPE - Countries, cities, states  
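The start and end indices passed to Span follow Python's slice convention, with the end index excluded. The sketch below illustrates this on a plain list of token texts, written out by hand on the assumption that spaCy splits this sentence on whitespace. The helper `find_span` is hypothetical and only shows how the two indices for a one-word or multi-word entity could be located instead of hard-coded.

```python
def find_span(tokens, words):
    """Return (start, end), end exclusive, for the first place where the
    consecutive words occur in tokens, or None if they do not occur.
    Hypothetical helper, not part of spaCy."""
    n = len(words)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == words:
            return start, start + n
    return None


# Token texts written out by hand; with spaCy loaded this would be
# [token.text for token in sen].
tokens = ['Nesfruita', 'is', 'setting', 'up', 'a', 'new', 'company', 'in', 'India']

start, end = find_span(tokens, ['Nesfruita'])
print(start, end, tokens[start:end])

# A two-token phrase spans two indices, again with the end excluded.
print(find_span(tokens, ['new', 'company']))
```

The resulting pair could then be passed as the second and third arguments of Span in place of the literal 0 and 1.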

Counting Entities

In the case of POS tags, we could count the frequency of each POS tag in a document using the count_by method. However, there is no equivalent method that counts entity spans, so we count the frequency of each entity type manually. Suppose we have the following document along with its entities:

sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')  
for entity in sen.ents:  
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Output:

Manchester United - ORG - Companies, agencies, institutions, etc.  
Harry Kane - PERSON - People, including fictional  
$90 million - MONEY - Monetary values, including unit
David - PERSON - People, including fictional  
100 Million Dollars - MONEY - Monetary values, including unit  

To count the person type entities in the above document, we can use the following script:

len([ent for ent in sen.ents if ent.label_=='PERSON'])  

In the output, you will see 2 since there are 2 entities of type PERSON in the document.
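To count every entity type in one pass rather than one type at a time, a `collections.Counter` can be used. The following is a minimal sketch: the (text, label) pairs are copied by hand from the output above so that it runs without a model, and the variable names are illustrative only.

```python
from collections import Counter

# (text, label) pairs copied by hand from the output above; with spaCy
# loaded, the equivalent list would be
# [(ent.text, ent.label_) for ent in sen.ents].
entities = [
    ('Manchester United', 'ORG'),
    ('Harry Kane', 'PERSON'),
    ('$90 million', 'MONEY'),
    ('David', 'PERSON'),
    ('100 Million Dollars', 'MONEY'),
]

# Tally how many entities carry each label.
label_counts = Counter(label for _, label in entities)

for label, count in sorted(label_counts.items()):
    print(f'{label:{8}}: {count}')
```

Here `label_counts['PERSON']` gives the same count of 2 as the list comprehension above, and labels that never occur simply return 0.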

Visualizing Named Entities

Like the POS tags, we can also view named entities inside the Jupyter notebook as well as in the browser.

To do so, we will again use the displacy module. Look at the following example:

from spacy import displacy

sen = sp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')  
displacy.render(sen, style='ent', jupyter=True)  

You can see that the only difference between visualizing named entities and POS tags is that for named entities we pass 'ent' as the value of the style parameter. In the output, the named entities are highlighted in different colors along with their entity types.

You can also filter which entity types are displayed. To do so, put the entity types to display in a list, store that list under the ents key of a dictionary, and pass the dictionary to the options parameter of the render function as shown below:

# Show only the entities labeled ORG
ent_options = {'ents': ['ORG']}
displacy.render(sen, style='ent', jupyter=True, options=ent_options)

In the script above, we specified that only the entities of type ORG should be displayed, so only "Manchester United" is highlighted in the output.

Finally, you can also display named entities outside the Jupyter notebook. The following script serves the visualization so that you can view it in your browser:

displacy.serve(sen, style='ent')  

Now if you go to the address http://127.0.0.1:5000/ in your browser, you should see the named entities.

Conclusion

Parts of speech tagging and named entity recognition are important building blocks for many NLP tasks. In this article, we saw how Python's spaCy library can be used to perform POS tagging and named entity recognition with the help of different examples.

About Usman Malik
Paris (France)
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life