Python for NLP: Tokenization, Stemming, and Lemmatization with SpaCy Library

In the previous article, we started our discussion about how to do natural language processing with Python. We saw how to read and write text and PDF files. In this article, we will start working with the spaCy library to perform a few more basic NLP tasks such as tokenization, stemming and lemmatization.

Introduction to SpaCy

The spaCy library is one of the most popular NLP libraries along with NLTK. The basic difference between the two is that NLTK offers a wide variety of algorithms to solve each problem, whereas spaCy ships a single, carefully chosen algorithm per problem.

NLTK was released back in 2001 while spaCy is relatively new, first released in 2015. In this series of articles on NLP, we will mostly be dealing with spaCy, owing to its state-of-the-art nature. However, we will also touch on NLTK when it is easier to perform a task using NLTK rather than spaCy.

Installing spaCy

If you use the pip installer to install your Python libraries, go to the command line and execute the following statement:

$ pip install -U spacy

Otherwise if you are using Anaconda, you need to execute the following command on the Anaconda prompt:

$ conda install -c conda-forge spacy

Once you download and install spaCy, the next step is to download the language model. We will be using the English language model. The language model is used to perform a variety of NLP tasks, which we will see in a later section.

The following command downloads the small English language model (recent spaCy versions require the full model name; the old shortcut en has been removed):

$ python -m spacy download en_core_web_sm

Basic Functionality

Before we dive deeper into different spaCy functions, let's briefly see how to work with it.

As a first step, you need to import the spacy library as follows:

import spacy

Next, we need to load the spaCy language model.

sp = spacy.load('en_core_web_sm')

In the script above we use the load function from the spacy library to load the core English language model. The model is stored in the sp variable.

Let's now create a small document using this model. A document can be a sentence or a group of sentences and can have unlimited length. The following script creates a simple spaCy document.

sentence = sp(u'Manchester United is looking to sign a forward for $90 million')

spaCy automatically breaks your document into tokens when a document is created using the model.

A token simply refers to an individual part of a sentence having some semantic value. Let's see what tokens we have in our document:

for word in sentence:
    print(word.text)

The output of the script above looks like this:

Manchester
United
is
looking
to
sign
a
forward
for
$
90
million

These are the individual tokens in our document. We can also see the part of speech of each token using the .pos_ attribute, as shown below:

for word in sentence:
    print(word.text,  word.pos_)

Output:

Manchester PROPN
United PROPN
is VERB
looking VERB
to PART
sign VERB
a DET
forward NOUN
for ADP
$ SYM
90 NUM
million NUM

You can see that each word or token in our sentence has been assigned a part of speech. For instance, "Manchester" has been tagged as a proper noun, "looking" as a verb, and so on.

Finally, in addition to the parts of speech, we can also see the dependencies.

Let's create another document:

sentence2 = sp(u"Manchester United isn't looking to sign any forward.")

For dependency parsing, the attribute dep_ is used as shown below:

for word in sentence2:
    print(word.text,  word.pos_, word.dep_)

The output looks like this:

Manchester PROPN compound
United PROPN nsubj
is VERB aux
n't ADV neg
looking VERB ROOT
to PART aux
sign VERB xcomp
any DET advmod
forward ADV advmod
. PUNCT punct

From the output, you can see that spaCy is intelligent enough to find the dependencies between the tokens. For instance, the sentence contained the word isn't. The dependency parser has broken it down into two tokens, is and n't, and specifies that n't is actually a negation of the previous word.
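The dependency relations form a tree: each token points at a head token, and exactly one token (the ROOT) heads the whole sentence. As a quick plain-Python sketch, with the arcs hand-copied from the output above (the head column is not shown in that output, so the heads here are our reconstruction of the parse, not taken from spaCy):

```python
# Dependency arcs as (token, relation, head) triples, hand-copied
# from the parse of "Manchester United isn't looking to sign any forward."
arcs = [
    ("Manchester", "compound", "United"),
    ("United", "nsubj", "looking"),
    ("is", "aux", "looking"),
    ("n't", "neg", "looking"),
    ("looking", "ROOT", "looking"),
    ("to", "aux", "sign"),
    ("sign", "xcomp", "looking"),
    ("any", "advmod", "forward"),
    ("forward", "advmod", "sign"),
    (".", "punct", "looking"),
]

# The root is the token whose relation is ROOT.
root = next(token for token, relation, _ in arcs if relation == "ROOT")

# Its direct children are the tokens whose head is the root (excluding itself).
children = [token for token, _, head in arcs
            if head == root and token != root]

print(root)      # looking
print(children)  # ['United', 'is', "n't", 'sign', '.']
```

With real spaCy tokens you would read the head directly from the token.head attribute instead of building such triples by hand.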

For a detailed understanding of dependency parsing, refer to this article.

In addition to printing the words, you can also print sentences from a document.

document = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')

Now, we can iterate through each sentence using the following script:

for sentence in document.sents:
    print(sentence)

The output of the script looks like this:

Hello from Stackabuse.
The site with the best Python Tutorials.
What are you looking for?
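Under the hood, spaCy uses its statistical model to decide where sentences end. For comparison, here is a naive, purely rule-based splitter (a toy sketch, not what spaCy actually does) that splits after sentence-ending punctuation:

```python
import re

text = ("Hello from Stackabuse. The site with the best Python Tutorials. "
        "What are you looking for?")

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)

for s in sentences:
    print(s)
```

This happens to work on this text, but it breaks on abbreviations such as "U.S.A. Today", which is one reason spaCy's model-based sentence segmentation is more robust.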

You can also check if a sentence starts with a particular token or not. You can get individual tokens using an index and the square brackets, like an array:

document[4]

In the above script, we are retrieving the 5th token in the document. Keep in mind that indexes start from zero, and the period counts as a token. In the output you should see:

The

Now to see if any sentence in the document starts with The, we can use the is_sent_start attribute as shown below:

document[4].is_sent_start

In the output, you will see True since the token The is used at the start of the second sentence.

In this section, we saw a few basic operations of the spaCy library. Let's now dig deeper and see Tokenization, Stemming, and Lemmatization in detail.

Tokenization

As explained earlier, tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc.

Let's see spaCy tokenization in detail. Create a new document using the following script:

sentence3 = sp(u'"They\'re leaving U.K. for U.S.A."')
print(sentence3)

You can see the sentence contains quotes at the beginning and at the end. It also contains periods inside the abbreviations "U.K." and "U.S.A."

Let's see how spaCy tokenizes this sentence.

for word in sentence3:
    print(word.text)

Output:

"
They
're
leaving
U.K.
for
U.S.A.
"

In the output, you can see that spaCy has treated the opening and closing double quotes as separate tokens. However, it is intelligent enough not to split the periods used within abbreviations such as U.K. and U.S.A.
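To appreciate what spaCy's tokenizer is doing, compare it with a naive regex tokenizer on the same sentence (a toy sketch only, not spaCy's actual algorithm):

```python
import re

text = '"They\'re leaving U.K. for U.S.A."'

# Naive rule: a token is either a run of word characters
# or a single non-word, non-space character.
naive_tokens = re.findall(r"\w+|[^\w\s]", text)
print(naive_tokens)
```

The naive tokenizer shatters "U.K." into U, ., K, . and splits "They're" into They, ', re, producing 17 tokens where spaCy produces 8. spaCy's tokenizer applies exception lists and prefix/suffix rules to avoid exactly these mistakes.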

Let's see another tokenization example:

sentence4 = sp(u"Hello, I am non-vegetarian, email me the menu at [email protected]")
print(sentence4)

Here in the above sentence, we have a dash in the word "non-vegetarian" and in the email address. Let's see how spaCy will tokenize this:

for word in sentence4:
    print(word.text)

Output:

Hello
,
I
am
non
-
vegetarian
,
email
me
the
menu
at
[email protected]

It is evident from the output that spaCy detected the email address and kept it as a single token, whereas the word "non-vegetarian" was split at the hyphen.

Let's now see how we can count the words in the document:

len(sentence4)

In the output, you will see 14, which is the number of tokens in sentence4.

Detecting Entities

In addition to tokenizing the documents to words, you can also find if the word is an entity such as a company, place, building, currency, institution, etc.

Let's see a simple example of named entity recognition:

sentence5 = sp(u'Manchester United is looking to sign Harry Kane for $90 million')  

Let's first try to simply tokenize it:

for word in sentence5:
    print(word.text)

Output:

Manchester
United
is
looking
to
sign
Harry
Kane
for
$
90
million

We know that "Manchester United" refers to a single entity, so ideally it should not be treated as two unrelated tokens. Similarly, "Harry Kane" is the name of a person, and "$90 million" is a currency value. Plain tokenization does not capture these groupings.

This is where named entity recognition comes into play. To get the named entities from a document, you use the ents attribute. Let's retrieve the named entities from the above sentence. Execute the following script:

for entity in sentence5.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

In the above script, we print the text of the entity, the label of the entity and the detail of the entity. The output looks like this:

Output:

Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit

You can see that spaCy's named entity recognizer has successfully recognized "Manchester United" as an organization, "Harry Kane" as a person and "$90 million" as a currency value.
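Once you have the entities, a common next step is to count or group them by label. A minimal plain-Python sketch, assuming a list of (text, label) pairs in the shape that [(ent.text, ent.label_) for ent in doc.ents] would produce (the pairs below are hand-copied from the output above):

```python
from collections import Counter

# Hypothetical (text, label) pairs, as extracted from a spaCy document.
entities = [
    ("Manchester United", "ORG"),
    ("Harry Kane", "PERSON"),
    ("$90 million", "MONEY"),
]

# Count how many entities of each label the document contains.
label_counts = Counter(label for _, label in entities)
print(label_counts["ORG"])  # 1
```

On a larger document this kind of tally quickly shows, for example, which organizations or people dominate a text.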

Detecting Nouns

In addition to detecting named entities, you can also extract noun chunks, i.e. base noun phrases, using the noun_chunks attribute. Consider the following sentence:

sentence5 = sp(u'Latest Rumours: Manchester United is looking to sign Harry Kane for $90 million')  

Let's find the noun chunks in this sentence:

for noun in sentence5.noun_chunks:
    print(noun.text)

Output:

Latest Rumours
Manchester United
Harry Kane

From the output, you can see that a noun chunk can also be a named entity, and vice versa.

Stemming

Stemming refers to reducing a word to its root form. While performing natural language processing tasks, you will encounter various scenarios where you find different words with the same root. For instance: compute, computer, computing, computed, etc. You may want to reduce such words to their root form for the sake of uniformity. This is where stemming comes into play.
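The core idea behind algorithms like Porter's is suffix stripping. Here is a drastically simplified sketch in plain Python (a toy illustration only; the real Porter algorithm applies several ordered phases of context-sensitive rules):

```python
def naive_stem(word, suffixes=("ing", "ed", "er", "e")):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for token in ['compute', 'computer', 'computed', 'computing']:
    print(token, '-->', naive_stem(token))
```

Even this crude rule collapses all four words to "comput", which hints at both the power and the weakness of stemming: the variants are unified, but the result need not be a real word.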

It might be surprising to you but spaCy doesn't contain any function for stemming as it relies on lemmatization only. Therefore, in this section, we will use NLTK for stemming.

NLTK offers several stemmers; we will look at two of them, the Porter stemmer and the Snowball stemmer, which implement different algorithms.

Porter Stemmer

Let's see the Porter stemmer in action:

from nltk.stem.porter import PorterStemmer

Let's create an instance of the PorterStemmer class.

stemmer = PorterStemmer()

Suppose we have the following list and we want to reduce these words to their stems:

tokens = ['compute', 'computer', 'computed', 'computing']

The following script finds the stem for each word in the list using the Porter stemmer:

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

The output is as follows:

compute --> comput
computer --> comput
computed --> comput
computing --> comput

You can see that all four words have been reduced to "comput", which actually isn't a word at all.

Snowball Stemmer

The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. Let's see the Snowball stemmer in action:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

In the script above, we used the Snowball stemmer to find the stems of the same four words that we stemmed with the Porter stemmer. The output looks like this:

compute --> comput
computer --> comput
computed --> comput
computing --> comput

You can see that the results are the same: we still get "comput", which isn't a dictionary word.

This is where lemmatization comes in handy. Lemmatization reduces a word to its lemma, the base form as it appears in the dictionary. The lemmas returned by lemmatization are actual dictionary words and are semantically complete, unlike the stems returned by a stemmer.

Lemmatization

Though we cannot perform stemming with spaCy, we can perform lemmatization with it.

To do so, we use the lemma_ attribute of each token in a spaCy document. Suppose we have the following sentence:

sentence6 = sp(u'compute computer computed computing')

We can find the roots of all the words using spaCy lemmatization as follows:

for word in sentence6:
    print(word.text,  word.lemma_)

The output of the script above looks like this:

compute compute
computer computer
computed compute
computing computing

You can see that unlike stemming where the root we got was "comput", the roots that we got here are actual words in the dictionary.
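Conceptually, a lemmatizer maps each inflected form to its dictionary entry. Here is a tiny lookup-table sketch in plain Python (purely illustrative; spaCy's lemmatizer combines much larger lookup tables with rules and part-of-speech information):

```python
# A hand-made lookup table; real lemmatizers use large tables plus rules.
LEMMA_TABLE = {
    "computed": "compute",
    "computing": "compute",
    "written": "write",
    "has": "have",
    "been": "be",
    "released": "release",
}

def naive_lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return LEMMA_TABLE.get(word.lower(), word)

print(naive_lemmatize("written"))   # write
print(naive_lemmatize("computer"))  # computer
```

Unlike the suffix-stripping stemmers above, a lookup-based approach can only handle forms it has seen, which is why production lemmatizers supplement the table with rules.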

Lemmatization converts inflected words, such as past tense and participle forms, back to their base form. Look at the following example:

sentence7 = sp(u'A letter has been written, asking him to be released')

for word in sentence7:
    print(word.text + '  ===>', word.lemma_)

Output:

A ===> a
letter ===> letter
has ===> have
been ===> be
written ===> write
, ===> ,
asking ===> ask
him ===> -PRON-
to ===> to
be ===> be
released ===> release

You can clearly see from the output that inflected forms such as "written" and "released" have been converted to their base forms, i.e. "write" and "release".

Conclusion

Tokenization, Stemming and Lemmatization are some of the most fundamental natural language processing tasks. In this article, we saw how we can perform Tokenization and Lemmatization using the spaCy library. We also saw how NLTK can be used for stemming. In the next article, we will start our discussion about Vocabulary and Phrase Matching in Python.

About the author: Usman Malik, Paris, France. Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life