Introduction
TextBlob is a package built on top of two other packages: the Natural Language Toolkit, known mainly by its abbreviation NLTK, and Pattern. NLTK is a long-established package for text processing, or Natural Language Processing (NLP), while Pattern is built mainly for web mining.
TextBlob is designed to be easier to learn and work with than NLTK, while covering the same important NLP tasks, such as lemmatization, sentiment analysis, stemming, POS-tagging, noun phrase extraction, classification, translation, and more. You can see a complete list of tasks on TextBlob's PyPI page.
If you are looking for a practical overview of many NLP tasks that can be executed with TextBlob, take a look at our "Python for NLP: Introduction to the TextBlob Library" guide.
There are no special technical prerequisites for using TextBlob. The package works with both Python 2 and 3 (Python >= 2.7 or >= 3.5).
Also, in case you don't have any textual data at hand, TextBlob can download the necessary collections of language data (usually texts), called corpora, from NLTK's data repository.
Installing TextBlob
Let's start by installing TextBlob. If you are using a terminal, command-line, or command prompt, you can enter:
$ pip install textblob
Otherwise, if you are using a Jupyter Notebook, you can execute the command directly from the notebook by adding an exclamation mark ! at the beginning of the instruction:
!pip install textblob
Note: This process can take some time due to the broad number of algorithms and corpora that this library contains.
After installing TextBlob, to have text examples to work with, you can download the corpora by executing the python -m textblob.download_corpora command. Once again, you can execute it directly in the command line or in a notebook by preceding it with an exclamation mark, as shown below.
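In a notebook, that looks like this:
!python -m textblob.download_corpora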
When running the command, you should see the output below:
$ python -m textblob.download_corpora
[nltk_data] Downloading package brown to /Users/csamp/nltk_data...
[nltk_data] Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /Users/csamp/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/csamp/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /Users/csamp/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package conll2000 to /Users/csamp/nltk_data...
[nltk_data] Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data] /Users/csamp/nltk_data...
[nltk_data] Unzipping corpora/movie_reviews.zip.
Finished.
We have already installed the TextBlob package and its corpora. Now, let's understand more about lemmatization.
For more TextBlob content, check out our Simple NLP in Python with TextBlob: Tokenization, Simple NLP in Python with TextBlob: N-Grams Detection, and Sentiment Analysis in Python with TextBlob guides.
What is Lemmatization?
Before going deeper into the field of NLP, you should be able to recognize some key terms:
Corpus (or corpora in plural) - is a specific collection of language data (e.g., texts). Corpora are typically used for training various models of text classification or sentiment analysis, for instance.
Lemma - is the word you would look for in a dictionary. For instance, if you want to look at the definition for the verb "runs", you would search for "run".
Stem - is the part of a word that never changes across its inflected forms. Unlike a lemma, a stem is not necessarily a valid word by itself, as the sketch below shows.
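To make the last two terms concrete, here is a minimal sketch using TextBlob's Word class, with "studies" as an assumed example word:

from textblob import Word

word = Word("studies")
print(word.lemmatize())  # 'study' - the lemma, the dictionary form of the word
print(word.stem())       # 'studi' - the stem, not necessarily a real word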
What is lemmatization itself?
Lemmatization is the process of obtaining the lemmas of words from a corpus.
An illustration of this could be the following sentence:
- Input (corpus): Alice thinks she is lost, but then starts to find herself
- Output (lemmas): | Alice | think | she | is | lost | but | then | start | to | find | herself |
Notice that each word in the input sentence is lemmatized according to its context in the original sentence. For instance, "Alice" is a proper noun, so it stays the same, and the verbs "thinks" and "starts" are reduced to their base forms, "think" and "start".
Lemmatization is one of the basic stages of language processing. It brings words to their root forms or lemmas, which we would find if we were looking for them in a dictionary.
In the case of TextBlob, lemmatization is based on a database called WordNet, which is developed and maintained by Princeton University. Behind the scenes, TextBlob uses WordNet's morphy processor to obtain the lemma for a word.
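Because morphy looks words up by part of speech, TextBlob's Word.lemmatize() method accepts an optional POS argument and treats words as nouns by default. A small sketch of this behavior, using "lost" (from the illustration above) as the example word:

from textblob import Word

print(Word("lost").lemmatize())     # 'lost' - treated as a noun by default, unchanged
print(Word("lost").lemmatize("v"))  # 'lose' - lemmatized as a verb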
Note: For further reference on how lemmatization works in TextBlob, you can take a peek at the documentation.
You probably won't notice significant changes from lemmatization unless you're working with large amounts of text. In that case, lemmatization helps reduce the number of distinct word forms we might be searching for, while trying to preserve their context in the sentence. It can be applied further in developing machine translation models, search engine optimization, or various business applications.
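As a rough illustration of that reduction, here is a sketch that counts distinct word forms before and after lemmatizing a toy sentence (the text and the verb POS hint are assumed for the example):

from textblob import TextBlob

text = "She runs, he ran, they were running - runners run races."
words = TextBlob(text.lower()).words

print(len(set(words)))                            # number of distinct raw tokens
print(len(set(w.lemmatize("v") for w in words)))  # fewer distinct forms after lemmatizing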
Implementing Lemmatization in Code
First of all, it's necessary to create a TextBlob object and define a sample corpus that will be lemmatized later. In this initial step, you can either write or define a string of text to use, or use an example from the NLTK corpora we have downloaded. Let's go with the latter.
Choosing a Review from the NLTK Corpus
For example, let's try to obtain the lemmas for a movie review that is in the corpus. To do this, we import both the TextBlob library and the movie_reviews corpus from the nltk.corpus package:
# importing necessary libraries
from textblob import TextBlob
from nltk.corpus import movie_reviews
After importing, we can take a look at the movie review files with the fileids() method. Since this code is running in a Jupyter Notebook, we can directly execute:
movie_reviews.fileids()
This will return a list of 2,000 text file names containing negative and positive reviews:
['neg/cv000_29416.txt',
'neg/cv001_19502.txt',
'neg/cv002_17424.txt',
'neg/cv003_12683.txt',
'neg/cv004_12641.txt',
'neg/cv005_29357.txt',
'neg/cv006_17022.txt',
'neg/cv007_4992.txt',
'neg/cv008_29326.txt',
'neg/cv009_29417.txt',
...]
Note: If you are running the code in another way, for instance, in a terminal or IDE, you can print the response by executing print(movie_reviews.fileids()).
Judging by the neg prefix in the file names, we can assume that the list starts with the negative reviews and ends with the positive ones. We can look at a positive review by indexing from the end of the list. Here, we are choosing the tenth review from the end:
movie_reviews.fileids()[-10]
This results in:
'pos/cv990_11591.txt'
To examine the review sentences, we can pass the name of the review to the .sents() method, which outputs a list of all review sentences:
movie_reviews.sents('pos/cv990_11591.txt')
[['the', 'relaxed', 'dude', 'rides', 'a', 'roller', 'coaster',
'the', 'big', 'lebowski', 'a', 'film', 'review', 'by', 'michael',
'redman', 'copyright', '1998', 'by', 'michael', 'redman', 'the',
'most', 'surreal', 'situations', 'are', 'ordinary', 'everyday',
'life', 'as', 'viewed', 'by', 'an', 'outsider', '.'], ['when',
'those', 'observers', 'are', 'joel', 'and', 'ethan', 'coen', ',',
'the', 'surreal', 'becomes', 'bizarre', '.'], ...]
Let's store this list in a variable called pos_review:
pos_review = movie_reviews.sents("pos/cv990_11591.txt")
len(pos_review) #returns 63
Here, we can see that there are 63 sentences. Now, we can select one sentence to lemmatize, for instance, the 17th sentence (index 16):
sentence = pos_review[16]
type(sentence) # returns list
Creating a TextBlob Object
After selecting the sentence, we need to create a TextBlob object to be able to access the .lemmatize() method. TextBlob objects need to be created from strings. Since we have a list of tokens, we can convert it to a string with the str.join() method, joining on blank spaces:
sentence_string = ' '.join(sentence)
Now that we have our sentence string, we can pass it to the TextBlob constructor:
blob_object = TextBlob(sentence_string)
Once we have the TextBlob object, we can perform various operations, such as lemmatization.
Lemmatization of a Sentence
Finally, to get the lemmatized words, we simply retrieve the words attribute of the created blob_object. This gives us a list containing Word objects that behave very similarly to string objects:
# Word tokenization of the sentence corpus
corpus_words = blob_object.words
# To see all tokens
print('sentence:', corpus_words)
# To count the number of tokens
number_of_tokens = len(corpus_words)
print('\nnumber of tokens:', number_of_tokens)
The output commands should give you the following:
sentence: ['the', 'carpet', 'is', 'important', 'to', 'him', 'because', 'it', 'pulls', 'the', 'room', 'together', 'not', 'surprisingly', 'since', 'it', 's', 'virtually', 'the', 'only', 'object', 'there']
number of tokens: 22
To lemmatize the words, we can just use the .lemmatize() method:
corpus_words.lemmatize()
This gives us a lemmatized WordList object:
WordList(['the', 'carpet', 'is', 'important', 'to', 'him', 'because', 'it', 'pull', 'the',
'room', 'together', 'not', 'surprisingly', 'since', 'it', 's', 'virtually', 'the', 'only',
'object', 'there'])
Since this might be a little difficult to read, we can loop through the words and print each one before and after lemmatization:
for word in corpus_words:
    print(f'{word} | {word.lemmatize()}')
This results in:
the | the
carpet | carpet
is | is
important | important
to | to
him | him
because | because
it | it
pulls | pull
the | the
room | room
together | together
not | not
surprisingly | surprisingly
since | since
it | it
s | s
virtually | virtually
the | the
only | only
object | object
there | there
Notice how "pulls" changed to "pull"; apart from "it's", the other words were also lemmatized as expected. We can also see that "it's" was split apart because of its apostrophe, leaving a stray "s" token. This indicates we can further pre-process the sentence so that "it's" is considered a word instead of an "it" and an "s".
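As a rough sketch of one such pre-processing step, we could expand the split contraction before creating the blob. This assumes the contraction appears in the joined string as it ' s, with the apostrophe as its own token (consistent with the stray "s" we saw in the output):

# expanding the split contraction before building the blob,
# so no stray "s" token is produced
clean_sentence = sentence_string.replace("it ' s", "it is")
blob_object = TextBlob(clean_sentence)
print(blob_object.words)  # now contains 'it', 'is' instead of 'it', 's'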
Difference Between Lemmatization and Stemming
Lemmatization is often confused with another technique called stemming. This confusion occurs because both techniques are usually employed to reduce words. While lemmatization uses dictionaries and focuses on the context of words in a sentence, attempting to preserve it, stemming uses rules to remove word affixes, focusing on obtaining the stem of a word.
Let's quickly modify our for loop to look at these differences:
print('word | lemma | stem\n')
for word in corpus_words:
    print(f'{word} | {word.lemmatize()} | {word.stem()}')
This outputs:
the | the | the
carpet | carpet | carpet
is | is | is
important | important | import
to | to | to
him | him | him
because | because | becaus
it | it | it
pulls | pull | pull
the | the | the
room | room | room
together | together | togeth
not | not | not
surprisingly | surprisingly | surprisingli
since | since | sinc
it | it | it
s | s | s
virtually | virtually | virtual
the | the | the
only | only | onli
object | object | object
there | there | there
Looking at the output above, we can see how stemming can be problematic. It reduces "important" to "import", losing the original meaning of the word, which could now even be read as a verb; and it turns "because" into "becaus", which is not a word at all, the same as "togeth", "surprisingli", "sinc", and "onli".
There are clear differences between lemmatization and stemming, and understanding when to use each technique is key. Suppose you are optimizing a word search and the focus is on suggesting the largest possible number of similar words: which technique would you use? When word context doesn't matter, and retrieving "important" along with "import" is acceptable, the clear choice is stemming. On the other hand, if you are working on document text comparison, in which the position of the words in a sentence matters and the sense of "important" needs to be preserved and not confused with the verb "import", the best choice is lemmatization.
In a final scenario, suppose you are working on a word search followed by a comparison of the retrieved documents' text: what would you use? Both stemming and lemmatization, as sketched below.
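To make that combined scenario concrete, here is a minimal sketch of how stemming groups related word forms under one key while lemmatization keeps them distinct, using assumed example words:

from textblob import Word

candidates = ["import", "important", "importance", "imported"]

# all four forms collapse to the stem 'import', so a stem-based index
# would retrieve them together in a word search
print({w: Word(w).stem() for w in candidates})

# lemmatization (noun by default) keeps the forms distinct,
# preserving their senses for document comparison
print({w: Word(w).lemmatize() for w in candidates})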
We have understood the differences between stemming and lemmatization; now let's see how we can lemmatize the whole review instead of just a sentence.
Lemmatization of a Review
To lemmatize the entire review, we only need to modify the .join(). Instead of joining words in a sentence, we will join sentences in a review:
# joining each sentence with a new line between them, and a space between each word
pos_rev = '\n'.join(' '.join(sentence) for sentence in pos_review)
After transforming the corpus into a string, we can proceed in the same way as it was for the sentence to lemmatize it:
blob_object = TextBlob(pos_rev)
corpus_words = blob_object.words
corpus_words.lemmatize()
This generates a WordList object with the full review text lemmatized. Here, we are omitting some parts with an ellipsis (...) since the review is large, but you will be able to see it in full when you run the code. We can spot our sentence in the middle of it:
WordList(['the', 'relaxed', 'dude', 'ride', 'a', 'roller', 'coaster', 'the', 'big',
'lebowski', 'a', 'film', 'review', 'by', 'michael', 'redman', 'copyright', '1998', 'by',
'michael', 'redman', 'the', 'most', 'surreal', 'situation', 'are', 'ordinary', 'everyday',
'life', 'as', 'viewed', 'by', 'an', 'outsider', 'when', 'those', 'observer', 'are', 'joel',
(...)
'the', 'carpet', 'is', 'important', 'to', 'him', 'because', 'it', 'pull', 'the', 'room',
'together', 'not', 'surprisingly', 'since', 'it', 's', 'virtually', 'the', 'only', 'object',
'there'
(...)
'com', 'is', 'the', 'eaddress', 'for', 'estuff'])
Conclusion
After lemmatizing the sentence and the review, we can see that both extract the corpus words first. This means that lemmatization occurs at the word level, which also implies that it can be applied to a single word, a sentence, or a full text.
This also suggests that it might be slower than stemming, since the text must first be broken into tokens before lemmatization can be applied. And since lemmatization is context-sensitive, as we have seen, good pre-processing of the text is crucial before using it, ensuring the correct breakdown into tokens and appropriate part-of-speech tagging. Both will enhance results.
If you are not familiar with Part of Speech tagging (POS-tagging), check our Python for NLP: Parts of Speech Tagging and Named Entity Recognition guide.
We have also seen how lemmatization differs from stemming, another word-reduction technique that doesn't preserve context; for this reason, stemming is usually faster.
There are many ways to perform lemmatization, and TextBlob is a great library for getting started with NLP. It offers a simple API that allows users to quickly begin working on NLP tasks. Leave a comment if you have used lemmatization in a project or plan to use it.
Happy coding!