Introduction
This is the seventh article in my series of articles on Python for NLP. In my previous article, I explained how to perform topic modeling using Latent Dirichlet Allocation and Non-Negative Matrix factorization. We used the Scikit-Learn library to perform topic modeling.
In this article, we will explore TextBlob, which is another extremely powerful NLP library for Python. TextBlob is built upon NLTK and provides an easy-to-use interface to the NLTK library. We will see how TextBlob can be used to perform a variety of NLP tasks, ranging from part-of-speech tagging to sentiment analysis, and language translation to text classification.
The detailed download instructions for the library can be found at the official link. I would suggest that you install the TextBlob library as well as the sample corpora.
Here is the gist of the instructions linked above, but be sure to check the official documentation for more installation instructions if you need them:
$ pip install -U textblob
And to install the corpora:
$ python -m textblob.download_corpora
Let's now see the different functionalities of the TextBlob library.
Tokenization
Tokenization refers to splitting a large paragraph into sentences or words. Typically, a token refers to a word in a text document. Tokenization is pretty straightforward with TextBlob. All you have to do is import the TextBlob object from the textblob library, pass it the document that you want to tokenize, and then use the sentences and words attributes to get the tokenized sentences and words. Let's see this in action.
The first step is to import the TextBlob object:
from textblob import TextBlob
Next, you need to define a string that contains the text of the document. We will create a string that contains the first paragraph of the Wikipedia article on artificial intelligence.
document = ("In computer science, artificial intelligence (AI), \
sometimes called machine intelligence, is intelligence \
demonstrated by machines, in contrast to the natural intelligence \
displayed by humans and animals. Computer science defines AI \
research as the study of \"intelligent agents\": any device that \
perceives its environment and takes actions that maximize its\
chance of successfully achieving its goals.[1] Colloquially,\
the term \"artificial intelligence\" is used to describe machines\
that mimic \"cognitive\" functions that humans associate with other\
human minds, such as \"learning\" and \"problem solving\".[2]")
The next step is to pass this document as a parameter to the TextBlob class. The returned object can then be used to tokenize the document into words and sentences.
text_blob_object = TextBlob(document)
Now to get the tokenized sentences, we can use the sentences attribute:
document_sentence = text_blob_object.sentences
print(document_sentence)
print(len(document_sentence))
In the output, you will see the tokenized sentences along with the number of sentences.
[Sentence("In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals."), Sentence("Computer science defines AI research as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals."), Sentence("[1] Colloquially, the term "artificial intelligence" is used to describe machines that mimic "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving"."), Sentence("[2]")]
4
Similarly, the words attribute returns the tokenized words in the document.
document_words = text_blob_object.words
print(document_words)
print(len(document_words))
The output looks like this:
['In', 'computer', 'science', 'artificial', 'intelligence', 'AI', 'sometimes', 'called', 'machine', 'intelligence', 'is', 'intelligence', 'demonstrated', 'by', 'machines', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'displayed', 'by', 'humans', 'and', 'animals', 'Computer', 'science', 'defines', 'AI', 'research', 'as', 'the', 'study', 'of', 'intelligent', 'agents', 'any', 'device', 'that', 'perceives', 'its', 'environment', 'and', 'takes', 'actions', 'that', 'maximize', 'its', 'chance', 'of', 'successfully', 'achieving', 'its', 'goals', '1', 'Colloquially', 'the', 'term', 'artificial', 'intelligence', 'is', 'used', 'to', 'describe', 'machines', 'that', 'mimic', 'cognitive', 'functions', 'that', 'humans', 'associate', 'with', 'other', 'human', 'minds', 'such', 'as', 'learning', 'and', 'problem', 'solving', '2']
84
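Note that the words attribute strips punctuation. If you also want punctuation marks as tokens, you can use the blob's tokens attribute instead; here is a minimal sketch using the same document:
# tokens keeps punctuation such as commas and quotes as separate tokens
print(text_blob_object.tokens)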
Lemmatization
Lemmatization refers to reducing the word to its root form as found in a dictionary.
To perform lemmatization via TextBlob, you have to use the Word object from the textblob library, pass it the word that you want to lemmatize, and then call the lemmatize method.
from textblob import Word
word1 = Word("apples")
print("apples:", word1.lemmatize())
word2 = Word("media")
print("media:", word2.lemmatize())
word3 = Word("greater")
print("greater:", word3.lemmatize("a"))
In the script above, we perform lemmatization on the words "apples", "media", and "greater". In the output, you will see the words "apple" (the singular of "apples"), "medium" (the singular of "media"), and "great" (the positive degree of "greater"). Notice that for the word "greater", we pass "a" as a parameter to the lemmatize method. This specifically tells the method that the word should be treated as an adjective. By default, words are treated as nouns by the lemmatize() method. The complete list of part-of-speech parameters is as follows:
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
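For instance, here is a small sketch that passes "v" so that verbs are lemmatized correctly; without the tag, these words would be treated as nouns and returned unchanged:
from textblob import Word

# Irregular and inflected verbs are reduced to their base form
print("went:", Word("went").lemmatize("v"))        # go
print("running:", Word("running").lemmatize("v"))  # run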
Parts of Speech (POS) Tagging
Like the spaCy and NLTK libraries, the TextBlob library also contains functionality for POS tagging.
To find POS tags for the words in a document, all you have to do is use the tags attribute as shown below:
for word, pos in text_blob_object.tags:
    print(word + " => " + pos)
In the script above, we print the POS tags for all the words in the first paragraph of the Wikipedia article on artificial intelligence. The tags are printed in abbreviated form; to see the full form of each abbreviation, please consult this link.
Convert Text to Singular and Plural
TextBlob also allows you to convert text into a plural or singular form using the pluralize and singularize methods, respectively. Look at the following example:
text = ("Football is a good game. It has many health benefit")
text_blob_object = TextBlob(text)
print(text_blob_object.words.pluralize())
In the output, you will see the plural of all the words:
['Footballs', 'iss', 'some', 'goods', 'games', 'Its', 'hass', 'manies', 'healths', 'benefits']
Note that pluralize is applied to each word individually using simple heuristics, which is why non-noun words such as "is" and "has" become "iss" and "hass". Similarly, to singularize words you can use the singularize method as follows:
text = ("Footballs is a goods games. Its has many healths benefits")
text_blob_object = TextBlob(text)
print(text_blob_object.words.singularize())
The output of the script above looks like this:
['Football', 'is', 'a', 'good', 'game', 'It', 'ha', 'many', 'health', 'benefit']
Noun Phrase Extraction
Noun phrase extraction, as the name suggests, refers to extracting phrases that contain nouns. Let's find all the noun phrases in the first paragraph of the Wikipedia article on artificial intelligence that we used earlier.
To find noun phrases, you simply have to use the noun_phrases attribute on the TextBlob object. Look at the following example:
text_blob_object = TextBlob(document)
for noun_phrase in text_blob_object.noun_phrases:
    print(noun_phrase)
The output looks like this:
computer science
artificial intelligence
ai
machine intelligence
natural intelligence
computer
science defines
ai
intelligent agents
colloquially
artificial intelligence
describe machines
human minds
You can see all the noun phrases in our document.
Getting Words and Phrase Counts
In a previous section, we used Python's built-in len function to count the number of sentences, words, and noun phrases returned by the TextBlob object. We can use TextBlob's built-in methods for the same purpose.
To find the frequency of occurrence of a particular word, we have to pass the word as the index to the word_counts dictionary of the TextBlob object.
In the following example, we will count the number of instances of the word "intelligence" in the first paragraph of the Wikipedia article on Artificial Intelligence.
text_blob_object = TextBlob(document)
text_blob_object.word_counts['intelligence']
Another way is to simply call the count method on the words attribute, and pass the name of the word whose frequency of occurrence is to be found as shown below:
text_blob_object.words.count('intelligence')
It is important to mention that by default the search is not case-sensitive. If you want your search to be case-sensitive, you need to pass True as the value for the case_sensitive parameter, as shown below:
text_blob_object.words.count('intelligence', case_sensitive=True)
Like word counts, noun phrases can also be counted in the same way. The following example counts the occurrences of the phrase "artificial intelligence" in the paragraph.
text_blob_object = TextBlob(document)
text_blob_object.noun_phrases.count('artificial intelligence')
In the output, you will see 2.
Converting to Upper and Lowercase
TextBlob objects are very similar to strings. You can convert them to upper case or lower case, change their values, and concatenate them together as well. In the following script, we convert the text from the TextBlob object to upper case:
text = "I love to watch football, but I have never played it"
text_blob_object = TextBlob(text)
print(text_blob_object.upper())
In the output, you will see the string in upper case:
I LOVE TO WATCH FOOTBALL, BUT I HAVE NEVER PLAYED IT
Similarly, to convert the text to lowercase, we can use the lower() method as shown below:
text = "I LOVE TO WATCH FOOTBALL, BUT I HAVE NEVER PLAYED IT"
text_blob_object = TextBlob(text)
print(text_blob_object.lower())
Finding N-Grams
An n-gram refers to a contiguous sequence of n words in a sentence. For instance, for the sentence "I love watching football", some 2-grams would be (I love), (love watching) and (watching football). N-grams can play a crucial role in text classification.
In TextBlob, n-grams can be found by passing the value of n to the ngrams method of the TextBlob object. Look at the following example:
text = "I love to watch football, but I have never played it"
text_blob_object = TextBlob(text)
for ngram in text_blob_object.ngrams(2):
    print(ngram)
The output of the script looks like this:
['I', 'love']
['love', 'to']
['to', 'watch']
['watch', 'football']
['football', 'but']
['but', 'I']
['I', 'have']
['have', 'never']
['never', 'played']
['played', 'it']
This is especially helpful when training language models or doing any type of text prediction.
Spelling Corrections
Spelling correction is one of the unique functionalities of the TextBlob library. With the correct method of the TextBlob object, you can correct all the spelling mistakes in your text. Look at the following example:
text = "I love to watchf footbal, but I have neter played it"
text_blob_object = TextBlob(text)
print(text_blob_object.correct())
In the script above we made three spelling mistakes: "watchf" instead of "watch", "footbal" instead of "football", and "neter" instead of "never". In the output, you will see that these mistakes have been corrected by TextBlob, as shown below:
I love to watch football, but I have never played it
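If you want correction candidates rather than an automatic fix, the Word object also provides a spellcheck method that returns possible corrections along with confidence scores. Here is a minimal sketch:
from textblob import Word

# spellcheck() returns a list of (candidate, confidence) tuples
print(Word("footbal").spellcheck())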
Language Translation
One of the most powerful capabilities of the TextBlob library is translating from one language to another. On the backend, the TextBlob language translator uses the Google Translate API. Note that the translation and language detection features have been deprecated in recent versions of TextBlob (0.16 and later) in favor of the official Google Translate API, so you may need an older release for the following examples to work.
To translate from one language to another, you simply have to pass the text to the TextBlob object and then call the translate method on the object. The language code for the language that you want your text to be translated to is passed as a parameter to the method. Let's take a look at an example:
text_blob_object_french = TextBlob(u'Salut comment allez-vous?')
print(text_blob_object_french.translate(to='en'))
In the script above, we pass a sentence in the French language to the TextBlob object. Next, we call the translate method on the object and pass the language code en to the to parameter. The language code en corresponds to the English language. In the output, you will see the translation of the French sentence, as shown below:
Hi, how are you?
Let's take another example where we will translate from Arabic to English:
text_blob_object_arabic = TextBlob(u'مرحبا كيف حالك؟')
print(text_blob_object_arabic.translate(to='en'))
Output:
Hi, how are you?
Finally, using the detect_language method, you can also detect the language of the sentence. Look at the following script:
text_blob_object = TextBlob(u'Hola como estas?')
print(text_blob_object.detect_language())
In the output, you will see es, which stands for the Spanish language.
The language codes for all the languages can be found at this link.
Text Classification
TextBlob also provides basic text classification capabilities. However, I would not recommend TextBlob for text classification owing to its limited capabilities; if you have very limited data and want to quickly develop a very basic text classification model, then you may use TextBlob. For advanced models, I would recommend machine learning libraries such as Scikit-Learn or TensorFlow.
Let's see how we can perform text classification with TextBlob. The first thing we need is a training dataset and test data. The classification model will be trained on the training dataset and will be evaluated on the test dataset.
Suppose we have the following training and test data:
train_data = [
    ('This is an excellent movie', 'pos'),
    ('The move was fantastic I like it', 'pos'),
    ('You should watch it, it is brilliant', 'pos'),
    ('Exceptionally good', 'pos'),
    ("Wonderfully directed and executed. I like it", 'pos'),
    ('It was very boring', 'neg'),
    ('I did not like the movie', 'neg'),
    ("The movie was horrible", 'neg'),
    ('I will not recommend', 'neg'),
    ('The acting is pathetic', 'neg')
]
test_data = [
    ('Its a fantastic series', 'pos'),
    ('Never watched such a brillent movie', 'pos'),
    ("horrible acting", 'neg'),
    ("It is a Wonderful movie", 'pos'),
    ('waste of money', 'neg'),
    ("pathetic picture", 'neg')
]
The dataset contains some dummy reviews about movies. You can see that our training and test datasets consist of lists of tuples, where the first element of each tuple is the text or sentence, while the second element is the corresponding review or sentiment of the text.
We will train our classifier on the train_data and evaluate it on the test_data. To do so, we will use the NaiveBayesClassifier class from the textblob.classifiers module. The following script imports the class:
from textblob.classifiers import NaiveBayesClassifier
To train the model, we simply have to pass the training data to the constructor of the NaiveBayesClassifier class. The class will return an object trained on the dataset and capable of making predictions on the test set.
classifier = NaiveBayesClassifier(train_data)
Let's first make a prediction on a single sentence. To do so, we need to call the classify method and pass it the sentence. Look at the following example:
print(classifier.classify("It is very boring"))
It looks like a negative review. When you execute the above script, you will see neg in the output.
Similarly, the following script will return pos since the review is positive.
print(classifier.classify("It's a fantastic series"))
You can also make a prediction by passing our classifier to the classifier parameter of the TextBlob object. You then have to call the classify method on the TextBlob object to view the prediction.
sentence = TextBlob("It's a fantastic series.", classifier=classifier)
print(sentence.classify())
Finally, to find the accuracy of your algorithm on the test set, call the accuracy method on your classifier and pass it the test_data that we just created. Look at the following script:
classifier.accuracy(test_data)
In the output, you will see 0.66, which is the accuracy of the algorithm.
To find the most important features for the classification, the show_informative_features method can be used. The number of most important features to see is passed as a parameter.
classifier.show_informative_features(3)
The output looks like this:
Most Informative Features
contains(it) = False neg : pos = 2.2 : 1.0
contains(is) = True pos : neg = 1.7 : 1.0
contains(was) = True neg : pos = 1.7 : 1.0
In this output, the ratio on each line compares the likelihood of the feature under each label; for example, the first line says that the feature contains(it) = False occurs 2.2 times as often in negative training reviews as in positive ones. In this section, we tried to find the sentiment of a movie review using text classification. In reality, you don't have to perform text classification to find the sentiment of a sentence in TextBlob. The TextBlob library comes with a built-in sentiment analyzer, which we will see in the next section.
Sentiment Analysis
In this section, we will analyze the sentiment of the public reviews for different foods purchased via Amazon. We will use the TextBlob sentiment analyzer to do so.
The dataset can be downloaded from this Kaggle link.
As a first step, we need to import the dataset. We will only import the first 20,000 records due to memory constraints. You can import more records if you want. The following script imports the dataset:
import pandas as pd
import numpy as np
reviews_datasets = pd.read_csv(r'E:\Datasets\Reviews.csv')
reviews_datasets = reviews_datasets.head(20000)
reviews_datasets = reviews_datasets.dropna()  # dropna() returns a new data frame, so reassign the result
To see how our dataset looks, we will use the head method of the pandas data frame:
reviews_datasets.head()
From the output of the head method, you can see that the text of each review is contained in the Text column. The Score column contains the user's rating for the particular product, with 1 being the lowest and 5 being the highest rating.
Let's see the distribution of rating:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.distplot(reviews_datasets['Score'])
You can see that most of the ratings are highly positive, i.e., 5. Let's plot a bar plot for the ratings to have a better look at the number of records for each rating.
sns.countplot(x='Score', data=reviews_datasets)
The output shows that more than half of reviews have 5-star ratings.
Let's randomly select a review and find its polarity using TextBlob. Let's take a look at review number 350.
reviews_datasets['Text'][350]
Output:
'These chocolate covered espresso beans are wonderful! The chocolate is very dark and rich and the "bean" inside is a very delightful blend of flavors with just enough caffine to really give it a zing.'
It looks like the review is positive. Let's verify this using the TextBlob library. To find the sentiment, we have to use the sentiment attribute of the TextBlob object. The sentiment attribute returns a named tuple that contains the polarity and subjectivity of the review.
The value of polarity can be between -1 and 1, where reviews with negative polarities have negative sentiments, while reviews with positive polarities have positive sentiments.
The subjectivity value can be between 0 and 1. Subjectivity quantifies the amount of personal opinion versus factual information contained in the text. Higher subjectivity means that the text contains personal opinion rather than factual information.
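To get a feel for these two values, compare an opinionated sentence with a factual one. The following is a minimal sketch; the exact numbers depend on TextBlob's sentiment lexicon:
from textblob import TextBlob

# An opinionated sentence: expect clearly positive polarity and high subjectivity
print(TextBlob("The food was absolutely wonderful!").sentiment)

# A factual sentence: expect polarity and subjectivity close to zero
print(TextBlob("The parcel was delivered on Tuesday.").sentiment)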
Let's find the sentiment of the 350th review.
text_blob_object = TextBlob(reviews_datasets['Text'][350])
print(text_blob_object.sentiment)
The output looks like this:
Sentiment(polarity=0.39666666666666667, subjectivity=0.6616666666666667)
The output shows that the review is positive with a high subjectivity.
Let's now add a column for sentiment polarity in our dataset. Execute the following script:
def find_pol(review):
    return TextBlob(review).sentiment.polarity
reviews_datasets['Sentiment_Polarity'] = reviews_datasets['Text'].apply(find_pol)
reviews_datasets.head()
Now let's see the distribution of polarity in our dataset. Execute the following script:
sns.distplot(reviews_datasets['Sentiment_Polarity'])
It is evident from the resulting figure that most of the reviews are positive and have polarity between 0 and 0.5. This is natural since most of the reviews in the dataset have 5-star ratings.
Let's now plot the average polarity for each score rating.
sns.barplot(x='Score', y='Sentiment_Polarity', data=reviews_datasets)
The resulting plot clearly shows that the reviews with high rating scores have high positive polarities.
Let's now see some of the most negative reviews i.e. the reviews with a polarity value of -1.
most_negative = reviews_datasets[reviews_datasets.Sentiment_Polarity == -1].Text.head()
print(most_negative)
The output looks like this:
545 These chips are nasty. I thought someone had ...
1083 All my fault. I thought this would be a carton...
1832 Pop Chips are basically a horribly over-priced...
2087 I do not consider Gingerbread, Spicy Eggnog, C...
2763 This popcorn has alot of hulls I order 4 bags ...
Name: Text, dtype: object
Let's print the value of review number 545.
reviews_datasets['Text'][545]
In the output, you will see the following review:
'These chips are nasty. I thought someone had spilled a drink in the bag, no the chips were just soaked with grease. Nasty!!'
The output clearly shows that the review is highly negative.
Let's now see some of the most positive reviews. Execute the following script:
most_positive = reviews_datasets[reviews_datasets.Sentiment_Polarity == 1].Text.head()
print(most_positive)
The output looks like this:
106 not what I was expecting in terms of the compa...
223 This is an excellent tea. One of the best I h...
338 I like a lot of sesame oil and use it in salad...
796 My mother and father were the recipient of the...
1031 The Kelloggs Muselix are delicious and the del...
Name: Text, dtype: object
Let's see review 106 in detail:
reviews_datasets['Text'][106]
Output:
"not what I was expecting in terms of the company's reputation for excellent home delivery products"
You can see that though the review was not very positive, it has been assigned a polarity of 1 due to the presence of words like "excellent" and "reputation". It is important to know that the sentiment analyzer is not 100% error-proof and might predict the wrong sentiment in a few cases, such as the one we just saw.
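If the default pattern-based analyzer misfires on cases like this, TextBlob also ships with a NaiveBayesAnalyzer trained on a movie reviews corpus that you can swap in. Here is a minimal sketch; note that it is noticeably slower and reports class probabilities instead of polarity and subjectivity:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob(reviews_datasets['Text'][106], analyzer=NaiveBayesAnalyzer())

# Returns Sentiment(classification, p_pos, p_neg)
print(blob.sentiment)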
Let's now see review number 223 which also has been marked as positive.
reviews_datasets['Text'][223]
The output looks like this:
"This is an excellent tea. One of the best I have ever had. It is especially great when you prepare it with a samovar."
The output clearly depicts that the review is highly positive.
Conclusion
Python's TextBlob library is one of the most famous and widely used natural language processing libraries. This article explains several functionalities of the TextBlob library, such as tokenization, lemmatization, part-of-speech tagging, sentiment analysis, text classification, and language translation, in detail.
In the next article I'll go over the Pattern library, which provides a lot of really useful functions for determining attributes about sentences, as well as tools for retrieving data from social networks, Wikipedia, and search engines.