Introduction
In the field of Natural Language Processing (NLP), one of the fundamental tasks is Part of Speech (PoS) tagging. PoS tagging involves assigning grammatical categories, such as noun, verb, or adjective, to the words in a sentence. This process plays an important role in many NLP applications, including text analysis, information retrieval, and machine translation.
In this article, we will explore how to perform PoS tagging using the TextBlob library in Python.
What is Part of Speech (PoS) Tagging?
Part of Speech tagging is the process of labeling words in a sentence with their respective grammatical categories. Each word is assigned a tag based on its syntactic role and function within the sentence. These PoS tags provide valuable information about the word's behavior and its relationship with other words in the sentence.
For example, consider the sentence: "The cat is sleeping." Here, "The" is a determiner, "cat" is a noun, "is" is a verb (third person singular present), and "sleeping" is a verb (present participle).
Installing TextBlob
Before we dive into PoS tagging with TextBlob, let's make sure that we have the necessary libraries installed. To install TextBlob, you can use the following command:
$ pip install textblob
Additionally, we need to install the required language resources by running the following command:
$ python -m textblob.download_corpora
Preprocessing Steps
Before performing PoS tagging, it can help to preprocess the text by removing unnecessary elements and normalizing the words. Typical preprocessing steps include removing punctuation, converting text to lowercase, and handling contractions. Expanding contractions is a form of normalization, like converting "can't" to "cannot". Keep in mind that taggers use capitalization and punctuation as cues, so aggressive normalization can change the tags you get.
Let's take a look at an example to understand these steps better:
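import re

# Small illustrative mapping of contractions; extend it to suit your data
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Expand contractions (e.g., "can't" becomes "cannot")
    # while the apostrophes are still present
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Example usage
sentence = "I can't wait to see the movie!"
preprocessed_sentence = preprocess_text(sentence)
print(preprocessed_sentence)
In the above code, we define a preprocess_text function that takes a sentence as input and performs the preprocessing steps. The function converts the text to lowercase, expands contractions using a small lookup table (CONTRACTIONS, an illustrative mapping rather than a complete list), and then removes punctuation using a regular expression. Contractions are expanded first because stripping the apostrophe would turn "can't" into "cant". Note that TextBlob's correct method performs spelling correction and does not expand contractions, which is why a lookup table is used here. This preprocessing results in the following output:
i cannot wait to see the movie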
Meaning Mapping Table of PoS Tags
To understand the PoS tags assigned by TextBlob, it's helpful to have a "meaning mapping" table that describes each tag and gives examples. TextBlob uses the Penn Treebank tag set, which is summarized here:
Tag | Description | Examples |
---|---|---|
CC | Coordinating conjunction | and, or, but |
CD | Cardinal number | 1, 2, 3 |
DT | Determiner | the, a, an |
EX | Existential there | there |
FW | Foreign word | bonjour, hola |
IN | Preposition/subordinating conjunction | in, on, after |
JJ | Adjective | beautiful, happy |
JJR | Adjective, comparative | bigger, stronger |
JJS | Adjective, superlative | biggest, strongest |
LS | List item marker | 1), 2), a) |
MD | Modal | can, could, may |
NN | Noun, singular or mass | cat, dog, happiness |
NNS | Noun, plural | cats, dogs, books |
NNP | Proper noun, singular | John, London, Google |
NNPS | Proper noun, plural | Smiths, Americans, Alps |
PDT | Predeterminer | all, both, half |
POS | Possessive ending | 's, ' |
PRP | Personal pronoun | I, you, he |
PRP$ | Possessive pronoun | my, your, his |
RB | Adverb | quickly, very |
RBR | Adverb, comparative | faster, earlier, better |
RBS | Adverb, superlative | fastest, earliest, best |
RP | Particle | up, off, down |
SYM | Symbol | $, %, + |
TO | to | to |
UH | Interjection | oh, wow, hey |
VB | Verb, base form | eat, run, play |
VBD | Verb, past tense | ate, ran, played |
VBG | Verb, gerund or present participle | eating, running, playing |
VBN | Verb, past participle | eaten, run, played |
VBP | Verb, non-3rd person singular present | eat, run, play |
VBZ | Verb, 3rd person singular present | eats, runs, plays |
WDT | Wh-determiner | which, what |
WP | Wh-pronoun | who, what, whom |
WP$ | Possessive wh-pronoun | whose |
WRB | Wh-adverb | where, when, how |
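If you want these descriptions available in code, one option is a plain dictionary lookup. The following is a minimal sketch: TAG_DESCRIPTIONS and describe_tag are hypothetical names introduced for illustration, and the dictionary holds only a handful of entries copied from the table above.

```python
# Partial lookup table; the descriptions are copied from the table above.
# TAG_DESCRIPTIONS and describe_tag are illustrative names, not part of TextBlob.
TAG_DESCRIPTIONS = {
    "DT": "Determiner",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "RB": "Adverb",
    "VBG": "Verb, gerund or present participle",
    "VBZ": "Verb, 3rd person singular present",
}

def describe_tag(tag):
    """Return a readable description for a tag, or a fallback for unknown tags."""
    return TAG_DESCRIPTIONS.get(tag, "Unknown tag")

print(describe_tag("VBZ"))  # Verb, 3rd person singular present
print(describe_tag("XYZ"))  # Unknown tag
```

Using dict.get with a fallback keeps the helper from raising a KeyError when a tag is missing from the partial table.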
Basic Implementation (Extracting All PoS)
Now, let's explore the basic implementation of PoS tagging using TextBlob. The following code snippet demonstrates how to extract and print all the PoS tags from a given sentence:
from textblob import TextBlob
sentence = "The cat is sleeping."
blob = TextBlob(sentence)
for word, tag in blob.tags:
    print(word, "-", tag)
In the code above, we create a TextBlob object by passing the sentence to it. Then, we iterate over each word and its corresponding PoS tag using the tags property of the TextBlob object, which returns a list of (word, tag) tuples and omits punctuation. We print each word and its tag on its own line. The output would be the following:
The - DT
cat - NN
is - VBZ
sleeping - VBG
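Once you have the (word, tag) pairs, a common next step is to summarize them. The sketch below counts how often each tag occurs. To keep it runnable without TextBlob, it uses a hard-coded list copied from the output above; count_tags is a hypothetical helper name, and in practice you would pass blob.tags instead.

```python
from collections import Counter

# (word, tag) pairs in the same shape that blob.tags returns,
# copied from the output above so the snippet runs without TextBlob.
tagged_words = [("The", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sleeping", "VBG")]

def count_tags(pairs):
    """Count how many times each PoS tag occurs."""
    return Counter(tag for _, tag in pairs)

tag_counts = count_tags(tagged_words)
for tag, count in tag_counts.most_common():
    print(tag, count)
```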
Advanced Implementation (Selective Extraction of PoS)
In some cases, we may only be interested in extracting specific PoS tags. Using the data returned by TextBlob, we can perform selective extraction by specifying the desired tags and filtering on those.
Here's an example:
from textblob import TextBlob
sentence = "The cat is sleeping."
blob = TextBlob(sentence)
desired_tags = ["NN", "VB"]
selected_words = [word for word, tag in blob.tags if tag in desired_tags]
print(selected_words)
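In the above code, we define a list of desired PoS tags (desired_tags). We then create a new list (selected_words) using a list comprehension that filters the words based on their tags. Because the comparison is an exact match, "VB" selects only base-form verbs, so "is" (VBZ) and "sleeping" (VBG) are excluded. Finally, we print the selected words: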
['cat']
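If you would rather select whole families of tags, such as every noun or every verb form, one approach is to match on the tag prefix instead of the full tag. This is a minimal sketch: select_by_prefix is a hypothetical helper, and the hard-coded pairs stand in for blob.tags so the snippet runs on its own.

```python
# (word, tag) pairs in the same shape that blob.tags returns.
tagged_words = [("The", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sleeping", "VBG")]

def select_by_prefix(pairs, prefixes):
    """Keep words whose tag starts with any of the given prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches
    return [word for word, tag in pairs if tag.startswith(tuple(prefixes))]

nouns_and_verbs = select_by_prefix(tagged_words, ["NN", "VB"])
print(nouns_and_verbs)  # ['cat', 'is', 'sleeping']
```

Prefix matching treats "NN" as covering NN, NNS, NNP, and NNPS, and "VB" as covering all six verb tags, which is usually what is meant by "all nouns and verbs".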
Drawbacks and Improvements
While TextBlob provides a convenient way to perform PoS tagging, it is important to note that it may not always produce perfect results. The accuracy of PoS tagging heavily relies on the quality of the underlying model and the context of the text being analyzed. In cases where high precision is required, more advanced techniques and models may be necessary.
To improve the accuracy of PoS tagging, you can consider using other libraries, such as spaCy or NLTK, which offer more sophisticated PoS tagging models. Additionally, fine-tuning or training custom models on domain-specific data can help improve the results for specific tasks.
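To judge whether a different tagger actually does better on your text, one option is to compare its output against a small hand-labeled sample. The sketch below computes token-level accuracy. The function name tagging_accuracy and both tag lists are hypothetical and exist only to illustrate the calculation; they are not results from any real tagger.

```python
def tagging_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Hypothetical data for illustration only: tagger output vs. hand-labeled tags.
predicted_tags = ["DT", "NN", "VBZ", "JJ"]
gold_tags = ["DT", "NN", "VBZ", "VBG"]

print(tagging_accuracy(predicted_tags, gold_tags))  # 0.75
```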
Conclusion
PoS tagging is a fundamental task in Natural Language Processing, and TextBlob provides a simple and accessible way to perform it in Python. In this article, we explored the concept of PoS tagging, installed the necessary libraries, preprocessed the text, and implemented basic and selective PoS tagging using TextBlob.
We also went through the meaning mapping table of PoS tags and highlighted the drawbacks and potential improvements. Armed with this knowledge, you can now leverage PoS tagging in your NLP projects to gain valuable insights from text data.