Introduction
In today's digital world, there is a vast amount of text data created and transferred in the form of news, tweets, and social media posts. Can you imagine the time and effort needed to process them manually? Fortunately, Natural Language Processing (NLP) techniques help us manipulate, analyze, and interpret text data quickly and efficiently. NLP is a branch of Artificial Intelligence, where we train machines to understand human language and perform tasks ranging from summarization and translation to sentiment analysis.
One of the common requirements in NLP is dealing with singular and plural forms of words and converting one to another. In this article, we'll discuss how to perform pluralization and singularization of words using the Python package TextBlob.
What do singularization and pluralization mean?
We know nouns can be singular or plural, such as book/books, cat/cats, and sweet/sweets. We need techniques to convert text to plural/singular forms to help in data preprocessing and to achieve better accuracy in translation, text analysis, and text generation. This task of conversion from singular to plural and vice versa is referred to as word inflection in NLP.
Let's look at the common patterns/rules for converting singular to plural in the English language:
- Adding "-s" at the end: This is the most common pattern. For example, cot -> cots, desk -> desks, lake -> lakes. Removing the "s" is the corresponding rule for plural-to-singular conversion.
- Adding "-es" at the end: This rule is usually followed for words ending in "s", "x", "z", "ch", or "sh" sounds. Examples: bus -> buses, church -> churches, brush -> brushes.
- Replacing "-y" with "-ies": This applies to nouns ending in a consonant followed by "y". Examples: puppy -> puppies, city -> cities.
- Irregular patterns: Some nouns do not follow any of the patterns mentioned above. For example, child -> children, foot -> feet, mouse -> mice. Handling these can be a bit tricky.
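The four patterns above can be sketched in a few lines of plain Python. This is only an illustration of the rules, not how TextBlob implements them; the function name naive_pluralize and the small IRREGULAR_PLURALS table are assumptions made up for this example.

```python
# Hypothetical, deliberately tiny table of irregular nouns (the last rule).
IRREGULAR_PLURALS = {"child": "children", "foot": "feet", "mouse": "mice"}


def naive_pluralize(noun: str) -> str:
    """Apply the four rules from the list above, most specific first."""
    word = noun.lower()
    # Irregular nouns are looked up rather than computed.
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    # Consonant + "y" becomes "ies" (but "day" keeps its "y").
    if len(word) > 1 and word.endswith("y") and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    # Endings with an s, x, z, ch, or sh sound take "es".
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    # Everything else takes a plain "s".
    return word + "s"


for noun in ["desk", "bus", "church", "puppy", "day", "mouse"]:
    print(noun, "->", naive_pluralize(noun))
```

A handful of suffix checks covers the regular cases, but every irregular noun needs its own table entry, which is why a library with a maintained rule set is usually the more practical choice.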
Introduction to TextBlob and Installation
The Python programming language provides a variety of packages that allow us to implement various tasks in NLP. One of the widely used packages is TextBlob, which offers functions to easily perform various NLP tasks, including singularization/pluralization. This library is built on top of NLTK (Natural Language Toolkit) and is easy to learn. You can check out the official documentation of TextBlob to learn about all the functions it offers.
Let's start by installing the library using the pip package manager by running the command below.
$ pip install textblob
Once the installation is complete, you can import the library into your Python notebook or script.
from textblob import Word
You can import the Word class from the module. When a single word is passed as a string to this class, a TextBlob Word object is created, on which you can call methods for word inflection, lemmatization, and more. For whole sentences or documents, use the TextBlob class instead, which also handles tokenization.
In the snippet below, we create a Word object named 'doc1' by passing a single-word string.
word = "bus"
doc1 = Word(word)
In the next sections, we'll show how different functions can be used on the TextBlob class object to perform pluralization and singularization.
Pluralization with TextBlob
We can easily convert a noun from its singular to plural form using the pluralize() function in TextBlob. Let's look at how to get the plural form of 'puppy' in the example code below. Simply create a TextBlob object of the word, and call the function pluralize() on it.
from textblob import Word
blob = Word("puppy")
pluralized_word = blob.pluralize()
print(pluralized_word)
# Output: puppies
TextBlob can handle most of the common patterns of pluralization, including irregular ones. Let's check this out with a few more examples:
plural_form = Word("box").pluralize()
print(plural_form) # Output: boxes
plural_form = Word("man").pluralize()
print(plural_form) # Output: men
plural_form = Word("tooth").pluralize()
print(plural_form) # Output: teeth
plural_form = Word("ox").pluralize()
print(plural_form) # Output: oxen
From the above output, you can observe that TextBlob handles irregular patterns like man->men, tooth->teeth, and ox->oxen as well.
Now, let us consider the word "water." Can you convert it to plural? No! Words like water and furniture are referred to as uncountable nouns and remain unchanged between singular/plural forms. Fortunately, TextBlob is equipped to understand these cases and does not modify them.
# Pluralization of Uncountable Nouns
plural_form = Word('water').pluralize()
print(plural_form) # Output: water
plural_form = Word("information").pluralize()
print(plural_form) # Output: information
How to Pluralize Specific Words in a Sentence?
Often, while dealing with text documents, you may want to modify particular words. Let's say a text is "The container had multiple box filled with canned goods", and you wish to convert 'box' to 'boxes' to make it grammatically correct. First, create a TextBlob object for the sentence. On this object, use the 'words' property to get a list of the words in the sentence. Now, you can use the index of a word to convert it into its plural form. Note that you also need to import the NLTK package and download its 'punkt' tokenizer data, because TextBlob depends on it to split the sentence into words.
import nltk
nltk.download('punkt')
from textblob import TextBlob
sentence = "The container had multiple box filled with canned goods"
# Create a TextBlob object
blob = TextBlob(sentence)
# Get the words in the sentence, copied into a plain list so it can be modified
words = list(blob.words)
# Pluralize the 5th word (index 4) and put it back in place
words[4] = words[4].pluralize()
# Rebuild and print the modified sentence
print(" ".join(words))
The modified sentence you get in the output would be:
# Output: The container had multiple boxes filled with canned goods
The output is as desired, with the changes made. This method can be used to singularize or pluralize all nouns in an entire text document too.
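Because the word list returned by TextBlob leaves out punctuation, joining it back together is lossy for longer texts, so it can help to substitute words in place instead. The sketch below does this with a regular expression. The names inflect_selected and add_es are hypothetical, introduced only for illustration, and add_es is a placeholder transform; in practice you could pass a function that calls Word(w).pluralize() from TextBlob.

```python
import re
from typing import Callable, Set


def inflect_selected(text: str, targets: Set[str], transform: Callable[[str], str]) -> str:
    """Apply `transform` to every word listed in `targets`, leaving the rest of the text intact."""

    def replace(match):
        word = match.group(0)
        # Compare case-insensitively so "Box" and "box" are both matched.
        return transform(word) if word.lower() in targets else word

    # Only runs of letters are rewritten, so punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z]+", replace, text)


def add_es(word: str) -> str:
    """Placeholder transform for this demo; not a general pluralizer."""
    return word + "es"


print(inflect_selected("The container had multiple box, all sealed.", {"box"}, add_es))
```

Passing the transform as an argument keeps the text handling separate from the inflection logic, so the same helper works for singularization as well.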
Singularization with TextBlob
Converting a noun from its plural form to its singular form is known as singularization and is commonly used to obtain base words. To avoid redundancy while working with large text corpora, plural forms are often reduced to their root or singular form. It helps us standardize the text data and reduce the dimensionality of the vocabulary. Similar to pluralization, TextBlob also provides a built-in method singularize() to handle singularization.
As we did in the previous section, create a TextBlob class object for your word and call the singularize() method on it. This method can handle different rules of singularization, including irregular cases. I have demonstrated this with a diverse set of examples below.
# Removing ‘s’
singular_form = Word("curtains").singularize()
print(singular_form) # Output: curtain
# Removing ‘es’
singular_form = Word("mattresses").singularize()
print(singular_form) # Output: mattress
# Changing internal vowels ('ee' to 'oo')
singular_form = Word("geese").singularize()
print(singular_form) # Output: goose
# Completely irregular form
singular_form = Word("mice").singularize()
print(singular_form) # Output: mouse
How can you singularize all nouns in your document?
As we discussed at the beginning of this section, standardizing the nouns in the text is a common need in NLP. Let's say we have a paragraph in our document as shown below.
# Define the paragraph
paragraph = "The bookshop around the corner sells all genres of books, including fiction, non-fiction, autobiographies, and much more. The quality of the books is amazing, and they are all bestsellers. I recently bought the New York Times Bestseller Atomic Habits from there."
If you notice, the words 'book' and 'books', 'bestseller' and 'bestsellers' are present in this. For contextual understanding in NLP, only the root word is essential. Hence, we can convert them into standard singular forms and reduce the size of our vocabulary. I'll now show you how to do this easily with TextBlob.
import nltk
from textblob import TextBlob
# Download the part-of-speech tagger data needed to extract nouns
nltk.download('averaged_perceptron_tagger')
# Create a TextBlob object
blob = TextBlob(paragraph)
# Keep the words tagged as nouns (tags starting with 'NN') and singularize them
nouns = [word.singularize() for word, tag in blob.tags if tag.startswith('NN')]
# Get unique singularized nouns
unique_nouns = set(nouns)
for noun in unique_nouns:
    print(noun)
# Output (printed one per line; order may vary): fiction, corner, book, genre, quality, shop, habit, autobiography
From the output, you can verify that every noun has been converted to its singular form (standardized)!
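To make the vocabulary reduction concrete, the sketch below counts distinct tokens before and after singularization. To stay dependency-free it uses a hypothetical stand-in, toy_singularize, which only strips a trailing "s" from longer words; it is an assumption for this demo, and in practice you would call TextBlob's singularize() instead.

```python
import re
from collections import Counter


def toy_singularize(token: str) -> str:
    """Hypothetical stand-in for a real singularizer: strips one trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


text = "The books are bestsellers. Every book here is a bestseller."
tokens = re.findall(r"[a-z]+", text.lower())

# Count each distinct token before and after normalization.
raw_counts = Counter(tokens)
normalized_counts = Counter(toy_singularize(t) for t in tokens)

print(len(raw_counts), "distinct tokens before")
print(len(normalized_counts), "distinct tokens after")
print("book:", normalized_counts["book"], "bestseller:", normalized_counts["bestseller"])
```

After normalization, 'books' and 'book' share one vocabulary entry, as do 'bestsellers' and 'bestseller', so their counts are merged.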
How to Define Custom Rules for Singularization & Pluralization?
While TextBlob's built-in functions singularize() and pluralize() provide accurate results for most nouns in English, they aren't foolproof. For words they cannot handle, you can define custom rules that are checked first, falling back to the default rules for everything else.
To help you understand better, I'll walk you through some common cases where setting custom rules may be necessary:
- Loanwords and Non-English Words: TextBlob's default rules are primarily designed for English words. If you encounter words that follow different pluralization or singularization patterns, such as Latin-derived plurals, you may need to set custom rules. For example: cactus -> cacti, octopus -> octopi.
- Domain-specific Terms: Certain domain-specific terms may have unique plural or singular forms that are not covered by the default rules. Examples: virus -> viri (a non-standard plural; standard English uses "viruses"), FAQ -> FAQs (acronyms).
What can we do in these cases?
You can use an alternative Python package called 'pattern' to easily define custom rules for word inflection. Import the singularize and pluralize functions from the pattern.en module. Next, we can define custom singularization and pluralization rules using dictionaries as shown below.
#!pip install pattern
from pattern.en import singularize, pluralize
# Custom plural forms that take precedence over the default rules
custom_plural_rules = {'virus': 'viri', 'SAT': 'SATs'}
# Define a function that checks the custom rules before falling back to pattern
def custom_plural(word):
    if word in custom_plural_rules:
        return custom_plural_rules[word]
    return pluralize(word)
# Example usage
print(custom_plural('virus')) # Output: viri
If the input word is present in the dictionary, our custom rules will be applied. Otherwise, the default rules will be used for conversion.
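The same dictionary can serve both directions if you invert it for singularization. The sketch below wraps this idea in a small helper. The name make_inflector and the two fallback functions are hypothetical, introduced only for illustration; the fallbacks are trivial placeholders where you would normally pass the pluralize and singularize functions from pattern.en.

```python
from typing import Callable, Dict, Tuple

Inflect = Callable[[str], str]


def make_inflector(
    custom_plurals: Dict[str, str],
    fallback_pluralize: Inflect,
    fallback_singularize: Inflect,
) -> Tuple[Inflect, Inflect]:
    """Return (pluralize, singularize) functions that check the custom rules first."""
    # Invert the mapping so each custom plural maps back to its singular.
    custom_singulars = {plural: singular for singular, plural in custom_plurals.items()}

    def pluralize(word: str) -> str:
        if word in custom_plurals:
            return custom_plurals[word]
        return fallback_pluralize(word)

    def singularize(word: str) -> str:
        if word in custom_singulars:
            return custom_singulars[word]
        return fallback_singularize(word)

    return pluralize, singularize


# Placeholder fallbacks for this demo only; they are not real inflection rules.
def plain_add_s(word: str) -> str:
    return word + "s"


def plain_strip_s(word: str) -> str:
    return word[:-1] if word.endswith("s") else word


pluralize, singularize = make_inflector(
    {"virus": "viri", "SAT": "SATs"}, plain_add_s, plain_strip_s
)
print(pluralize("virus"))   # custom rule
print(singularize("SATs"))  # inverted custom rule
print(pluralize("desk"))    # fallback
```

Keeping the rules in one dictionary means the plural and singular mappings cannot drift apart as you add domain-specific terms.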
When should you choose TextBlob and why?
One of the main advantages of TextBlob is its easy-to-use syntax and API, which makes it beginner-friendly. Apart from singularization/pluralization tasks, we can also perform a diverse set of tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, and more, as it comes with pretrained models. It integrates seamlessly with other tools commonly used in the Python ecosystem, such as NLTK.
On the other hand, TextBlob may not be the best choice for large-scale text-processing tasks that are computationally intensive or when precise control is needed. In these cases, you can consider alternative packages such as spaCy and Stanford CoreNLP. spaCy is designed for large-scale text processing and offers lemmatization, which reduces plural nouns to their base form, although it does not include a built-in pluralization function. You can check out the spaCy documentation for details.
Conclusion
The TextBlob library's built-in methods pluralize() and singularize() are efficient ways to perform word inflection tasks. These methods can handle the different patterns of singular/plural nouns in the English language. TextBlob can also handle most irregular patterns. These features and functionalities make TextBlob a great choice for word inflection tasks.
If you want to understand how to perform other NLP techniques in TextBlob, you can check these articles. Pattern is another useful package if you want to define custom rules of inflection for your domain-specific terms. You can evaluate TextBlob and its alternatives based on your performance needs. I hope you enjoyed the read!