Simple NLP in Python With TextBlob: Tokenization

Introduction

The amount of textual data on the Internet has grown significantly over the past decades. Processing this much information has to be automated, and the TextBlob package is one of the simpler ways to get started with NLP (Natural Language Processing).

It provides a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, tokenization, sentiment analysis, classification, translation, and more.

No special technical prerequisites are needed to use this library. For instance, older TextBlob releases support both Python 2 and 3, while recent releases require Python 3. If you don't have any text data for the project you want to work on, TextBlob provides the necessary corpora from NLTK.

Installing TextBlob

Let's start out by installing TextBlob and the NLTK corpora:

$ pip install -U textblob
$ python -m textblob.download_corpora

Note: This process can take some time because of the large number of algorithms and corpora that this library contains.
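
Once the install commands have finished, a quick sanity check can confirm that the packages are visible to your interpreter. This is only a minimal sketch using the standard library; is_installed is a hypothetical helper name, and it only checks that the packages can be found, not that the corpora were downloaded:

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


# "textblob" and "nltk" are the import names of the packages installed above.
for name in ("textblob", "nltk"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```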

What is Tokenization?

Before going deeper into the field of NLP, you should be able to recognize these key terms:

  • Corpus (plural: corpora) - a collection of language data (e.g. texts). Corpora are typically used to train models, for instance for text classification or sentiment analysis.

  • Token - a string split off from the original text; in other words, a single unit of tokenization output.

What is tokenization itself?

Tokenization (also called word segmentation when it is applied to words) is the process of splitting the text of a corpus into smaller units, i.e. tokens, such as sentences or words.

An illustration of this could be the following sentence:

  • Input (corpus): The evil that men do lives after them

  • Output (tokens): | The | evil | that | men | do | lives | after | them |

Here, the input sentence is tokenized on the basis of the spaces between words. You can also split a single word into characters (e.g. a-p-p-l-e from apple) or split a text into sentences.
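
To make the idea concrete before bringing in TextBlob, here is a minimal sketch that uses nothing but Python's built-in string handling. It illustrates the concept only and is not how TextBlob works internally:

```python
sentence = "The evil that men do lives after them"

# Word-level tokens: split on runs of whitespace.
word_tokens = sentence.split()
print(word_tokens)
print(len(word_tokens))

# Character-level tokens: every character of a single word.
char_tokens = list("apple")
print(char_tokens)
```

Splitting on spaces works for this sentence because it contains no punctuation. Real text needs a smarter tokenizer, which is where a library like TextBlob comes in.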

Tokenization is one of the basic and crucial stages of language processing. It turns unstructured text into data that can be processed further, for example when developing models for machine translation, search engine optimization, or various business inquiries.

Implementing Tokenization in Code

First of all, we need to define a sample corpus to tokenize and then create a TextBlob object from it. For example, let's tokenize part of the poem "If" by R. Kipling:

from textblob import TextBlob

# Creating the corpus
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

Once the corpus is defined, it should be passed as an argument to the TextBlob constructor:

blob_object = TextBlob(corpus)

Once constructed, we can perform various operations on this blob_object. It already holds our corpus and exposes it, processed to a degree, through attributes such as words and sentences.

Word Tokenization

Finally, to get the tokenized words, we simply access the words attribute of the created blob_object. This gives us a list of Word objects, which behave very similarly to str objects:

from textblob import TextBlob

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

blob_object = TextBlob(corpus)

# Word tokenization of the sample corpus
corpus_words = blob_object.words
# To see all tokens
print(corpus_words)
# To count the number of tokens
print(len(corpus_words))

The print statements should give you the following:

['If', 'you', 'can', 'force', 'your', 'heart', 'and', 'nerve', 'and', 'sinew', 'to', 'serve', 'your', 'turn', 'long', 'after', 'they', 'are', 'gone', 'And', 'so', 'hold', 'on', 'when', 'there', 'is', 'nothing', 'in', 'you', 'except', 'the', 'Will', 'which', 'says', 'to', 'them', 'Hold', 'on']
38
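
The word list above has no punctuation in it. To build some intuition for that behavior, here is a rough standard-library approximation based on a regular expression. It is only a sketch and not TextBlob's actual tokenizer, which relies on NLTK; the name approx_words is made up for this example:

```python
import re

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

# \w+ matches runs of letters, digits and underscores, so whitespace and
# punctuation act as separators and never end up inside a token.
approx_words = re.findall(r"\w+", corpus)

print(approx_words[:5])
print(len(approx_words))
```

For this particular corpus the approximation also yields 38 tokens. It would behave differently on text with contractions, since a word such as don't is cut at the apostrophe by this pattern.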

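It's worth noting that, by default, the words attribute splits the text on whitespace and also strips punctuation, which is why the period and the colon are missing from the output above. We can swap in a different tokenizer, for example one that uses TAB as the delimiting character. In that case, we read the result from the tokens attribute, which uses the tokenizer passed to the constructor: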

from textblob import TextBlob
from nltk.tokenize import TabTokenizer

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. 	And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

tokenizer = TabTokenizer()
blob_object = TextBlob(corpus, tokenizer = tokenizer)

# Word tokenization of the sample corpus
corpus_words = blob_object.tokens
# To see all tokens
print(corpus_words)

Note that we've added a TAB after the first sentence here. Now, the list of tokens looks something like:

['If you can force your heart and nerve and sinew to serve your turn long after they are gone. ', "And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'\n"]
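
Conceptually, tokenizing on a TAB is close to calling str.split on the tab character. The sketch below shows this with plain Python and writes the TAB as the escape sequence "\t" so it is visible in the source. It is an illustration of the idea under that assumption, not the NLTK implementation itself:

```python
# The corpus is built from two string literals; "\t" marks the TAB that
# separates the two sentences.
corpus = (
    "If you can force your heart and nerve and sinew to serve your turn "
    "long after they are gone. "
    "\tAnd so hold on when there is nothing in you except the Will which "
    "says to them: 'Hold on!'\n"
)

# Splitting on the tab character yields one token per tab-separated chunk.
tab_tokens = corpus.split("\t")

print(len(tab_tokens))
for token in tab_tokens:
    # repr() makes trailing spaces and newlines visible.
    print(repr(token))
```

Note that nothing is stripped: the space before the TAB and the newline at the end stay attached to their tokens.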

nltk.tokenize contains other tokenization options as well. If you don't pass a tokenizer explicitly, TextBlob falls back to its own default word tokenizer, which is built on NLTK. Other than TabTokenizer, NLTK also contains useful tokenizers such as SpaceTokenizer, LineTokenizer, BlanklineTokenizer and WordPunctTokenizer.

A full list can be found in the NLTK documentation.
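
To get a feel for what line-based and word-plus-punctuation tokenizers produce, here is a rough standard-library sketch. The sample text is made up for this example, and both snippets are approximations of the general idea rather than the NLTK classes themselves:

```python
import re

# Hypothetical two-line sample text, used only for this illustration.
text = "Hold on! Don't let go.\nAnd so hold on."

# Line-based tokenization: one token per non-empty line.
line_tokens = [line for line in text.splitlines() if line.strip()]
print(line_tokens)

# Word-and-punctuation tokenization: runs of word characters and runs of
# punctuation become separate tokens.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", "Hold on! Don't let go.")
print(word_punct_tokens)
```

Unlike the default words attribute, this style of tokenization keeps punctuation marks as tokens of their own, which can matter for tasks where punctuation carries meaning.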

Sentence Tokenization

To tokenize at the sentence level, we'll build a blob_object from the same corpus. This time, instead of the words attribute, we'll use the sentences attribute. This returns a list of Sentence objects:

from textblob import TextBlob

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

blob_object = TextBlob(corpus)

# Sentence tokenization of the sample corpus
corpus_sentence = blob_object.sentences
# To see all sentences
print(corpus_sentence)
# To count the number of sentences
print(len(corpus_sentence))

Output:

[Sentence("If you can force your heart and nerve and sinew to serve your turn long after they are gone."), Sentence("And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'")]
2
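
For comparison, a naive sentence splitter can be sketched with a single regular expression that breaks the text after a period, exclamation mark or question mark followed by whitespace. This is a deliberately simplistic illustration, not the approach TextBlob takes, and naive_sentences is a name made up for this example:

```python
import re

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

# Split wherever whitespace follows sentence-ending punctuation.
# strip() removes the trailing newline so it does not create an empty item.
naive_sentences = re.split(r"(?<=[.!?])\s+", corpus.strip())

for sentence in naive_sentences:
    print(sentence)
print(len(naive_sentences))
```

This gives two sentences for our corpus, but it would wrongly split after abbreviations such as "Dr." or "e.g.", which is one reason to prefer a dedicated sentence tokenizer.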

Conclusion

Tokenization is a very important data pre-processing step in NLP that involves breaking a text down into smaller chunks called tokens. These tokens can be individual words, sentences or characters from the original text.

TextBlob is a great library to get into NLP with since it offers a simple API that lets users quickly jump into performing NLP tasks.

In this article, we discussed just one of the NLP tasks that TextBlob deals with. In the next articles in this series, we will take a look at how to solve more complex issues, such as dealing with word inflections, plural and singular forms of words, and more.
