Python for NLP: Working with the Gensim Library (Part 1)

This is the 10th article in my series on Python for NLP. In my previous article, I explained how the StanfordCoreNLP library can be used to perform different NLP tasks.

In this article, we will explore the Gensim library, which is another extremely useful NLP library for Python. Gensim was primarily developed for topic modeling. However, it now supports a variety of other NLP tasks such as converting words to vectors (word2vec), converting documents to vectors (doc2vec), finding text similarity, and text summarization.

In this article and the next article of the series, we will see how the Gensim library is used to perform these tasks.

Installing Gensim

If you use the pip installer to install your Python libraries, you can use the following command to download the Gensim library:

$ pip install gensim

Alternatively, if you use the Anaconda distribution of Python, you can execute the following command to install the Gensim library:

$ conda install -c anaconda gensim

Let's now see how we can perform different NLP tasks using the Gensim library.

Creating Dictionaries

Statistical algorithms work with numbers, whereas natural languages contain data in the form of text. Therefore, a mechanism is needed to convert words to numbers. Similarly, after processing the numbers, we need a way to convert them back to text.

One way to achieve this type of functionality is to create a dictionary that assigns a numeric ID to every unique word in the document. The dictionary can then be used to find the numeric equivalent of a word and vice versa.
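To make the idea concrete before turning to Gensim, here is a minimal plain-Python sketch of such a two-way mapping. The helper name build_vocabulary and the sample sentence are made up for illustration, and assigning IDs in sorted order is an assumption of this sketch:

```python
def build_vocabulary(tokens):
    """Assign consecutive integer IDs to the unique tokens, in sorted order."""
    token_to_id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
    # The reverse mapping lets us go from a number back to the word.
    id_to_token = {idx: token for token, idx in token_to_id.items()}
    return token_to_id, id_to_token

tokens = "the cat sat on the mat".split()
token_to_id, id_to_token = build_vocabulary(tokens)

print(token_to_id)                      # word -> ID
print(id_to_token[token_to_id["cat"]])  # ID -> word round trip
```

Each unique word gets exactly one ID, no matter how many times it occurs in the text.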

Creating Dictionaries using In-Memory Objects

It is super easy to create dictionaries that map words to IDs using Python's Gensim library. Look at the following script:

import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
             sometimes called machine intelligence, is intelligence
             demonstrated by machines, in contrast to the natural intelligence
             displayed by humans and animals. Computer science defines
             AI research as the study of intelligent agents: any device that
             perceives its environment and takes actions that maximize its chance
             of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)

print("The dictionary has: " +str(len(gensim_dictionary)) + " tokens")

for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

In the script above, we first import the gensim library along with the corpora module from the library. Next, we have some text (which is the first part of the first paragraph of the Wikipedia article on Artificial Intelligence) stored in the text variable.

To create a dictionary, we need a list of words from our text (also known as tokens). In the following line, we iterate over the items in the text list (here a single document) and split each one into words on whitespace:

tokens = [[token for token in sentence.split()] for sentence in text]

We are now ready to create our dictionary. To do so, we can use the Dictionary class of the corpora module and pass it the list of tokens.

Finally, to print the contents of the newly created dictionary, we can use the token2id attribute of the Dictionary object. The output of the script above looks like this:

The dictionary has: 46 tokens
(AI),                    0
AI                       1
Computer                 2
In                       3
achieving                4
actions                  5
agents:                  6
and                      7
animals.                 8
any                      9
artificial              10
as                      11
by                      12
called                  13
chance                  14
computer                15
contrast                16
defines                 17
demonstrated            18
device                  19
displayed               20
environment             21
goals.                  22
humans                  23
in                      24
intelligence            25
intelligence,           26
intelligent             27
is                      28
its                     29
machine                 30
machines,               31
maximize                32
natural                 33
of                      34
perceives               35
research                36
science                 37
science,                38
sometimes               39
study                   40
successfully            41
takes                   42
that                    43
the                     44
to                      45

The output shows each unique word in our text along with the numeric ID that the word has been assigned. The word or token is the key of the dictionary and the ID is the value. You can also look up the ID assigned to an individual word using the following script:

print(gensim_dictionary.token2id["study"])

In the script above, we pass the word "study" as the key to our dictionary. In the output, you should see the corresponding value i.e. the ID of the word "study", which is 40.

Similarly, you can use the following script to find the key or word for a specific ID.

print(list(gensim_dictionary.token2id.keys())[list(gensim_dictionary.token2id.values()).index(40)])
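The one-liner above scans the keys and values every time it is called. If you need many reverse lookups, one option is to invert the mapping once and reuse it. The sketch below uses a small hand-written stand-in for token2id (only three entries, copied from the output above) so that it runs on its own:

```python
# A small stand-in for gensim_dictionary.token2id, with values taken from the output above.
token2id = {"study": 40, "successfully": 41, "takes": 42}

# Build the inverse mapping once; every later lookup is a plain dict access.
id2token = {token_id: token for token, token_id in token2id.items()}

print(id2token[40])
```

The Dictionary object can also be indexed by ID directly, e.g. gensim_dictionary[40], which is the form used later in this article when we print word frequencies.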

To print the tokens and their corresponding IDs we used a for-loop. However, you can directly print the tokens and their IDs by printing the dictionary, as shown here:

print(gensim_dictionary.token2id)

The output is as follows:

{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45}

The output might not be as clear as the one printed using the loop, although it still serves its purpose.
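One thing the output makes visible is that splitting on whitespace keeps punctuation and letter case, so "intelligence" and "intelligence," end up as two different tokens, as do "Computer" and "computer". If that is not what you want, you can normalize the text before building the dictionary. The following is only a rough sketch using the standard re module; the function name normalize_tokens is hypothetical and the regular expression is a simplifying assumption that keeps plain ASCII letters only:

```python
import re

def normalize_tokens(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

sample = "In computer science, artificial intelligence (AI), is intelligence."

print(sample.split())            # punctuation stays attached to the words
print(normalize_tokens(sample))  # punctuation removed, everything lowercased
```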

Let's now see how we can add more tokens to an existing dictionary using a new document. Look at the following script:

text = ["""Colloquially, the term "artificial intelligence" is used to
           describe machines that mimic "cognitive" functions that humans
           associate with other human minds, such as "learning" and "problem solving"""]

tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary.add_documents(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)

In the script above we have a new document that contains the second part of the first paragraph of the Wikipedia article on Artificial Intelligence. We split the text into tokens and then simply call the add_documents method to add the tokens to our existing dictionary. Finally, we print the updated dictionary on the console.

The output of the code looks like this:

The dictionary has: 65 tokens
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45, '"artificial': 46, '"cognitive"': 47, '"learning"': 48, '"problem': 49, 'Colloquially,': 50, 'associate': 51, 'describe': 52, 'functions': 53, 'human': 54, 'intelligence"': 55, 'machines': 56, 'mimic': 57, 'minds,': 58, 'other': 59, 'solving': 60, 'such': 61, 'term': 62, 'used': 63, 'with': 64}

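You can see that we now have 65 tokens in our dictionary, while previously we had 46 tokens. The tokens that were already in the dictionary kept their IDs, and the 19 new tokens were assigned IDs 46 to 64.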
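The updating behavior can be sketched in plain Python as follows. This is only an illustration of the idea, not Gensim's actual implementation; the standalone add_documents function and the sample tokens are made up for the example:

```python
def add_documents(token_to_id, documents):
    """Give each unseen token the next free ID; known tokens keep their IDs."""
    for document in documents:
        for token in sorted(set(document)):
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

vocab = add_documents({}, [["machines", "think"]])
size_before = len(vocab)

# "machines" is already known, so only "humans" and "mimic" are added.
add_documents(vocab, [["machines", "mimic", "humans"]])

print(size_before, len(vocab))
print(vocab)
```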

Creating Dictionaries using Text Files

In the previous section, we had in-memory text. What if we want to create a dictionary by reading a text file from the hard drive? To do so, we can use the simple_preprocess method from the gensim.utils module. The method converts a string into a list of lowercase tokens. If we read the text file line by line and pass each line to simple_preprocess, we don't have to load the complete text file into memory in order to create a dictionary.

Before executing the next example, create a file "file1.txt" and add the following text to the file (this is the first half of the first paragraph of the Wikipedia article on Global Warming).

Global warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming. Though earlier geological periods also experienced episodes of warming, the term commonly refers to the observed and continuing increase in average air and ocean temperatures since 1900 caused mainly by emissions of greenhouse gasses in the modern industrial economy.

Now let's create a dictionary that will contain tokens from the text file "file1.txt":

from gensim import corpora
from gensim.utils import simple_preprocess

# Each line of the file is tokenized on its own, so the file is never loaded in full
gensim_dictionary = corpora.Dictionary(simple_preprocess(sentence, deacc=True) for sentence in open(r'E:\text files\file1.txt', encoding='utf-8'))

print(gensim_dictionary.token2id)

In the script above we read the text file "file1.txt" line by line and pass each line to the simple_preprocess method. The method returns the tokens in each line, converted to lowercase and with punctuation removed. Very short tokens such as "a" are discarded, and deacc=True additionally strips accents. The tokens are then used to create the dictionary. In the output, you should see the tokens and their corresponding IDs, as shown below:

{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47}
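To see the line-by-line idea in isolation, here is a small sketch that does not depend on Gensim. The generator iter_line_tokens is a hypothetical stand-in for the combination of open and simple_preprocess, and its regular expression and length filter only roughly approximate what simple_preprocess does. A temporary file is used so that the sketch runs anywhere:

```python
import os
import re
import tempfile

def iter_line_tokens(path):
    """Yield one list of lowercase tokens per line, reading the file lazily."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            words = re.findall(r"[a-z]+", line.lower())
            # Drop very short and very long tokens, roughly like simple_preprocess does.
            yield [word for word in words if 2 <= len(word) <= 15]

# Write a throwaway two-line file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Global warming is a long-term rise\nin the average temperature\n")

line_tokens = list(iter_line_tokens(tmp.name))
os.remove(tmp.name)

print(line_tokens)
```

Because the function is a generator, only one line is held in memory at a time until the caller asks for the next one.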

Similarly, we can create a dictionary by reading multiple text files. Create another file "file2.txt" and add the following text to the file (the second part of the first paragraph of the Wikipedia article on Global Warming):

In the modern context the terms global warming and climate change are commonly used interchangeably, but climate change includes both global warming and its effects, such as changes to precipitation and impacts that differ by region. Many of the observed warming changes since the 1950s are unprecedented in the instrumental temperature record, and in historical and paleoclimate proxy records of climate change over thousands to millions of years.

Save "file2.txt" in the same directory as "file1.txt".

The following script reads both the files and then creates a dictionary based on the text in the two files:

import os
from gensim import corpora
from gensim.utils import simple_preprocess

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            for sentence in open(os.path.join(self.dir_path, file_name), encoding='utf-8'):
                yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"
gensim_dictionary = corpora.Dictionary(ReturnTokens(path_to_text_directory))

print(gensim_dictionary.token2id)

In the script above we have a class ReturnTokens, whose constructor takes the path of the directory that contains "file1.txt" and "file2.txt" as its only parameter. Inside the __iter__ method we iterate through all the files in the directory and then read each file line by line. The simple_preprocess method creates tokens for each line. The tokens for each line are handed to the caller, here the Dictionary constructor, one line at a time using the "yield" keyword.

In the output, you should see the following tokens along with their IDs:

{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47, 'are': 48, 'context': 49, 'interchangeably': 50, 'terms': 51, 'used': 52, 'as': 53, 'both': 54, 'but': 55, 'changes': 56, 'includes': 57, 'its': 58, 'precipitation': 59, 'such': 60, 'differ': 61, 'impacts': 62, 'instrumental': 63, 'many': 64, 'record': 65, 'region': 66, 'that': 67, 'unprecedented': 68, 'historical': 69, 'millions': 70, 'over': 71, 'paleoclimate': 72, 'proxy': 73, 'records': 74, 'thousands': 75, 'years': 76}
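Why use a class with an __iter__ method rather than a plain generator function? One reason is that such an object can be iterated more than once, which matters if you want to build both a dictionary and a corpus from the same files. The sketch below illustrates this with a temporary directory. The class name LineTokens and the file names are made up for the example, and plain str.split is used instead of simple_preprocess to keep it free of dependencies:

```python
import os
import tempfile

class LineTokens:
    """Re-iterable stream of tokens: one list per line, across all files in a directory."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # sorted() makes the file order, and therefore the token order, reproducible.
        for file_name in sorted(os.listdir(self.dir_path)):
            with open(os.path.join(self.dir_path, file_name), encoding="utf-8") as handle:
                for line in handle:
                    yield line.lower().split()

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, body in [("a.txt", "global warming\n"), ("b.txt", "climate change\n")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(body)

    stream = LineTokens(tmp_dir)
    first_pass = list(stream)
    second_pass = list(stream)  # a fresh iterator is created for every loop

print(first_pass)
print(first_pass == second_pass)
```

Note that os.listdir does not guarantee any particular order, so without the sorted call the IDs assigned to tokens could differ between machines.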

Creating Bag of Words Corpus

Dictionaries contain mappings between words and their corresponding numeric values. Bag of words corpora in the Gensim library are based on dictionaries and contain the ID of each word along with the frequency of occurrence of the word.

Creating Bag of Words Corpus from In-Memory Objects

Look at the following script:

import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
           sometimes called machine intelligence, is intelligence
           demonstrated by machines, in contrast to the natural intelligence
           displayed by humans and animals. Computer science defines
           AI research as the study of intelligent agents: any device that
           perceives its environment and takes actions that maximize its chance
           of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

print(gensim_corpus)

In the script above, we have text which we split into tokens. Next, we initialize an empty Dictionary object from the corpora module. The object has a doc2bow method, which converts a list of tokens into a bag of words. With allow_update=True, it performs two tasks:

  • If a token does not exist in the dictionary yet, it adds the token to the dictionary and assigns it a new ID
  • It counts how often each token occurs in the document and returns the result as a list of (ID, frequency) tuples

The output of the above script looks like this:

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1)]]

The output might not make sense at first glance. The first tuple (0, 1) means that the word with ID 0 occurred once in the text. Similarly, (25, 3) means that the word with ID 25 occurred three times in the document.
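If it helps, the same bookkeeping can be sketched in a few lines of plain Python with collections.Counter. This is an illustration of the idea rather than Gensim's implementation; the function doc_to_bow and the sample sentence are made up for the example:

```python
from collections import Counter

def doc_to_bow(token_to_id, document):
    """Return sorted (token_id, count) pairs, adding unseen tokens to the mapping."""
    for token in sorted(set(document)):
        token_to_id.setdefault(token, len(token_to_id))
    counts = Counter(token_to_id[token] for token in document)
    return sorted(counts.items())

vocab = {}
bow = doc_to_bow(vocab, "the cat saw the other cat".split())

print(vocab)  # token -> ID
print(bow)    # (ID, frequency) pairs
```

Here "cat" and "the" each occur twice, so their IDs are paired with a count of 2.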

Let's now print the word and the frequency count to make things clear. Add the following lines of code at the end of the previous script:

word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]
print(word_frequencies)

The output looks like this:

[[('(AI),', 1), ('AI', 1), ('Computer', 1), ('In', 1), ('achieving', 1), ('actions', 1), ('agents:', 1), ('and', 2), ('animals.', 1), ('any', 1), ('artificial', 1), ('as', 1), ('by', 2), ('called', 1), ('chance', 1), ('computer', 1), ('contrast', 1), ('defines', 1), ('demonstrated', 1), ('device', 1), ('displayed', 1), ('environment', 1), ('goals.', 1), ('humans', 1), ('in', 1), ('intelligence', 3), ('intelligence,', 1), ('intelligent', 1), ('is', 1), ('its', 3), ('machine', 1), ('machines,', 1), ('maximize', 1), ('natural', 1), ('of', 2), ('perceives', 1), ('research', 1), ('science', 1), ('science,', 1), ('sometimes', 1), ('study', 1), ('successfully', 1), ('takes', 1), ('that', 2), ('the', 2), ('to', 1)]]

From the output, you can see that the word "intelligence" appears three times. Similarly, the word "that" appears twice.

Creating Bag of Words Corpus from Text Files

Like dictionaries, we can also create a bag of words corpus by reading a text file. Look at the following code:

from gensim import corpora
from gensim.utils import simple_preprocess

tokens = [simple_preprocess(sentence, deacc=True) for sentence in open(r'E:\text files\file1.txt', encoding='utf-8')]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]
word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

In the script above, we created a bag of words corpus using "file1.txt". Each line of the file is treated as a separate document, so the output contains one inner list per line, with the words from the first half of the first paragraph of the Global Warming article on Wikipedia.

[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)]]

The output shows that words like "of", "the", "by", and "and" occur twice in some of the lines.
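Keep in mind that these counts are per line, not for the file as a whole. If you want totals across all lines, you can sum the per-line bags. A minimal sketch, using two shortened bags in the same (word, count) form as the output above:

```python
from collections import Counter

# Two shortened per-line bags in the (word, count) form printed above.
word_frequencies = [
    [("of", 1), ("the", 2), ("warming", 1)],
    [("of", 2), ("the", 1), ("by", 2)],
]

totals = Counter()
for line_bag in word_frequencies:
    for word, count in line_bag:
        totals[word] += count

print(dict(totals))
```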

Similarly, you can create a bag of words corpus using multiple text files, as shown below:

import os
from gensim import corpora
from gensim.utils import simple_preprocess

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            for sentence in open(os.path.join(self.dir_path, file_name), encoding='utf-8'):
                yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in ReturnTokens(path_to_text_directory)]
word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

The output of the script above looks like this:

[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)], [('climate', 1), ('global', 1), ('in', 1), ('the', 2), ('warming', 1), ('and', 1), ('change', 1), ('commonly', 1), ('modern', 1), ('are', 1), ('context', 1), ('interchangeably', 1), ('terms', 1), ('used', 1)], [('climate', 1), ('global', 1), ('warming', 1), ('and', 2), ('change', 1), ('effects', 1), ('to', 1), ('as', 1), ('both', 1), ('but', 1), ('changes', 1), ('includes', 1), ('its', 1), ('precipitation', 1), ('such', 1)], [('in', 1), ('of', 1), ('temperature', 1), ('the', 3), ('warming', 1), ('by', 1), ('observed', 1), ('since', 1), ('are', 1), ('changes', 1), ('differ', 1), ('impacts', 1), ('instrumental', 1), ('many', 1), ('record', 1), ('region', 1), ('that', 1), ('unprecedented', 1)], [('climate', 1), ('in', 1), ('of', 2), ('and', 2), ('change', 1), ('to', 1), ('historical', 1), ('millions', 1), ('over', 1), ('paleoclimate', 1), ('proxy', 1), ('records', 1), ('thousands', 1), ('years', 1)]]

Creating TF-IDF Corpus

The bag of words approach works fine for converting text to numbers. However, it has one drawback. It assigns a score to a word based only on its occurrence in a particular document. It doesn't take into account the fact that the word might also occur frequently in other documents, in which case it tells us little about this particular document. TF-IDF resolves this issue by multiplying the term frequency of a word by its inverse document frequency.

The term frequency is calculated as:

Term frequency = (Frequency of the word in a document)/(Total words in the document)

And the Inverse Document Frequency is calculated as:

IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
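The two formulas can be written out in plain Python as follows. This is a sketch under two assumptions: the natural logarithm is used (the formula above does not fix a base), and the three short sentences are the example used in this section. Gensim's TfidfModel can apply different weighting and normalization schemes, so its values will generally not match these numbers:

```python
import math

documents = [
    "I like to play Football".split(),
    "Football is the best game".split(),
    "Which game do you like to play ?".split(),
]

def term_frequency(word, document):
    """Frequency of the word in a document divided by the total words in the document."""
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    """Log of total documents divided by the number of documents containing the word."""
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

# "Football" appears in two documents, "I" in only one, so "I" scores higher.
print(round(tf_idf("Football", documents[0], documents), 3))
print(round(tf_idf("I", documents[0], documents), 3))
```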

Using the Gensim library, we can easily create a TF-IDF corpus:

import gensim
from gensim import corpora
from pprint import pprint

text = ["I like to play Football",
       "Football is the best game",
       "Which game do you like to play ?"]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

from gensim import models
import numpy as np

tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')

for sent in tfidf[gensim_corpus]:
    print([[gensim_dictionary[id], np.around(frequency, decimals=2)] for id, frequency in sent])

To find the TF-IDF values, we can use the TfidfModel class from the models module of the Gensim library. We simply have to pass the bag of words corpus as a parameter to the constructor of the TfidfModel class. The smartirs='ntc' argument selects the weighting scheme. In the output, you will see all of the words in the three sentences, along with their TF-IDF values:

[['Football', 0.3], ['I', 0.8], ['like', 0.3], ['play', 0.3], ['to', 0.3]]
[['Football', 0.2], ['best', 0.55], ['game', 0.2], ['is', 0.55], ['the', 0.55]]
[['like', 0.17], ['play', 0.17], ['to', 0.17], ['game', 0.17], ['?', 0.47], ['Which', 0.47], ['do', 0.47], ['you', 0.47]]
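If you are wondering where these numbers come from, the 'ntc' setting appears to correspond to the raw term count multiplied by a base-2 IDF, with each document vector then scaled to unit length (cosine normalization). The sketch below is my own reconstruction under that assumption, not code taken from Gensim; the function name ntc_weights is hypothetical:

```python
import math

documents = [
    "I like to play Football".split(),
    "Football is the best game".split(),
    "Which game do you like to play ?".split(),
]

# Number of documents in which each word occurs.
n_docs = len(documents)
doc_freq = {}
for document in documents:
    for word in set(document):
        doc_freq[word] = doc_freq.get(word, 0) + 1

def ntc_weights(document):
    """Raw count times log2(N / df), then divided by the vector's Euclidean length."""
    raw = {
        word: document.count(word) * math.log2(n_docs / doc_freq[word])
        for word in sorted(set(document))
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {word: round(weight / norm, 2) for word, weight in raw.items()}

for document in documents:
    print(ntc_weights(document))
```

Under these assumptions the sketch yields the same rounded values as the output above, for example 0.8 for "I" and 0.3 for "Football" in the first sentence.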

Downloading Built-In Gensim Models and Datasets

Gensim comes with a variety of built-in datasets and word embedding models that can be directly used.

To download a built-in model or dataset, we can use the downloader module from the gensim library. We can then call the load function of the downloader module to download the desired package. Look at the following code:

import gensim.downloader as api

w2v_embedding = api.load("glove-wiki-gigaword-100")

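With the commands above, we download the "glove-wiki-gigaword-100" word embedding model, a set of 100-dimensional GloVe vectors trained on Wikipedia and Gigaword text. Despite the variable name w2v_embedding, these are GloVe rather than word2vec vectors, but they are queried in the same way. Let's try to find the words similar to "toyota" using our word embedding model. Use the following code to do so: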

w2v_embedding.most_similar('toyota')

In the output, you should see the following results:

[('honda', 0.8739858865737915),
 ('nissan', 0.8108116984367371),
 ('automaker', 0.7918163537979126),
 ('mazda', 0.7687169313430786),
 ('bmw', 0.7616022825241089),
 ('ford', 0.7547588348388672),
 ('motors', 0.7539199590682983),
 ('volkswagen', 0.7176680564880371),
 ('prius', 0.7156582474708557),
 ('chrysler', 0.7085398435592651)]

You can see that all the results are very relevant to the word "toyota". The number next to each word is its cosine similarity to "toyota". A higher similarity means that the word is more closely related.
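The scores returned by most_similar are cosine similarities between word vectors. To show how such a score is computed, here is a small self-contained sketch. The three-dimensional vectors are invented purely for illustration (the real model uses 100 dimensions), so the resulting numbers have nothing to do with the output above:

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example.
toyota = [0.9, 0.8, 0.1]
honda = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

# Vectors pointing in a similar direction score close to 1.
print(cosine_similarity(toyota, honda) > cosine_similarity(toyota, banana))
```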

Conclusion

The Gensim library is one of the most popular Python libraries for NLP. In this article, we briefly explored how the Gensim library can be used to perform tasks like dictionary and corpus creation. We also saw how to download built-in Gensim models. In our next article, we will see how to perform topic modeling via the Gensim library.

About Usman Malik
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life