This is the 10th article in my series of articles on Python for NLP. In my previous article, I explained how the StanfordCoreNLP library can be used to perform different NLP tasks.
In this article, we will explore the Gensim library, which is another extremely useful NLP library for Python. Gensim was primarily developed for topic modeling. However, it now supports a variety of other NLP tasks such as converting words to vectors (word2vec), document to vectors (doc2vec), finding text similarity, and text summarization.
In this article and the next article of the series, we will see how the Gensim library is used to perform these tasks.
Installing Gensim
If you use the pip installer to install your Python libraries, you can use the following command to download the Gensim library:
$ pip install gensim
Alternatively, if you use the Anaconda distribution of Python, you can execute the following command to install the Gensim library:
$ conda install -c anaconda gensim
Let's now see how we can perform different NLP tasks using the Gensim library.
Creating Dictionaries
Statistical algorithms work with numbers; natural languages, however, contain data in the form of text. Therefore, a mechanism is needed to convert words to numbers. Similarly, after applying different types of processing to the numbers, we need a way to convert the numbers back to text.
One way to achieve this type of functionality is to create a dictionary that assigns a numeric ID to every unique word in the document. The dictionary can then be used to find the numeric equivalent of a word and vice versa.
Creating Dictionaries using In-Memory Objects
It is super easy to create dictionaries that map words to IDs using Python's Gensim library. Look at the following script:
import gensim
from gensim import corpora
from pprint import pprint
text = ["""In computer science, artificial intelligence (AI),
sometimes called machine intelligence, is intelligence
demonstrated by machines, in contrast to the natural intelligence
displayed by humans and animals. Computer science defines
AI research as the study of intelligent agents: any device that
perceives its environment and takes actions that maximize its chance
of successfully achieving its goals."""]
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)
print("The dictionary has: " +str(len(gensim_dictionary)) + " tokens")
for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')
In the script above, we first import the gensim library along with the corpora module from the library. Next, we have some text (the first part of the first paragraph of the Wikipedia article on Artificial Intelligence) stored in the text variable.
To create a dictionary, we need a list of words from our text (also known as tokens). In the following line, we split our document into sentences and then the sentences into words.
tokens = [[token for token in sentence.split()] for sentence in text]
We are now ready to create our dictionary. To do so, we can use the Dictionary object of the corpora module and pass it the list of tokens.
Finally, to print the contents of the newly created dictionary, we can use the token2id attribute of the Dictionary class. The output of the script above looks like this:
The dictionary has: 46 tokens
(AI), 0
AI 1
Computer 2
In 3
achieving 4
actions 5
agents: 6
and 7
animals. 8
any 9
artificial 10
as 11
by 12
called 13
chance 14
computer 15
contrast 16
defines 17
demonstrated 18
device 19
displayed 20
environment 21
goals. 22
humans 23
in 24
intelligence 25
intelligence, 26
intelligent 27
is 28
its 29
machine 30
machines, 31
maximize 32
natural 33
of 34
perceives 35
research 36
science 37
science, 38
sometimes 39
study 40
successfully 41
takes 42
that 43
the 44
to 45
The output shows each unique word in our text along with the numeric ID that the word has been assigned. The word or token is the key of the dictionary and the ID is the value. You can also look up the ID assigned to an individual word using the following script:
print(gensim_dictionary.token2id["study"])
In the script above, we pass the word "study" as the key to our dictionary. In the output, you should see the corresponding value, i.e., the ID of the word "study", which is 40.
Similarly, you can use the following script to find the key, or word, for a specific ID:
print(list(gensim_dictionary.token2id.keys())[list(gensim_dictionary.token2id.values()).index(40)])
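The one-liner above works, but it is worth knowing that a Gensim Dictionary can also be indexed by ID directly, which achieves the same reverse lookup more simply:

# A Dictionary can be indexed by ID; the first such access builds the
# internal id2token mapping, after which lookups by ID are cheap
print(gensim_dictionary[40])  # prints: study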
To print the tokens and their corresponding IDs we used a for-loop. However, you can directly print the tokens and their IDs by printing the dictionary, as shown here:
print(gensim_dictionary.token2id)
The output is as follows:
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45}
The output might not be as clear as the one printed using the loop, but it still serves its purpose.
Let's now see how we can add more tokens to an existing dictionary using a new document. Look at the following script:
text = ["""Colloquially, the term "artificial intelligence" is used to
describe machines that mimic "cognitive" functions that humans
associate with other human minds, such as "learning" and "problem solving"""]
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary.add_documents(tokens)
print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)
In the script above, we have a new document that contains the second part of the first paragraph of the Wikipedia article on Artificial Intelligence. We split the text into tokens and then simply call the add_documents method to add the tokens to our existing dictionary. Finally, we print the updated dictionary on the console.
The output of the code looks like this:
The dictionary has: 65 tokens
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45, '"artificial': 46, '"cognitive"': 47, '"learning"': 48, '"problem': 49, 'Colloquially,': 50, 'associate': 51, 'describe': 52, 'functions': 53, 'human': 54, 'intelligence"': 55, 'machines': 56, 'mimic': 57, 'minds,': 58, 'other': 59, 'solving': 60, 'such': 61, 'term': 62, 'used': 63, 'with': 64}
You can see that we now have 65 tokens in our dictionary, up from the 46 tokens we had previously.
Creating Dictionaries using Text Files
In the previous section, the text we worked with was held in memory. What if we want to create a dictionary by reading a text file from the hard drive? To do so, we can use the simple_preprocess method from the gensim.utils module. The advantage of using this method is that it reads the text file line by line and returns the tokens for each line, so you don't have to load the complete text file into memory to create a dictionary.
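Before moving on, here is a quick in-memory demonstration of what simple_preprocess returns. Note that, unlike the plain split() we used earlier, it lowercases the tokens and strips punctuation, which is why the dictionaries in this section will look slightly different:

from gensim.utils import simple_preprocess

# Tokenizes, lowercases, and drops punctuation; deacc=True also removes accents
print(simple_preprocess("In computer science, artificial intelligence (AI)!", deacc=True))
# Output: ['in', 'computer', 'science', 'artificial', 'intelligence', 'ai']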
Before executing the next example, create a file "file1.txt" and add the following text to the file (this is the first half of the first paragraph of the Wikipedia article on Global Warming).
Global warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming. Though earlier geological periods also experienced episodes of warming, the term commonly refers to the observed and continuing increase in average air and ocean temperatures since 1900 caused mainly by emissions of greenhouse gasses in the modern industrial economy.
Now let's create a dictionary that will contain tokens from the text file "file1.txt":
from gensim import corpora
from gensim.utils import simple_preprocess

gensim_dictionary = corpora.Dictionary(simple_preprocess(sentence, deacc=True) for sentence in open(r'E:\text files\file1.txt', encoding='utf-8'))
print(gensim_dictionary.token2id)
In the script above, we read the text file "file1.txt" line by line using the simple_preprocess method, which returns the tokens in each line of the document. The tokens are then used to create the dictionary. In the output, you should see the tokens and their corresponding IDs, as shown below:
{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47}
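As a side note, once a dictionary has been built you don't have to recreate it from the raw text every time; it can be saved to disk and loaded back later. A minimal sketch (the file name here is just an example):

# Persist the dictionary and load it back in a later session
gensim_dictionary.save('gensim_dictionary.dict')
loaded_dictionary = corpora.Dictionary.load('gensim_dictionary.dict')

print(loaded_dictionary.token2id == gensim_dictionary.token2id)  # True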
Similarly, we can create a dictionary by reading multiple text files. Create another file "file2.txt" and add the following text to the file (the second part of the first paragraph of the Wikipedia article on Global Warming):
In the modern context the terms global warming and climate change are commonly used interchangeably, but climate change includes both global warming and its effects, such as changes to precipitation and impacts that differ by region.[7][8] Many of the observed warming changes since the 1950s are unprecedented in the instrumental temperature record, and in historical and paleoclimate proxy records of climate change over thousands to millions of years.
Save the "file2.txt" in the same directory as the "file1.txt".
The following script reads both the files and then creates a dictionary based on the text in the two files:
from gensim import corpora
from gensim.utils import simple_preprocess
import os

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Stream tokens from every file in the directory, line by line
        for file_name in os.listdir(self.dir_path):
            for sentence in open(os.path.join(self.dir_path, file_name), encoding='utf-8'):
                yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"

gensim_dictionary = corpora.Dictionary(ReturnTokens(path_to_text_directory))
print(gensim_dictionary.token2id)
In the script above, we define a ReturnTokens class whose constructor takes the path of the directory containing "file1.txt" and "file2.txt" as its only parameter. Inside the __iter__ method, we iterate through all the files in the directory and read each file line by line. The simple_preprocess method creates tokens for each line, and those tokens are handed back to the caller via the yield keyword.
In the output, you should see the following tokens along with their IDs:
{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47, 'are': 48, 'context': 49, 'interchangeably': 50, 'terms': 51, 'used': 52, 'as': 53, 'both': 54, 'but': 55, 'changes': 56, 'includes': 57, 'its': 58, 'precipitation': 59, 'such': 60, 'differ': 61, 'impacts': 62, 'instrumental': 63, 'many': 64, 'record': 65, 'region': 66, 'that': 67, 'unprecedented': 68, 'historical': 69, 'millions': 70, 'over': 71, 'paleoclimate': 72, 'proxy': 73, 'records': 74, 'thousands': 75, 'years': 76}
Creating Bag of Words Corpus
Dictionaries contain mappings between words and their corresponding numeric values. The Bag of Words corpora in the Gensim library are based on dictionaries and contain the ID of each word along with the frequency of occurrence of the word.
Creating Bag of Words Corpus from In-Memory Objects
Look at the following script:
import gensim
from gensim import corpora
from pprint import pprint
text = ["""In computer science, artificial intelligence (AI),
sometimes called machine intelligence, is intelligence
demonstrated by machines, in contrast to the natural intelligence
displayed by humans and animals. Computer science defines
AI research as the study of intelligent agents: any device that
perceives its environment and takes actions that maximize its chance
of successfully achieving its goals."""]
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in tokens]
print(gensim_corpus)
In the script above, we have text that we split into tokens. Next, we initialize a Dictionary object from the corpora module. The object contains a method doc2bow, which essentially performs two tasks:
- For each word in the document, it counts the number of occurrences and pairs the word's dictionary ID with that count
- With allow_update=True, any word that is not yet in the dictionary is first added to it and assigned a new ID, as the short example below shows
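Here, starting from an empty dictionary, "cat" is assigned ID 0 and "sat" ID 1, and "cat" is counted twice:

from gensim import corpora

toy_dictionary = corpora.Dictionary()
# allow_update=True adds the unseen words to the dictionary on the fly
print(toy_dictionary.doc2bow("cat sat cat".split(), allow_update=True))
# Output: [(0, 2), (1, 1)]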
The output of the above script looks like this:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1)]]
The output might not make sense to you at first, so let me explain it. The first tuple, (0, 1), means that the word with ID 0 occurred once in the text. Similarly, (25, 3) means that the word with ID 25 occurred three times in the document.
Let's now print the word and the frequency count to make things clear. Add the following lines of code at the end of the previous script:
word_frequencies = [[(gensim_dictionary[word_id], frequency) for word_id, frequency in document] for document in gensim_corpus]
print(word_frequencies)
The output looks like this:
[[('(AI),', 1), ('AI', 1), ('Computer', 1), ('In', 1), ('achieving', 1), ('actions', 1), ('agents:', 1), ('and', 2), ('animals.', 1), ('any', 1), ('artificial', 1), ('as', 1), ('by', 2), ('called', 1), ('chance', 1), ('computer', 1), ('contrast', 1), ('defines', 1), ('demonstrated', 1), ('device', 1), ('displayed', 1), ('environment', 1), ('goals.', 1), ('humans', 1), ('in', 1), ('intelligence', 3), ('intelligence,', 1), ('intelligent', 1), ('is', 1), ('its', 3), ('machine', 1), ('machines,', 1), ('maximize', 1), ('natural', 1), ('of', 2), ('perceives', 1), ('research', 1), ('science', 1), ('science,', 1), ('sometimes', 1), ('study', 1), ('successfully', 1), ('takes', 1), ('that', 2), ('the', 2), ('to', 1)]]
From the output, you can see that the word "intelligence" appears three times. Similarly, the word "that" appears twice.
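Like dictionaries, a bag-of-words corpus can also be persisted. Gensim ships with several corpus serialization formats; a minimal sketch using the Matrix Market format (the file name is just an example):

from gensim import corpora

# Serialize the corpus to disk, then stream it back without holding
# everything in memory; note that the counts come back as floats
corpora.MmCorpus.serialize('gensim_corpus.mm', gensim_corpus)
loaded_corpus = corpora.MmCorpus('gensim_corpus.mm')
print(list(loaded_corpus))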
Creating Bag of Words Corpus from Text Files
Just as with dictionaries, we can create a bag-of-words corpus by reading a text file. Look at the following code:
from gensim import corpora
from gensim.utils import simple_preprocess

tokens = [simple_preprocess(sentence, deacc=True) for sentence in open(r'E:\text files\file1.txt', encoding='utf-8')]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in tokens]
word_frequencies = [[(gensim_dictionary[word_id], frequency) for word_id, frequency in document] for document in gensim_corpus]
print(word_frequencies)
In the script above, we created a bag-of-words corpus using "file1.txt". In the output, you should see the words from the first paragraph of the Global Warming article on Wikipedia:
[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)]]
The output shows that words like "of", "the", "by", and "and" occur twice within some of the individual sentences.
Similarly, you can create a bag of words corpus using multiple text files, as shown below:
from gensim import corpora
from gensim.utils import simple_preprocess
import os

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # Stream tokens from every file in the directory, line by line
        for file_name in os.listdir(self.dir_path):
            for sentence in open(os.path.join(self.dir_path, file_name), encoding='utf-8'):
                yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in ReturnTokens(path_to_text_directory)]
word_frequencies = [[(gensim_dictionary[word_id], frequency) for word_id, frequency in document] for document in gensim_corpus]
print(word_frequencies)
The output of the script above looks like this:
[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)], [('climate', 1), ('global', 1), ('in', 1), ('the', 2), ('warming', 1), ('and', 1), ('change', 1), ('commonly', 1), ('modern', 1), ('are', 1), ('context', 1), ('interchangeably', 1), ('terms', 1), ('used', 1)], [('climate', 1), ('global', 1), ('warming', 1), ('and', 2), ('change', 1), ('effects', 1), ('to', 1), ('as', 1), ('both', 1), ('but', 1), ('changes', 1), ('includes', 1), ('its', 1), ('precipitation', 1), ('such', 1)], [('in', 1), ('of', 1), ('temperature', 1), ('the', 3), ('warming', 1), ('by', 1), ('observed', 1), ('since', 1), ('are', 1), ('changes', 1), ('differ', 1), ('impacts', 1), ('instrumental', 1), ('many', 1), ('record', 1), ('region', 1), ('that', 1), ('unprecedented', 1)], [('climate', 1), ('in', 1), ('of', 2), ('and', 2), ('change', 1), ('to', 1), ('historical', 1), ('millions', 1), ('over', 1), ('paleoclimate', 1), ('proxy', 1), ('records', 1), ('thousands', 1), ('years', 1)]]
Creating TF-IDF Corpus
The bag of words approach works fine for converting text to numbers. However, it has one drawback: it scores a word based only on its occurrences in a particular document, ignoring the fact that the word might also occur frequently in other documents. TF-IDF resolves this issue.
The term frequency is calculated as:
Term frequency = (Frequency of the word in a document)/(Total words in the document)
And the Inverse Document Frequency is calculated as:
IDF(word) = log((Total number of documents)/(Number of documents containing the word))
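To make these formulas concrete, here is a minimal pure-Python sketch that computes the values by hand for the three sentences used below. Note that Gensim's TfidfModel uses a slightly different weighting scheme by default (log base 2 and vector normalization, controlled by the smartirs parameter), so its numbers will not match these exactly:

import math

documents = [
    "I like to play Football".lower().split(),
    "Football is the best game".lower().split(),
    "Which game do you like to play ?".lower().split(),
]

def tf_idf(word, document, documents):
    # Term frequency: occurrences of the word / total words in the document
    tf = document.count(word) / len(document)
    # Inverse document frequency: log(total docs / docs containing the word)
    docs_with_word = sum(1 for doc in documents if word in doc)
    idf = math.log(len(documents) / docs_with_word)
    return tf * idf

print(tf_idf("i", documents[0], documents))         # ~0.22, appears in one document
print(tf_idf("football", documents[0], documents))  # ~0.08, appears in two documents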
Using the Gensim library, we can easily create a TF-IDF corpus:
import gensim
from gensim import corpora
from pprint import pprint
text = ["I like to play Football",
"Football is the best game",
"Which game do you like to play ?"]
tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in tokens]
from gensim import models
import numpy as np
tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')
for sent in tfidf[gensim_corpus]:
    print([[gensim_dictionary[word_id], np.around(frequency, decimals=2)] for word_id, frequency in sent])
To find the TF-IDF values, we can use the TfidfModel class from the models module of the Gensim library. We simply have to pass the bag-of-words corpus as a parameter to the constructor of the TfidfModel class. (The smartirs='ntc' argument selects raw term frequency, inverse document frequency weighting, and cosine normalization, in SMART notation.) In the output, you will see all of the words in the three sentences, along with their TF-IDF values:
[['Football', 0.3], ['I', 0.8], ['like', 0.3], ['play', 0.3], ['to', 0.3]]
[['Football', 0.2], ['best', 0.55], ['game', 0.2], ['is', 0.55], ['the', 0.55]]
[['like', 0.17], ['play', 0.17], ['to', 0.17], ['game', 0.17], ['?', 0.47], ['Which', 0.47], ['do', 0.47], ['you', 0.47]]
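Once trained, the model can also be applied to a new document that was not part of the original corpus; words the model has never seen are simply ignored:

# Convert a new sentence to bag-of-words using the existing dictionary,
# then weight it with the trained TF-IDF model
new_doc = gensim_dictionary.doc2bow("I like Football".split())
print([[gensim_dictionary[word_id], np.around(score, decimals=2)] for word_id, score in tfidf[new_doc]])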
Downloading Built-In Gensim Models and Datasets
Gensim comes with a variety of built-in datasets and word embedding models that can be directly used.
To download a built-in model or dataset, we can use the downloader module from the gensim library and then call its load method to download the desired package. Look at the following code:
import gensim.downloader as api
w2v_embedding = api.load("glove-wiki-gigaword-100")
With the commands above, we download the "glove-wiki-gigaword-100" word embedding model, which consists of 100-dimensional GloVe vectors trained on Wikipedia and Gigaword text. Let's try to find the words most similar to "toyota" using our word embedding model. Use the following code to do so:
w2v_embedding.most_similar('toyota')
In the output, you should see the following results:
[('honda', 0.8739858865737915),
('nissan', 0.8108116984367371),
('automaker', 0.7918163537979126),
('mazda', 0.7687169313430786),
('bmw', 0.7616022825241089),
('ford', 0.7547588348388672),
('motors', 0.7539199590682983),
('volkswagen', 0.7176680564880371),
('prius', 0.7156582474708557),
('chrysler', 0.7085398435592651)]
You can see that all the results are highly relevant to the word "toyota". The floating-point number next to each word is the cosine similarity score; the higher the score, the more similar the word.
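Beyond most_similar, the loaded embedding supports several other handy queries, for example pairwise similarity and the classic word analogy via vector arithmetic:

# Cosine similarity between two words
print(w2v_embedding.similarity('car', 'truck'))

# king - man + woman should land near "queen"
print(w2v_embedding.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))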
Conclusion
The Gensim library is one of the most popular Python libraries for NLP. In this article, we briefly explored how Gensim can be used for tasks like dictionary, bag-of-words corpus, and TF-IDF corpus creation. We also saw how to download Gensim's built-in models and datasets. In the next article of the series, we will see how to perform topic modeling with the Gensim library.