This is the 21st article in my series of articles on Python for NLP. In the previous article, I explained how to use Facebook's FastText library to find semantic similarity and to perform text classification. In this article, you will see how to generate text via a deep learning technique in Python using the Keras library.
Text generation is one of the state-of-the-art applications of NLP. Deep learning techniques are being used for a variety of text generation tasks such as writing poetry, generating scripts for movies, and even composing music. In this article, however, we will see a very simple example of text generation where, given an input sequence of words, we will predict the next word. We will use the raw text of Shakespeare's famous play "Macbeth" and train a model to predict the next word given a sequence of input words.
After completing this article, you will be able to perform text generation using the dataset of your choice. So, let's begin without further ado.
Importing Libraries and Dataset
The first step is to import the libraries required to execute the scripts in this article, along with the dataset. The following code imports the required libraries:
import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.utils import to_categorical
from random import randint
import re
The next step is to download the dataset. We will use Python's NLTK library to download it. We will be using the Gutenberg corpus that ships with NLTK, a small selection of 18 texts from the Project Gutenberg archive that includes Shakespeare's "Macbeth".
The following script downloads the Gutenberg dataset and prints the names of all the files in the dataset.
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg as gut
print(gut.fileids())
You should see the following output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
The file shakespeare-macbeth.txt contains the raw text of the play "Macbeth". To read the text from this file, the raw method of the gutenberg corpus reader can be used:
macbeth_text = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')
Let's print the first 500 characters from our dataset:
print(macbeth_text[:500])
Here is the output:
[The Tragedie of Macbeth by William Shakespeare 1603]
Actus Primus. Scoena Prima.
Thunder and Lightning. Enter three Witches.
1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
2. When the Hurley-burley's done,
When the Battaile's lost, and wonne
3. That will be ere the set of Sunne
1. Where the place?
2. Vpon the Heath
3. There to meet with Macbeth
1. I come, Gray-Malkin
All. Padock calls anon: faire is foule, and foule is faire,
Houer through
You can see that the text contains many special characters and numbers. The next step is to clean the dataset.
Data Preprocessing
To remove punctuation and special characters, we will define a function named preprocess_text():
def preprocess_text(sen):
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Remove single characters left over from the previous step
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Remove multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence.lower()
The preprocess_text function accepts a text string as a parameter and returns the cleaned text string in lower case.
Let's now clean our text and again print the first 500 characters:
macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]
Here is the output:
the tragedie of macbeth by william shakespeare actus primus scoena prima thunder and lightning enter three witches when shall we three meet againe in thunder lightning or in raine when the hurley burley done when the battaile lost and wonne that will be ere the set of sunne where the place vpon the heath there to meet with macbeth come gray malkin all padock calls anon faire is foule and foule is faire houer through the fogge and filthie ayre exeunt scena secunda alarum within enter king malcom
Converting Words to Numbers
Deep learning models are based on statistical algorithms, so in order to work with them we need to convert words to numbers.
In this article, we will use a very simple approach in which each word is converted to a single integer. Before we can convert words to integers, we need to tokenize our text into individual words. To do so, the word_tokenize() method from the nltk.tokenize module can be used.
The following script tokenizes the text in our dataset and then prints the total number of words in the dataset, as well as the total number of unique words in the dataset:
from nltk.tokenize import word_tokenize
macbeth_text_words = (word_tokenize(macbeth_text))
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))
print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)
The output looks like this:
Total Words: 17250
Unique Words: 3436
Our text has 17250 words in total, of which 3436 are unique. To convert the tokenized words to numbers, the Tokenizer class from the keras.preprocessing.text module can be used. You need to call its fit_on_texts method and pass it the list of words. A dictionary will be created in which the keys are words and the values are the corresponding integer indexes.
Look at the following script:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=3437)
tokenizer.fit_on_texts(macbeth_text_words)
To access the dictionary that maps words to their corresponding indexes, the word_index attribute of the tokenizer object can be used:
vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index
If you check the length of the dictionary, you will see that it contains 3436 entries, which is the number of unique words in our dataset.
Let's now print the word at index 500 of the tokenized text, along with its integer value from the word_2_index dictionary:
print(macbeth_text_words[500])
print(word_2_index[macbeth_text_words[500]])
Here is the output:
comparisons
1456
Here the word "comparisons" is assigned the integer value of 1456.
Modifying the Shape of the Data
Text generation falls in the category of many-to-one sequence problems, since the input is a sequence of words and the output is a single word. We will be using a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, to create our text generation model. An LSTM accepts data in a 3-dimensional format (number of samples, number of time-steps, features per time-step). Since the output will be a single word, the shape of the output will be 2-dimensional (number of samples, number of unique words in the corpus).
The following script creates the input sequences and the corresponding outputs:
input_sequence = []
output_words = []
input_seq_length = 100
for i in range(0, n_words - input_seq_length):
    in_seq = macbeth_text_words[i:i + input_seq_length]
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])
In the script above, we declare two empty lists, input_sequence and output_words. The input_seq_length is set to 100, which means that each input sequence will consist of 100 words. Next, we execute a loop: in the first iteration, the integer values of the first 100 words of the text are appended to the input_sequence list, and the 101st word is appended to the output_words list. In the second iteration, the sequence of words starting at the 2nd word and ending at the 101st word is stored in the input_sequence list, the 102nd word is stored in the output_words list, and so on. A total of 17150 input sequences will be generated, since there are 17250 words in the dataset (100 less than the total number of words).
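As a quick sanity check (this print is not part of the original script), you can confirm that the number of generated sequences matches this count:
print(len(input_sequence))  # 17150
print(len(output_words))    # 17150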
Let's now print the first sequence in the input_sequence list:
print(input_sequence[0])
Output:
[1, 869, 4, 40, 60, 1358, 1359, 408, 1360, 1361, 409, 265, 2, 870, 31, 190, 291, 76, 36, 30, 190, 327, 128, 8, 265, 870, 83, 8, 1362, 76, 1, 1363, 1364, 86, 76, 1, 1365, 354, 2, 871, 5, 34, 14, 168, 1, 292, 4, 649, 77, 1, 220, 41, 1, 872, 53, 3, 327, 12, 40, 52, 1366, 1367, 25, 1368, 873, 328, 355, 9, 410, 2, 410, 9, 355, 1369, 356, 1, 1370, 2, 874, 169, 103, 127, 411, 357, 149, 31, 51, 1371, 329, 107, 12, 358, 412, 875, 1372, 51, 20, 170, 92, 9]
Let's normalize our input sequences by dividing the integers in the sequences by the vocabulary size, so that every value lies between 0 and 1. The following script also converts the output into a 2-dimensional one-hot format.
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)
y = to_categorical(output_words)
The following script prints the shape of the inputs and the corresponding outputs.
print("X shape:", X.shape)
print("y shape:", y.shape)
Output:
X shape: (17150, 100, 1)
y shape: (17150, 3437)
Training the Model
The next step is to train our model. There is no hard and fast rule as to how many layers and neurons should be used; the layer and neuron sizes below were chosen somewhat arbitrarily. You can play around with these hyperparameters to see if you can get better results.
We will create three LSTM layers with 800 neurons each. A final dense layer with one neuron per word in the vocabulary and a softmax activation is added to predict the next word, as shown below:
model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')
Since the output word can be one of 3436 unique words, our problem is a multi-class classification problem, hence the categorical_crossentropy loss function is used. In the case of binary classification, the binary_crossentropy function would be used instead. Once you execute the above script, you should see the model summary:
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 800) 2566400
_________________________________________________________________
lstm_2 (LSTM) (None, 100, 800) 5123200
_________________________________________________________________
lstm_3 (LSTM) (None, 800) 5123200
_________________________________________________________________
dense_1 (Dense) (None, 3437) 2753037
=================================================================
Total params: 15,565,837
Trainable params: 15,565,837
Non-trainable params: 0
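Note that Dropout was imported at the beginning of the article but is not used in this baseline model. If you find the model overfitting, one option is to add dropout between the LSTM layers. The following is a minimal sketch of that variation (the rate of 0.2 is an arbitrary example value, not part of the original model):
# Variation of the model with dropout between the LSTM layers
model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(800, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')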
To train the model, we can simply use the fit() method:
model.fit(X, y, batch_size=64, epochs=10, verbose=1)
Here again, you can play around with different values for batch_size and epochs. The model can take quite some time to train.
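Since training can take a while, you may want to save the trained model to disk and reload it later, which is presumably why load_model was imported at the beginning of the article. A minimal sketch (the file name macbeth_generator.h5 is just an example):
# Save the trained model to disk (file name is an arbitrary example)
model.save('macbeth_generator.h5')

# Later, load the trained model back without retraining
model = load_model('macbeth_generator.h5')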
Making Predictions
To make predictions, we will randomly select a sequence from the input_sequence list, convert it into a 3-dimensional shape, and then pass it to the predict() method of the trained model. The model returns an array of probabilities, one per word in the vocabulary; the index with the highest probability corresponds to the predicted next word. That index is then looked up in an index_2_word dictionary, in which word indexes are the keys and the corresponding words are the values.
The following script randomly selects a sequence of integers and then prints the corresponding sequence of words:
random_seq_index = np.random.randint(0, len(input_sequence)-1)
random_seq = input_sequence[random_seq_index]
index_2_word = dict(map(reversed, word_2_index.items()))
word_sequence = [index_2_word[value] for value in random_seq]
print(' '.join(word_sequence))
For the script in this article, the following sequence was randomly selected. The sequence selected for you will most likely be different from this one:
amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane
In the above script, the index_2_word dictionary is created by simply reversing the word_2_index dictionary, that is, by swapping its keys and values.
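An equivalent and arguably more explicit way to build the reversed dictionary, shown here only as an alternative to the map(reversed, ...) call above, is a dictionary comprehension:
# Equivalent alternative: swap keys and values explicitly
index_2_word = {index: word for word, index in word_2_index.items()}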
Next, we will predict and print the 100 words that follow the above sequence of words:
for i in range(100):
    # Reshape the current sequence to (1, seq_length, 1) and normalize it,
    # exactly as the training data was prepared
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)

    # Predict a probability for every word in the vocabulary and pick the most likely one
    predicted_probs = model.predict(int_sample, verbose=0)
    predicted_word_id = np.argmax(predicted_probs)

    # Append the predicted word, then slide the input window one step forward
    word_sequence.append(index_2_word[predicted_word_id])
    random_seq.append(predicted_word_id)
    random_seq = random_seq[1:len(random_seq)]
The word_sequence variable now contains our input sequence of words along with the next 100 predicted words. Since word_sequence is a list of words, we can simply join them to get the final output sequence, as shown below:
final_output = ""
for word in word_sequence:
final_output = final_output + " " + word
print(final_output)
Here is the final output:
amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and
The output doesn't look very good yet; it seems that our model is only learning to predict the last word, i.e. "and". However, you get the idea of how to create a text generation model with Keras. To improve the results, I have the following recommendations for you:
- Change the hyperparameters, including the size and number of the LSTM layers and the number of epochs, to see if you get better results.
- Try removing stop words such as is, am, and are from the training set so that the model generates words other than stop words (although this will depend on the type of application); a minimal sketch of this follows after the list.
- Create a character-level text generation model that predicts the next N characters.
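For the stop word suggestion above, a minimal sketch using NLTK's English stop word list might look like the following. This is my own example, not part of the original pipeline, and it would be applied before fitting the tokenizer and building the training sequences:
# Sketch: remove NLTK English stop words from the tokenized text
# before fitting the tokenizer and building the training sequences
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
macbeth_text_words = [word for word in macbeth_text_words if word not in stop_words]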
To practice further, I would recommend that you try to develop a text generation model using the other texts from the Gutenberg corpus.
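For example, any of the file IDs printed earlier can be plugged into the same pipeline; shakespeare-hamlet.txt is just one option:
# Example: reuse the same preprocessing pipeline with another Gutenberg text
hamlet_text = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
hamlet_text = preprocess_text(hamlet_text)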
Conclusion
In this article, we saw how to create a text generation model using deep learning with Python's Keras library. Though the model developed in this article is not perfect, the article conveys the idea of how to generate text with deep learning.