This is the 21st article in my series of articles on Python for NLP. In the previous article, I explained how to use Facebook's FastText library for finding semantic similarity and performing text classification. In this article, you will see how to generate text using a deep learning technique in Python with the Keras library.
Text generation is one of the state-of-the-art applications of NLP. Deep learning techniques are being used for a variety of text generation tasks such as writing poetry, generating scripts for movies, and even composing music. However, in this article we will see a very simple example of text generation where, given an input string of words, we will predict the next word. We will use the raw text of Shakespeare's famous play "Macbeth" to predict the next word given a sequence of input words.
After completing this article, you will be able to perform text generation using the dataset of your choice. So, let's begin without further ado.
Importing Libraries and Dataset
The first step is to import the libraries required to execute the scripts in this article, along with the dataset. The following code imports the required libraries:
import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.utils import to_categorical
from random import randint
import re
The next step is to download the dataset. We will use Python's NLTK library to download it. We will be using NLTK's Gutenberg corpus, a small selection of texts from the Project Gutenberg archive that includes Shakespeare's "Macbeth".
The following script downloads the Gutenberg dataset and prints the names of all the files in the dataset.
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg as gut
print(gut.fileids())
You should see the following output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
The file shakespeare-macbeth.txt contains the raw text of the play "Macbeth". To read the text from this file, the raw method of the gutenberg corpus reader can be used:
macbeth_text = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')
Let's print the first 500 characters from our dataset:
print(macbeth_text[:500])
Here is the output:
[The Tragedie of Macbeth by William Shakespeare 1603] Actus Primus. Scoena Prima. Thunder and Lightning. Enter three Witches. 1. When shall we three meet againe? In Thunder, Lightning, or in Raine? 2. When the Hurley-burley's done, When the Battaile's lost, and wonne 3. That will be ere the set of Sunne 1. Where the place? 2. Vpon the Heath 3. There to meet with Macbeth 1. I come, Gray-Malkin All. Padock calls anon: faire is foule, and foule is faire, Houer through
You can see that the text contains many special characters and numbers. The next step is to clean the dataset.
To remove the punctuation and special characters, we will define a function named preprocess_text:
def preprocess_text(sen):
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence.lower()
The preprocess_text function accepts a text string as a parameter and returns a cleaned text string in lower case.
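To see what each substitution does, here is a small sketch that applies the same three regular expressions to one line from the raw text. The sample string and the variable names sample and cleaned are illustrative assumptions, not part of the original script.

```python
import re

def preprocess_text(sen):
    # Replace every character that is not a letter with a space
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    # Drop single letters that are surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse runs of whitespace into one space
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence.lower()

# Illustrative line taken from the opening of the raw text
sample = "1. When the Hurley-burley's done,"
cleaned = preprocess_text(sample)
print(cleaned.split())
```

The lone "s" left behind once the apostrophe is replaced is removed by the single-character rule, which is why "Hurley-burley's" shows up as "hurley burley" in the cleaned text. The function can leave a leading or trailing space, so the sketch splits the result before printing it.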
Let's now clean our text and again print the first 500 characters:
macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]
Here is the output:
the tragedie of macbeth by william shakespeare actus primus scoena prima thunder and lightning enter three witches when shall we three meet againe in thunder lightning or in raine when the hurley burley done when the battaile lost and wonne that will be ere the set of sunne where the place vpon the heath there to meet with macbeth come gray malkin all padock calls anon faire is foule and foule is faire houer through the fogge and filthie ayre exeunt scena secunda alarum within enter king malcom
Convert Words to Numbers
Deep learning models are based on statistical algorithms that operate on numbers. Hence, in order to work with deep learning models, we need to convert words to numbers.
In this article, we will be using a very simple approach where words will be converted into single integers. Before we can convert words to integers, we need to tokenize our text into individual words. To do so, the word_tokenize() method from the nltk.tokenize module can be used.
The following script tokenizes the text in our dataset and then prints the total number of words in the dataset, as well as the total number of unique words in the dataset:
from nltk.tokenize import word_tokenize
macbeth_text_words = word_tokenize(macbeth_text)
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))
print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)
The output looks like this:
Total Words: 17250
Unique Words: 3436
Our text has 17250 words in total, out of which 3436 words are unique. To convert tokenized words to numbers, the Tokenizer class from the keras.preprocessing.text module can be used. You need to call the fit_on_texts method and pass it the list of words. A dictionary will be created where the keys are the words and the values are the corresponding integers.
Look at the following script:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=3437)
tokenizer.fit_on_texts(macbeth_text_words)
To access the dictionary that contains words and their corresponding indexes, the word_index attribute of the tokenizer object can be used:
vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index
If you check the length of the dictionary, it will contain 3436 words, which is the total number of unique words in our dataset.
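As a rough mental model of this dictionary, the sketch below builds a word-to-index mapping in plain Python by ranking words by frequency and numbering them from 1. It is an illustrative stand-in and not the Keras implementation; build_word_index and the toy sentence are hypothetical. Index 0 is left unused, which is why vocab_size in the script above is the dictionary length plus one.

```python
from collections import Counter

def build_word_index(words):
    # Rank words by how often they occur; the most frequent word gets index 1
    counts = Counter(words)
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Toy token list (assumption): a cleaned line from the play
toy_words = "when shall we three meet againe in thunder lightning or in raine".split()

toy_index = build_word_index(toy_words)
toy_vocab_size = len(toy_index) + 1  # +1 because index 0 is never assigned

print(toy_index["in"], len(toy_index), toy_vocab_size)
```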
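Let's now print the word at index 500 of the tokenized text along with its integer value from the word_2_index dictionary:
print(macbeth_text_words[500])
print(word_2_index[macbeth_text_words[500]])
Here is the output:
comparisons
1456
Here the word "comparisons" is assigned the integer value of 1456.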
Modifying the Shape of the Data
Text generation falls in the category of many-to-one sequence problems since the input is a sequence of words and the output is a single word. We will be using the Long Short-Term Memory (LSTM) network, which is a type of recurrent neural network, to create our text generation model. An LSTM accepts data in a 3-dimensional format (number of samples, number of time-steps, features per time-step). Since the output will be a single word, the shape of the output will be 2-dimensional (number of samples, number of unique words in the corpus).
The following script modifies the shape of the input sequences and the corresponding outputs.
input_sequence = []
output_words = []
input_seq_length = 100
for i in range(0, n_words - input_seq_length, 1):
    in_seq = macbeth_text_words[i:i + input_seq_length]
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])
In the script above, we declare two empty lists, input_sequence and output_words. The input_seq_length is set to 100, which means that our input sequence will consist of 100 words. Next, we execute a loop where in the first iteration, integer values for the first 100 words from the text are appended to the input_sequence list. The 101st word is appended to the output_words list. During the second iteration, a sequence of words that starts from the 2nd word in the text and ends at the 101st word is stored in the input_sequence list, and the 102nd word is stored in the output_words list, and so on. A total of 17150 input sequences will be generated, which is 100 less than the 17250 total words in the dataset, since every sequence needs 100 input words followed by one output word.
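The same sliding-window logic can be tried on a tiny example to confirm the count of generated sequences. The helper name make_windows, the toy token list, and the window size of 3 are assumptions made for illustration.

```python
def make_windows(tokens, window):
    # Each input is `window` consecutive tokens; the target is the token right after them
    inputs, targets = [], []
    for i in range(len(tokens) - window):
        inputs.append(tokens[i:i + window])
        targets.append(tokens[i + window])
    return inputs, targets

# Toy data (assumption): 10 integer tokens and a window of 3
toy_tokens = list(range(1, 11))
toy_inputs, toy_targets = make_windows(toy_tokens, window=3)

print(len(toy_inputs))                 # number of sequences = 10 - 3
print(toy_inputs[0], toy_targets[0])   # first window and its target
print(toy_inputs[-1], toy_targets[-1]) # last window and its target
```

With 10 tokens and a window of 3 there are 7 input sequences, mirroring the 17250 - 100 = 17150 sequences in the article.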
Let's now print the value of the first sequence in the input_sequence list:
print(input_sequence[0])
[1, 869, 4, 40, 60, 1358, 1359, 408, 1360, 1361, 409, 265, 2, 870, 31, 190, 291, 76, 36, 30, 190, 327, 128, 8, 265, 870, 83, 8, 1362, 76, 1, 1363, 1364, 86, 76, 1, 1365, 354, 2, 871, 5, 34, 14, 168, 1, 292, 4, 649, 77, 1, 220, 41, 1, 872, 53, 3, 327, 12, 40, 52, 1366, 1367, 25, 1368, 873, 328, 355, 9, 410, 2, 410, 9, 355, 1369, 356, 1, 1370, 2, 874, 169, 103, 127, 411, 357, 149, 31, 51, 1371, 329, 107, 12, 358, 412, 875, 1372, 51, 20, 170, 92, 9]
Let's normalize our input sequences by dividing the integers in the sequences by the vocabulary size, which is one more than the largest integer value. The following script also converts the output into a 2-dimensional, one-hot encoded format.
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)
y = to_categorical(output_words)
The following script prints the shape of the inputs and the corresponding outputs.
print("X shape:", X.shape)
print("y shape:", y.shape)
X shape: (17150, 100, 1)
y shape: (17150, 3437)
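The reshaping, scaling, and one-hot encoding can be illustrated on a toy example without NumPy or Keras. The helpers scale_and_reshape and one_hot below are hypothetical names that mimic the effect of the script above using nested lists; the toy sequences and vocabulary size are assumptions.

```python
def scale_and_reshape(sequences, vocab_size):
    # Wrap every integer in its own list so each time-step has one feature,
    # and divide by the vocabulary size so all values fall between 0 and 1
    return [[[value / float(vocab_size)] for value in seq] for seq in sequences]

def one_hot(indices, num_classes):
    # One row per target, with a 1.0 at the target index and 0.0 elsewhere
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows

toy_sequences = [[1, 2, 3], [2, 3, 4]]
toy_outputs = [4, 5]
toy_vocab_size = 6  # largest index (5) plus one

X_toy = scale_and_reshape(toy_sequences, toy_vocab_size)
y_toy = one_hot(toy_outputs, toy_vocab_size)

print((len(X_toy), len(X_toy[0]), len(X_toy[0][0])))  # samples, time-steps, features
print((len(y_toy), len(y_toy[0])))                    # samples, classes
```

In the real data the same logic gives y a width of 3437: the 3436 word indexes plus the unused index 0.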
Training the Model
The next step is to train our model. There is no hard and fast rule as to how many layers and neurons should be used to train the model. We will pick the number of layers and neurons arbitrarily. You can play around with the hyperparameters to see if you can get better results.
We will create three LSTM layers with 800 neurons each. A final dense layer with one neuron per class (3437, the second dimension of y) and a softmax activation will be added to predict the index of the next word, as shown below:
model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')
Since the output word can be one of 3436 unique words, our problem is a multi-class classification problem, hence the categorical_crossentropy loss function is used. In case of binary classification, the binary_crossentropy function is used. Once you execute the above script, you should see the model summary:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 100, 800)          2566400
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 800)          5123200
_________________________________________________________________
lstm_3 (LSTM)                (None, 800)               5123200
_________________________________________________________________
dense_1 (Dense)              (None, 3437)              2753037
=================================================================
Total params: 15,565,837
Trainable params: 15,565,837
Non-trainable params: 0
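The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for a plain LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and for a dense layer; the helper names lstm_params and dense_params are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units

layer_counts = [
    lstm_params(1, 800),      # first LSTM: 1 feature per time-step
    lstm_params(800, 800),    # second LSTM: fed by 800 outputs
    lstm_params(800, 800),    # third LSTM
    dense_params(800, 3437),  # softmax layer over 3437 classes
]
total_params = sum(layer_counts)
print(layer_counts, total_params)
```

Most of the weights sit in the second and third LSTM layers, so reducing the number of neurons per layer is the quickest way to shrink the model.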
To train the model, we can simply use the fit() method:
model.fit(X, y, batch_size=64, epochs=10, verbose=1)
Here again, you can play around with different values for batch_size and epochs. The model can take some time to train.
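To get a feel for how batch_size and epochs interact, the following back-of-the-envelope sketch counts the weight updates performed during training. It assumes that the final, partial batch of each epoch is also processed; the helper name steps_per_epoch is hypothetical.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    # Number of batches needed to see every sample once, counting a partial last batch
    return math.ceil(n_samples / batch_size)

n_samples = 17150   # number of input sequences in the article
batch_size = 64
epochs = 10

updates_per_epoch = steps_per_epoch(n_samples, batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```

A larger batch_size means fewer, smoother updates per epoch, while more epochs means more passes over the same 17150 sequences.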
To make predictions, we will randomly select a sequence from the input_sequence list, convert it into a 3-dimensional shape and then pass it to the predict() method of the trained model. The model will return an array of probabilities, one per word index, and the index with the highest probability is taken as the index value of the next word. The index value is then passed to the index_2_word dictionary, where the word index is used as a key. The index_2_word dictionary will return the word that belongs to the index that is passed as a key to the dictionary.
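The decoding step described above can be sketched in plain Python. The probabilities and the three-word vocabulary below are made-up illustrative values, not output from the trained model, and all names prefixed with toy_ are hypothetical.

```python
# Toy vocabulary (assumption); index 0 is unused, as in the real dictionary
toy_word_2_index = {"and": 1, "the": 2, "sleepe": 3}

# Swap keys and values to go from an index back to a word
toy_index_2_word = {index: word for word, index in toy_word_2_index.items()}

# Made-up probabilities for indexes 0..3, standing in for one row of model output
toy_probabilities = [0.05, 0.60, 0.25, 0.10]

# argmax: the position holding the largest probability
predicted_index = max(range(len(toy_probabilities)), key=lambda i: toy_probabilities[i])
predicted_word = toy_index_2_word[predicted_index]
print(predicted_index, predicted_word)
```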
The following script randomly selects a sequence of integers and then prints the corresponding sequence of words:
random_seq_index = np.random.randint(0, len(input_sequence)-1)
random_seq = input_sequence[random_seq_index]
index_2_word = dict(map(reversed, word_2_index.items()))
word_sequence = [index_2_word[value] for value in random_seq]
print(' '.join(word_sequence))
For the script in this article, the following sequence was randomly selected. The sequence generated for you will most likely be different than this one:
amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane
In the above script, the index_2_word dictionary is created by simply reversing the word_2_index dictionary. In this case, reversing a dictionary refers to the process of swapping keys with values.
Next, we will predict the next 100 words that follow the above sequence of words:
for i in range(100):
    # Reshape the current window to (1, time-steps, 1) and scale it like the training data
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)
    # Pick the index with the highest predicted probability
    predicted_word_index = model.predict(int_sample, verbose=0)
    predicted_word_id = np.argmax(predicted_word_index)
    word_sequence.append(index_2_word[predicted_word_id])
    # Slide the window: append the prediction and drop the oldest word
    random_seq.append(predicted_word_id)
    random_seq = random_seq[1:len(random_seq)]
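The loop above keeps the input at a fixed length by appending each prediction and dropping the oldest word. Here is a minimal sketch of that rolling window, where toy_predict is a hypothetical stand-in for model.predict that simply favors the index after the last one in the window.

```python
def toy_predict(window, vocab_size):
    # Hypothetical stand-in for model.predict: puts all probability on
    # "last index + 1", wrapping around so index 0 is never chosen
    probs = [0.0] * vocab_size
    next_index = window[-1] % (vocab_size - 1) + 1
    probs[next_index] = 1.0
    return probs

def generate(seed, steps, vocab_size):
    window = list(seed)  # copy so the seed sequence is not modified
    generated = []
    for _ in range(steps):
        probs = toy_predict(window, vocab_size)
        next_index = max(range(vocab_size), key=lambda i: probs[i])  # argmax
        generated.append(next_index)
        window = window[1:] + [next_index]  # drop oldest, append newest
    return generated, window

seed = [1, 2, 3]
generated, final_window = generate(seed, steps=4, vocab_size=6)
print(generated, final_window)
```

Because every new input contains the model's own earlier predictions, a model that favors one word can quickly fill the whole window with it.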
The word_sequence variable now contains our input sequence of words, along with the next 100 predicted words. It holds the words in the form of a list. We can simply join the words in the list to get the final output sequence, as shown below:
final_output = ""
for word in word_sequence:
    final_output = final_output + " " + word
print(final_output)
Here is the final output:
amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and
The output doesn't look very good yet, and it seems that our model has only learned to predict a single word, i.e. "and". However, you get the idea about how to create a text generation model with Keras. To improve the results, I have the following recommendations for you:
- Change the hyperparameters, including the size and number of LSTM layers and the number of epochs, to see if you get better results.
- Try to remove stop words like "are" from the training set to generate words other than stop words in the test set (although this will depend on the type of application).
- Create a character-level text generation model that predicts the next characters.
To practice further, I would recommend that you try to develop a text generation model with the other datasets from the Gutenberg corpus.
In this article, we saw how to create a text generation model using deep learning with Python's Keras library. Though the model developed in this article is not perfect, the article conveys the idea of how to generate text with deep learning.