Transformers, even though released in 2017, have only started gaining significant traction in the last couple of years. With the proliferation of the technology through platforms like HuggingFace, NLP and Large Language Models (LLMs) have become more accessible than ever.
Yet - even with all the hype around them and the many theory-oriented guides, there aren't many custom implementations online, and the resources aren't as readily available as for some other network types that have been around for longer. While you could simplify your work cycle by using a pre-built Transformer from HuggingFace (the topic of another guide) - you can get a feel for how it works by building one yourself, before abstracting it away through a library. We'll be focusing on building, rather than theory and optimization, here.
In this guide, we'll be building an Autoregressive Language Model to generate text. We'll be focusing on the practical and minimalistic/concise aspects of loading data, splitting it, vectorizing it, building a model, writing a custom callback and training/inference. Each of these tasks can be spun off into more detailed guides, so we'll keep the implementation as a generic one, leaving room for customization and optimization depending on your own dataset.
Types of LLMs and GPT-Fyodor
While categorization can get much more intricate - Transformer-based language models can be broadly grouped into three categories:
- Encoder-Based Models - ALBERT, BERT, DistilBERT, RoBERTa
- Decoder-Based - GPT, GPT-2, GPT-3, TransformerXL
- Seq2Seq Models - BART, mBART, T5
Encoder-based models only use a Transformer encoder in their architecture (typically, stacked) and are great for understanding sentences (classification, named entity recognition, question answering).
Decoder-based models only use a Transformer decoder in their architecture (also typically stacked) and are great at predicting the next element in a sequence, which makes them suitable for text generation.
Seq2Seq models combine both encoders and decoders and are great at text generation, summarization and most importantly - translation.
The GPT family of models, which gained a lot of traction in the past couple of years, are decoder-based Transformer models. Trained on large corpora of data and given a prompt as a starting seed for generation, they're great at producing human-like text. For instance:
generate_text('the truth ultimately is')
Which under the hood feeds this prompt into a GPT-like model, and produces:
'the truth ultimately is really a joy in history, this state of life through which is almost invisible, superfluous teleological...'
This is, in fact, a small spoiler from the end of the guide! Another small spoiler is the architecture that produced that text:
inputs = layers.Input(shape=(maxlen,))
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(vocab_size, maxlen, embed_dim)(inputs)
transformer_block = keras_nlp.layers.TransformerDecoder(embed_dim, num_heads)(embedding_layer)
outputs = layers.Dense(vocab_size, activation='softmax')(transformer_block)
model = keras.Model(inputs=inputs, outputs=outputs)
5 lines is all it takes to build a decoder-only transformer model - simulating a small GPT. Since we'll be training the model on Fyodor Dostoyevsky's novels (which you can substitute with anything else, from Wikipedia to Reddit comments) - we'll tentatively call the model GPT-Fyodor.
KerasNLP
The trick to a 5-line GPT-Fyodor lies in KerasNLP, developed by the official Keras team as a horizontal extension to Keras, which, in true Keras fashion, aims to bring industry-strength NLP to your fingertips, with new layers (encoders, decoders, token embeddings, position embeddings, metrics, tokenizers, etc.).
KerasNLP isn't a model zoo. It's a part of Keras (as a separate package), that lowers the barrier to entry for NLP model development, just as it lowers the barrier to entry for general deep learning development with the main package.
Note: As of writing, KerasNLP is still under active development and in its early stages, so subtle differences might be present in future versions. This writeup uses version 0.3.0.
To be able to use KerasNLP, you'll have to install it via pip:
$ pip install keras_nlp
And you can verify the version with:
keras_nlp.__version__
# 0.3.0
Implementing a GPT-Style Model with Keras
Let's start out by importing the libraries we'll be using - TensorFlow, Keras, KerasNLP and NumPy:
import tensorflow as tf
from tensorflow import keras
import keras_nlp
import numpy as np
Loading Data
Let's load in a few of Dostoyevsky's novels - a single one would be far too short for a model to fit without a fair bit of overfitting from the early stages onward. We'll be using the raw text files from Project Gutenberg, due to the simplicity of working with such data:
crime_and_punishment_url = 'https://www.gutenberg.org/files/2554/2554-0.txt'
brothers_of_karamazov_url = 'https://www.gutenberg.org/files/28054/28054-0.txt'
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
the_possessed_url = 'https://www.gutenberg.org/files/8117/8117-0.txt'
paths = [crime_and_punishment_url, brothers_of_karamazov_url, the_idiot_url, the_possessed_url]
names = ['Crime and Punishment', 'Brothers of Karamazov', 'The Idiot', 'The Possessed']
texts = ''
for index, path in enumerate(paths):
    filepath = keras.utils.get_file(f'{names[index]}.txt', origin=path)
    text = ''
    with open(filepath, encoding='utf-8') as f:
        text = f.read()
    # The first ~10k characters of each file are roughly the Project Gutenberg
    # intro and the preface, so we skip them
    texts += text[10000:]
We've simply downloaded all of the files, gone through them and concatenated them one on top of the other. This includes some diversity in the language used, while still keeping it distinctly Fyodor! For each file, we've skipped the first 10k characters, which is around the average length of the preface and Gutenberg intro, so we're left with a largely intact body of the book for each iteration. Let's take a look at a random 500 characters of the texts string now:
# 500 characters
texts[25000:25500]
'nd that was why\nI addressed you at once. For in unfolding to you the story of my life, I\ndo not wish to make myself a laughing-stock before these idle listeners,\nwho indeed know all about it already, but I am looking for a man\nof feeling and education. Know then that my wife was educated in a\nhigh-class school for the daughters of noblemen, and on leaving she\ndanced the shawl dance before the governor and other personages for\nwhich she was presented with a gold medal and a certificate of merit.\n'
Let's separate the string into sentences before doing any other processing:
text_list = texts.split('.')
len(text_list) # 69181
We've got 69k sentences. When you replace the \n characters with whitespaces and count the words:
len(texts.replace('\n', ' ').split(' ')) # 1077574
Note: You'll generally want to have at least a million words in a dataset, and ideally, much much more than that. We're working with a few megabytes of data (~5MB) while language models are more commonly trained on tens of gigabytes of text. This will, naturally, make it really easy to overfit the text input and hard to generalize (high perplexity without overfitting, or low perplexity with a lot of overfitting). Take the results with a grain of salt.
Nevertheless, let's split these into a training, test and validation set. First, let's remove the empty strings and shuffle the sentences:
# Filter out empty strings ('') that commonly appear due to the books' formatting
text_list = list(filter(None, text_list))
import random
random.shuffle(text_list)
Then, we'll do a 70/15/15 split:
length = len(text_list)
text_train = text_list[:int(0.7*length)]
text_test = text_list[int(0.7*length):int(0.85*length)]
text_valid = text_list[int(0.85*length):]
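As a quick sanity check - the three slices should add up to the full filtered sentence count, in roughly a 70/15/15 ratio:
print(len(text_train), len(text_test), len(text_valid))
# Roughly 70%, 15% and 15% of the ~69k sentences (minus the filtered-out empty strings)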
This is a simple, yet effective way to perform a train-test-validation split. Let's take a peek at text_train:
[' It was a dull morning, but the snow had ceased',
'\n\n"Pierre, you who know so much of what goes on here, can you really have\nknown nothing of this business and have heard nothing about it?"\n\n"What? What a set! So it\'s not enough to be a child in your old age,\nyou must be a spiteful child too! Varvara Petrovna, did you hear what he\nsaid?"\n\nThere was a general outcry; but then suddenly an incident took place\nwhich no one could have anticipated', ...
Time for standardization and vectorization!
Text Vectorization
Networks don't understand words - they understand numbers. We'll want to "tokenize" the words:
...
sequence = ['I', 'am', 'Wall-E']
sequence = tokenize(sequence)
print(sequence) # [4, 26, 472]
...
Also, since sentences differ in length - padding is typically added to the left or right to ensure the same shape across sentences being fed in. Say our longest sentence is 5 words (tokens) long. In that case, the Wall-E sentence would be padded by two zeros to ensure the same input shape:
sequence = pad_sequence(sequence)
print(sequence) # [4, 26, 472, 0, 0]
Traditionally, this was done using a TensorFlow Tokenizer and Keras' pad_sequences() methods - however, a much handier layer, TextVectorization, can be used, which "tokenizes" and pads your input, allowing you to extract the vocabulary and its size, without knowing the vocab upfront!
Let's adapt and fit a TextVectorization layer:
from tensorflow.keras.layers import TextVectorization

def custom_standardization(input_string):
    sentence = tf.strings.lower(input_string)
    sentence = tf.strings.regex_replace(sentence, "\n", " ")
    return sentence

maxlen = 50
# You could also calculate the length of the longest sentence in the data instead:
# maxlen = max(len(sentence.split(' ')) for sentence in text_list)

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)

vectorize_layer.adapt(text_list)
vocab = vectorize_layer.get_vocabulary()
The custom_standardization() method can get a lot longer than this. We've simply lower-cased all input and replaced \n with " ". This is where you can really put in most of your preprocessing for text - and supply it to the vectorization layer through the optional standardize argument. Once you adapt() the layer to the text (NumPy array or list of texts) - you can get the vocabulary, as well as its size, from there:
vocab_size = len(vocab)
vocab_size # 49703
Finally, to "de-tokenize" words, we'll create an index_lookup
dictionary:
index_lookup = dict(zip(range(len(vocab)), vocab))
index_lookup[5] # of
It maps all of the tokens ([1, 2, 3, 4, ...]) to words in the vocabulary (['a', 'the', 'i', ...]). By passing in a key (token index), we can easily get the word back. You can now run the vectorize_layer() on any input and observe the vectorized sentences:
vectorize_layer(['hello world!'])
Which results in:
<tf.Tensor: shape=(1, 51), dtype=int64, numpy=
array([[ 1, 7509, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]], dtype=int64)>
Hello maps to index 1 - which, by default, is the out-of-vocabulary ([UNK]) slot, meaning "hello" itself doesn't appear in our corpus - while world has the index of 7509! The rest is the padding up to the maxlen we've set.
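Since we already have the index_lookup dictionary, we can also go the other way and turn a vectorized sentence back into words - a small round-trip sketch (note that padding and out-of-vocabulary tokens won't map back to the original words):
# Vectorize a sentence, drop the padding (token 0) and look each index back up in the vocabulary
tokens = vectorize_layer(['the truth ultimately is'])[0].numpy()
decoded = ' '.join(index_lookup[token] for token in tokens if token != 0)
print(decoded) # 'the truth ultimately is' - assuming all four words are in the vocabulary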
We have the means to vectorize text - now, let's create datasets from text_train, text_test and text_valid, using our vectorization layer as a conversion medium between words and vectors that can be fed into GPT-Fyodor.
Dataset Creation
We'll be creating a tf.data.Dataset for each of our sets, using from_tensor_slices() and providing a list of, well, tensor slices (sentences):
batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices(text_train)
train_dataset = train_dataset.shuffle(buffer_size=256)
train_dataset = train_dataset.batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices(text_test)
test_dataset = test_dataset.shuffle(buffer_size=256)
test_dataset = test_dataset.batch(batch_size)
valid_dataset = tf.data.Dataset.from_tensor_slices(text_valid)
valid_dataset = valid_dataset.shuffle(buffer_size=256)
valid_dataset = valid_dataset.batch(batch_size)
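Since the three pipelines are identical, you could also wrap the logic in a small helper and avoid the repetition - a sketch of the same steps:
def make_dataset(sentences):
    # Slice -> shuffle -> batch, exactly as above
    return (tf.data.Dataset.from_tensor_slices(sentences)
            .shuffle(buffer_size=256)
            .batch(batch_size))

train_dataset = make_dataset(text_train)
test_dataset = make_dataset(text_test)
valid_dataset = make_dataset(text_valid)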
Once created and shuffled (again, for good measure) - we can apply a preprocessing (vectorization and sequence splitting) function:
def preprocess_text(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y
train_dataset = train_dataset.map(preprocess_text)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.map(preprocess_text)
test_dataset = test_dataset.prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.map(preprocess_text)
valid_dataset = valid_dataset.prefetch(tf.data.AUTOTUNE)
The preprocess_text() function simply expands the input by the last dimension, "vectorizes" the text using our vectorize_layer and creates the inputs and targets, offset by a single token. The model will use tokens [0..n] to infer token n+1, yielding a prediction for each position that accounts for all of the tokens before it. Let's take a look at a single entry in any of the datasets:
for entry in train_dataset.take(1):
    print(entry)
Investigating the returned inputs and targets, in batches of 64 (with a sequence length of 50 each), we can clearly see how they're offset by one:
(<tf.Tensor: shape=(64, 50), dtype=int64, numpy=
array([[17018, 851, 2, ..., 0, 0, 0],
[ 330, 74, 4, ..., 0, 0, 0],
[ 68, 752, 30273, ..., 0, 0, 0],
...,
[ 7, 73, 2004, ..., 0, 0, 0],
[ 44, 42, 67, ..., 0, 0, 0],
[ 195, 252, 102, ..., 0, 0, 0]], dtype=int64)>, <tf.Tensor: shape=(64, 50), dtype=int64, numpy=
array([[ 851, 2, 8289, ..., 0, 0, 0],
[ 74, 4, 34, ..., 0, 0, 0],
[ 752, 30273, 7514, ..., 0, 0, 0],
...,
[ 73, 2004, 31, ..., 0, 0, 0],
[ 42, 67, 76, ..., 0, 0, 0],
[ 252, 102, 8596, ..., 0, 0, 0]], dtype=int64)>)
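As a quick sanity check - since the targets are just the inputs shifted by one token, the overlapping slices should match exactly:
for x_batch, y_batch in train_dataset.take(1):
    # Every target token is the next input token, so the shifted slices are identical
    print(tf.reduce_all(x_batch[:, 1:] == y_batch[:, :-1]).numpy()) # True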
Finally - it's time to build the model!
Model Definition
We'll make use of KerasNLP layers here. After an Input, we'll encode the input through a TokenAndPositionEmbedding layer, passing in our vocab_size, maxlen and embed_dim. The embed_dim that this layer outputs is retained through the TransformerDecoder - as of writing, the decoder automatically maintains the input dimensionality and doesn't allow you to project it into a different output dimension, though it does let you define the latent (feed-forward) dimensions through the intermediate_dim argument.
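To see that dimensionality-preserving behavior concretely, here's a quick shape check - a small sketch using the same hyperparameters we define right below (an embedding dimension of 128 and 4 attention heads):
sample_tokens = tf.zeros((1, 50), dtype=tf.int32)
sample_embedding = keras_nlp.layers.TokenAndPositionEmbedding(49703, 50, 128)(sample_tokens)
sample_decoding = keras_nlp.layers.TransformerDecoder(intermediate_dim=128, num_heads=4)(sample_embedding)

print(sample_embedding.shape) # (1, 50, 128)
print(sample_decoding.shape)  # (1, 50, 128) - same as the embedding, regardless of intermediate_dim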
We'll keep the latent representation (intermediate_dim) equal to the embedding dimension here - the stacked variant further down doubles it - but you can just as well use a number entirely detached from the embedding dims:
embed_dim = 128
num_heads = 4

def create_model():
    inputs = keras.layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(vocab_size, maxlen, embed_dim)(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim,
                                                  num_heads=num_heads,
                                                  dropout=0.5)(embedding_layer)
    outputs = keras.layers.Dense(vocab_size, activation='softmax')(decoder)

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer="adam",
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model = create_model()
model.summary()
On top of the decoder, we have a Dense layer to choose the next word in the sequence, with a softmax activation (which produces the probability distribution for each next token). Let's take a look at the summary of the model:
Model: "model_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_6 (InputLayer) [(None, 30)] 0
token_and_position_embeddin (None, 30, 128) 6365824
g_5 (TokenAndPositionEmbedd
ing)
transformer_decoder_5 (Tran (None, 30, 128) 132480
sformerDecoder)
dense_5 (Dense) (None, 30, 49703) 6411687
=================================================================
Total params: 13,234,315
Trainable params: 13,234,315
Non-trainable params: 0
_________________________________________________________________
GPT-2 stacks many decoders - GPT-2 Small has 12 stacked decoders (117M params), while GPT-2 Extra Large has 48 stacked decoders (1.5B params). Our single-decoder model with a humble 13M parameters should work well enough for educational purposes. With LLMs - scaling up has proven to be an exceedingly good strategy, and Transformers allow for good scaling, making it feasible to train extremely large models.
GPT-3 has a "meager" 175B parameters. Google Brain's team trained a 1.6T parameter model to perform sparsity research while keeping computation on the same level as much smaller models.
As a matter of fact - if we stacked several decoders, say, four instead of one, and added some extra dropout on top:
def create_model():
    inputs = keras.layers.Input(shape=(maxlen,), dtype=tf.int32)
    x = keras_nlp.layers.TokenAndPositionEmbedding(vocab_size, maxlen, embed_dim)(inputs)
    for i in range(4):
        x = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim*2, num_heads=num_heads, dropout=0.5)(x)
    do = keras.layers.Dropout(0.4)(x)
    outputs = keras.layers.Dense(vocab_size, activation='softmax')(do)
    model = keras.Model(inputs=inputs, outputs=outputs)
    # Compile and return as before
    return model
Our parameter count would be increased by around 400k:
Total params: 13,631,755
Trainable params: 13,631,755
Non-trainable params: 0
Most of the parameters in our network come from the TokenAndPositionEmbedding and Dense layers!
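A quick back-of-the-envelope calculation shows why - both the token embedding and the output projection scale with the vocabulary size (a rough sketch using our hyperparameters):
token_embedding_params = vocab_size * embed_dim            # ~6.36M
position_embedding_params = maxlen * embed_dim             # a few thousand
dense_output_params = embed_dim * vocab_size + vocab_size  # ~6.41M (weights + biases)
print(token_embedding_params + position_embedding_params + dense_output_params) # ~12.8M of the ~13.2M total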
Try out different depths of the decoder - from 1 all the way up to as many as your machine can handle - and note the results. In any case - we're almost ready to train the model! Let's create a custom callback that'll produce a sample of text on each epoch, so we can see how the model learns to form sentences through training.
Custom Callback
class TextSampler(keras.callbacks.Callback):
    def __init__(self, start_prompt, max_tokens):
        self.start_prompt = start_prompt
        self.max_tokens = max_tokens

    # Helper method to choose a word from the top K most probable words,
    # weighted by their probabilities
    def sample_token(self, logits):
        logits, indices = tf.math.top_k(logits, k=5, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def on_epoch_end(self, epoch, logs=None):
        decoded_sample = self.start_prompt

        for i in range(self.max_tokens-1):
            tokenized_prompt = vectorize_layer([decoded_sample])[:, :-1]
            predictions = self.model.predict([tokenized_prompt], verbose=0)
            # The prediction at position i is the model's guess for the token at position i+1,
            # so the prediction for the last word of the current sample (at index
            # len(decoded_sample.split()) - 1) is the next word in the sequence
            sample_index = len(decoded_sample.strip().split())-1

            sampled_token = self.sample_token(predictions[0][sample_index])
            sampled_token = index_lookup[sampled_token]
            decoded_sample += " " + sampled_token

        print(f"\nSample text:\n{decoded_sample}...\n")

# First few words of a random validation sentence, used as a seed
random_sentence = ' '.join(random.choice(text_valid).replace('\n', ' ').split(' ')[:4])
sampler = TextSampler(random_sentence, 30)
reducelr = keras.callbacks.ReduceLROnPlateau(patience=10, monitor='val_loss')
Training the Model
Finally, time to train! Let's chuck in our train_dataset and valid_dataset with the callbacks in place:
model = create_model()
history = model.fit(train_dataset,
                    validation_data=valid_dataset,
                    epochs=10,
                    callbacks=[sampler, reducelr])
The sampler chose an unfortunate sentence that starts with the end quote and start quote, but it still produces interesting results while training:
# Epoch training
Epoch 1/10
658/658 [==============================] - ETA: 0s - loss: 2.7480 - perplexity: 15.6119 - accuracy: 0.6711
# on_epoch_end() sample generation
Sample text:
" "What do you had not been i had been the same man was not be the same eyes to been a whole man and he did a whole man to the own...
# Validation
658/658 [==============================] - 158s 236ms/step - loss: 2.7480 - perplexity: 15.6119 - accuracy: 0.6711 - val_loss: 2.2130 - val_perplexity: 9.1434 - val_accuracy: 0.6864 - lr: 0.0010
...
Sample text:
" "What do you know it is it all this very much as i should not have a great impression in the room to be able of it in my heart...
658/658 [==============================] - 149s 227ms/step - loss: 1.7753 - perplexity: 5.9019 - accuracy: 0.7183 - val_loss: 2.0039 - val_perplexity: 7.4178 - val_accuracy: 0.7057 - lr: 0.0010
It starts with:
"What do you had not been i had been the same"...
Which doesn't really make much sense. By the end of the ten short epochs, it produces something along the lines of:
"What do you mean that is the most ordinary man of a man of course"...
While the second sentence still doesn't make too much sense - it's much more coherent than the first. Longer training on more data (with more intricate preprocessing steps) would yield better results. We've only trained it for 10 epochs with high dropout to combat the small dataset size. If it were left training for much longer, it would produce very Fyodor-like text - largely because it would've memorized large chunks of it.
Note: Since the output is fairly verbose, you can tweak the verbose argument while fitting the model to reduce the amount of text printed to the screen.
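For instance, verbose=2 prints a single summary line per epoch instead of a live progress bar:
history = model.fit(train_dataset,
                    validation_data=valid_dataset,
                    epochs=10,
                    verbose=2,
                    callbacks=[sampler, reducelr])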
Model Inference
To perform inference, we'll want to replicate the interface of the TextSampler - a method that accepts a seed and a response_length (max_tokens). We'll use the same methods as within the sampler:
def sample_token(logits):
    logits, indices = tf.math.top_k(logits, k=5, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)

def generate_text(prompt, response_length=20):
    decoded_sample = prompt
    for i in range(response_length-1):
        tokenized_prompt = vectorize_layer([decoded_sample])[:, :-1]
        predictions = model.predict([tokenized_prompt], verbose=0)
        sample_index = len(decoded_sample.strip().split())-1

        sampled_token = sample_token(predictions[0][sample_index])
        sampled_token = index_lookup[sampled_token]
        decoded_sample += " " + sampled_token
    return decoded_sample
Now, you can run the method on new samples:
generate_text('the truth ultimately is')
# 'the truth ultimately is really a joy in history, this state of life through which is almost invisible, superfluous teleological'
generate_text('the truth ultimately is')
# 'the truth ultimately is not to make it a little thing to go into your own life for some'
Improving Results?
So, how can you improve results? There are some pretty actionable things you could do:
- Data cleaning (clean the input data more meticulously - we just trimmed an approximate number of characters from the start and removed newline characters; see the sketch after this list)
- Get more data (we only worked with a few megabytes of text data)
- Scale the model alongside the data (stacking decoders isn't hard!)
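For the data cleaning point, for instance - rather than blindly chopping off the first 10k characters, you could strip the Project Gutenberg boilerplate by looking for the start/end markers present in the files. A rough sketch, assuming the files keep Gutenberg's usual "*** START OF ..." / "*** END OF ..." markers (the exact wording varies between files, so treat this as a starting point):
def strip_gutenberg_boilerplate(text):
    # Gutenberg files wrap the actual book between "*** START OF ..." and "*** END OF ..." lines
    start = text.find('*** START OF')
    end = text.find('*** END OF')
    if start == -1 or end == -1:
        # Markers not found - fall back to the original text
        return text
    # Skip past the start-marker line itself
    start = text.find('\n', start) + 1
    return text[start:end]

# Usage, in place of text[10000:] in the loading loop:
# texts += strip_gutenberg_boilerplate(text)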
Conclusion
While the preprocessing pipeline is minimalistic and can be improved - the pipeline outlined in this guide produced a decent GPT-style model, with just 5 lines of code required to build a custom decoder-only transformer, using Keras!
Transformers are popular and widely applicable for generic sequence modeling (and many things can be expressed as sequences). Until recently, the main barrier to entry was a cumbersome implementation, but with KerasNLP - deep learning practitioners can leverage ready-made building blocks to construct models quickly and easily.