Introduction
There are plenty of guides explaining how transformers work and building intuition for a key element of them: token and position embedding.
Positional embedding allows transformers to represent non-rigid relationships between tokens (usually, words), which is much better suited to modeling our context-driven language. While the process is relatively simple, it's also fairly generic, and the implementations quickly become boilerplate.
In this short guide, we'll take a look at how we can use KerasNLP, the official Keras add-on, to perform PositionEmbedding and TokenAndPositionEmbedding.
KerasNLP
KerasNLP is a horizontal addition to Keras for natural language processing. As of writing, it's still very young, at version 0.3, and the documentation is still fairly brief, but the package is already more than just usable.
It provides access to Keras layers such as TokenAndPositionEmbedding, TransformerEncoder and TransformerDecoder, which make building custom transformers easier than ever.
To use KerasNLP in your project, you can install it via pip:
$ pip install keras_nlp
Once imported into the project, you can use any keras_nlp layer as a standard Keras layer.
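The code snippets throughout this guide assume the usual imports, alongside keras_nlp itself:

import tensorflow as tf
from tensorflow import keras
import keras_nlp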
If you're interested in applying KerasNLP layers further and building a Transformer-decoder text generator, read our "5-Line GPT-Style Text Generation in Python with TensorFlow/Keras"!
Tokenization
Computers work with numbers. We voice our thoughts in words. To allow computers to crunch through them, we'll have to map words to numbers in some form.
A common way to do this is to simply map words to numbers where each integer represents a word. A corpus of words creates a vocabulary, and each word in the vocabulary gets an index. Thus, you can turn a sequence of words into a sequence of indices known as tokens:
def tokenize(sequence):
    # ...
    return tokenized_sequence

sequence = ['I', 'am', 'Wall-E']
sequence = tokenize(sequence)
print(sequence) # [4, 26, 472]
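For instance, a minimal sketch of such a tokenize() function, assuming a small, hypothetical, pre-built vocabulary dictionary, could look like this:

# A minimal sketch, assuming a hypothetical pre-built vocabulary dictionary
vocabulary = {'<unk>': 1, 'I': 4, 'am': 26, 'Wall-E': 472}

def tokenize(sequence):
    # Map each word to its index, falling back to the unknown token
    return [vocabulary.get(word, vocabulary['<unk>']) for word in sequence]

print(tokenize(['I', 'am', 'Wall-E'])) # [4, 26, 472]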
With Keras, tokenization is typically done via the TextVectorization layer, which works wonderfully for a wide variety of inputs and supports several output modes (the default one being int, which works as previously described):
# Instantiate
vectorize = keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Adapt to text
vectorize.adapt(text_dataset)

# Vectorize new input
vectorized_text = vectorize(['some input'])
You can use this layer as a standalone preprocessing layer or as part of a Keras model, making the preprocessing truly end-to-end so you can supply raw input to the model. This guide is focused on token embedding, not tokenization, so I won't dive further into the layer here; it will be the main topic of another guide.
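As a minimal sketch of the end-to-end approach, reusing the vectorize layer and max_features from above (the downstream layers here are just placeholders), the model can accept raw strings directly:

# A minimal sketch; the Embedding/pooling/Dense layers are placeholder choices
inputs = keras.Input(shape=(1,), dtype=tf.string)
x = vectorize(inputs)                             # Raw strings -> token indices
x = keras.layers.Embedding(max_features, 16)(x)   # Token indices -> dense vectors
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model(tf.constant([['some raw input']]))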
This sequence of tokens can then be embedded into dense vectors that define the tokens in latent space:
[[4], [26], [472]] -> [[0.5, 0.25], [0.73, 0.2], [0.1, -0.75]]
This is typically done with the Embedding layer in Keras. Transformers, however, don't encode using only a standard Embedding layer. They perform Embedding and PositionEmbedding, and add them together, displacing the regular embeddings by their position in latent space.
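As a quick standalone illustration (with a hypothetical vocabulary size of 500 and 2 output dimensions), an Embedding layer maps the token indices from earlier to randomly initialized dense vectors:

# Illustration only: hypothetical vocabulary size of 500, 2 embedding dimensions
embedding = keras.layers.Embedding(input_dim=500, output_dim=2)
tokens = tf.constant([[4, 26, 472]])
embedding(tokens).shape # (1, 3, 2): a batch of 1 sequence, 3 tokens, 2 dimensions each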
With KerasNLP, performing TokenAndPositionEmbedding combines regular token embedding (Embedding) with positional embedding (PositionEmbedding).
PositionEmbedding
Let's take a look at PositionEmbedding first. It accepts tensors and ragged tensors, and assumes that the final dimension represents the features, while the second-to-last dimension represents the sequence:
# (sequence, features)
(5, 10)
The layer accepts a sequence_length argument, denoting, well, the length of the input and output sequence. Let's go ahead and positionally embed a random uniform tensor:
seq_length = 5
# Note: no batch dimension here, so the output below is a single (sequence, features) tensor
input_data = tf.random.uniform(shape=[5, 10])

input_tensor = keras.Input(shape=(seq_length, 10))
output = keras_nlp.layers.PositionEmbedding(sequence_length=seq_length)(input_tensor)
model = keras.Model(inputs=input_tensor, outputs=output)

model(input_data)
This results in:
<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[ 0.23758471, -0.16798696, -0.15070847, 0.208067 , -0.5123104 ,
-0.36670157, 0.27487397, 0.14939266, 0.23843127, -0.23328197],
[-0.51353353, -0.4293166 , -0.30189738, -0.140344 , -0.15444171,
-0.27691704, 0.14078277, -0.22552207, -0.5952263 , -0.5982155 ],
[-0.265581 , -0.12168896, 0.46075982, 0.61768025, -0.36352775,
-0.14212841, -0.26831496, -0.34448475, 0.4418767 , 0.05758983],
[-0.46500492, -0.19256318, -0.23447984, 0.17891657, -0.01812166,
-0.58293337, -0.36404118, 0.54269964, 0.3727749 , 0.33238482],
[-0.2965023 , -0.3390794 , 0.4949159 , 0.32005525, 0.02882379,
-0.15913549, 0.27996767, 0.4387421 , -0.09119213, 0.1294356 ]],
dtype=float32)>
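Since keras_nlp layers behave like standard Keras layers, you can also call PositionEmbedding directly on a tensor without building a Model around it; a minimal sketch:

# Calling the layer directly on a (sequence, features) tensor
position_embedding = keras_nlp.layers.PositionEmbedding(sequence_length=5)
position_embedding(tf.random.uniform(shape=[5, 10])).shape # (5, 10)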
TokenAndPositionEmbedding
Token and position embedding boils down to using Embedding on the input sequence, PositionEmbedding on the embedded tokens, and then adding these two results together, effectively displacing the token embeddings in space to encode their meaningful, relative relationships.
This can technically be done as:
seq_length = 10
vocab_size = 25
embed_dim = 10

# Token IDs should be integers in [0, vocab_size)
input_data = tf.random.uniform(shape=[5, 10], maxval=vocab_size, dtype=tf.int32)

input_tensor = keras.Input(shape=(seq_length,), dtype="int32")
embedding = keras.layers.Embedding(vocab_size, embed_dim)(input_tensor)
position = keras_nlp.layers.PositionEmbedding(sequence_length=seq_length)(embedding)
output = keras.layers.add([embedding, position])
model = keras.Model(inputs=input_tensor, outputs=output)

model(input_data).shape # ([5, 10, 10])
The inputs are embedded, then positionally embedded, and the two results are added together, producing a new, positionally embedded tensor. Alternatively, you can leverage the TokenAndPositionEmbedding layer, which does this under the hood:
... # https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/layers/token_and_position_embedding.py

def call(self, inputs):
    embedded_tokens = self.token_embedding(inputs)
    embedded_positions = self.position_embedding(embedded_tokens)
    outputs = embedded_tokens + embedded_positions
    return outputs
This makes it much cleaner to perform TokenAndPositionEmbedding:
seq_length = 10
vocab_size = 25
embed_dim = 10

# Token IDs should be integers in [0, vocab_size)
input_data = tf.random.uniform(shape=[5, 10], maxval=vocab_size, dtype=tf.int32)

input_tensor = keras.Input(shape=(seq_length,), dtype="int32")
output = keras_nlp.layers.TokenAndPositionEmbedding(vocabulary_size=vocab_size,
                                                    sequence_length=seq_length,
                                                    embedding_dim=embed_dim)(input_tensor)
model = keras.Model(inputs=input_tensor, outputs=output)

model(input_data).shape # ([5, 10, 10])
The data we've passed into the layer is now positionally embedded in a latent space of 10 dimensions:
model(input_data)
<tf.Tensor: shape=(5, 10, 10), dtype=float32, numpy=
array([[[-0.01695484, 0.7656435 , -0.84340465, 0.50211895,
-0.3162892 , 0.16375223, -0.3774369 , -0.10028353,
-0.00136751, -0.14690581],
[-0.05646318, 0.00225556, -0.7745967 , 0.5233861 ,
-0.22601983, 0.07024342, 0.0905793 , -0.46133494,
-0.30130145, 0.451248 ],
...
Conclusions
Transformers have made large waves since 2017, and many great guides offer insight into how they work, yet they remain elusive to many due to the overhead of custom implementations. KerasNLP addresses this problem, providing building blocks that let you build flexible, powerful NLP systems, rather than pre-packaged solutions.
In this guide, we've taken a look at token and position embedding with Keras and KerasNLP.