Transformer Token and Position Embedding with Keras


There are plenty of guides explaining how transformers work, and many that build an intuition for one of their key elements: token and position embedding.

Positionally embedding tokens allows transformers to represent non-rigid relationships between tokens (usually, words), which is much better suited to modeling context-driven language. While the process is relatively simple, it's fairly generic, and implementations quickly become boilerplate.

In this short guide, we'll take a look at how we can use KerasNLP, the official Keras add-on, to perform PositionEmbedding and TokenAndPositionEmbedding.


KerasNLP is a horizontal addition for NLP. As of writing, it's still very young, at version 0.3, and the documentation is still fairly brief, but the package is more than just usable already.

It provides access to Keras layers, such as TokenAndPositionEmbedding, TransformerEncoder and TransformerDecoder, which makes building custom transformers easier than ever.

To use KerasNLP in your project, you can install it via pip:

$ pip install keras_nlp

Once imported into the project, you can use any keras_nlp layer as a standard Keras layer.

If you're interested in applying KerasNLP layers further and building a Transformer-decoder text generator, read our "5-Line GPT-Style Text Generation in Python with TensorFlow/Keras"!


Computers work with numbers. We voice our thoughts in words. To allow computers to crunch through them, we'll have to map words to numbers in some form.

A common way to do this is to simply map words to numbers where each integer represents a word. A corpus of words creates a vocabulary, and each word in the vocabulary gets an index. Thus, you can turn a sequence of words into a sequence of indices known as tokens:

def tokenize(sequence):
    # ...
    return tokenized_sequence

sequence = ['I', 'am', 'Wall-E']
sequence = tokenize(sequence)
print(sequence) # [4, 26, 472]
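
The tokenize() function above is left as a stub. As a minimal sketch of what it might do, assuming a simple word-level vocabulary built from a toy corpus (the build_vocab helper, the corpus and the [UNK] entry are illustrative, so the resulting indices differ from the ones shown above):

```python
def build_vocab(corpus):
    """Assign each unique word an integer index, in order of first appearance."""
    vocab = {"[UNK]": 0}  # index 0 is reserved for out-of-vocabulary words
    for word in corpus:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def tokenize(sequence, vocab):
    """Map each word to its vocabulary index, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sequence]


corpus = ["I", "am", "Wall-E", "I", "am", "a", "robot"]
vocab = build_vocab(corpus)

print(tokenize(["I", "am", "Wall-E"], vocab))  # [1, 2, 3]
print(tokenize(["I", "am", "EVE"], vocab))     # [1, 2, 0]
```

Real tokenizers usually work on sub-word units and much larger vocabularies, but the idea of looking up an index per unit stays the same.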

This sequence of tokens can then be embedded into a dense vector that defines the tokens in latent space:

[[4], [26], [472]] -> [[0.5, 0.25], [0.73, 0.2], [0.1, -0.75]]

This is typically done with the Embedding layer in Keras. Transformers don't encode only using a standard Embedding layer. They perform Embedding and PositionEmbedding, and add them together, displacing the regular embeddings by their position in latent space.
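
To make the "add them together" step concrete, here's a rough NumPy sketch rather than the Keras implementation. It assumes both embeddings are plain lookup tables, one indexed by token ID and one by position; the random tables stand in for learned weights, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab_size, seq_length, embed_dim = 500, 3, 2

# Stand-ins for learned weights: one row per token ID and one row per position
token_table = rng.normal(size=(vocab_size, embed_dim))
position_table = rng.normal(size=(seq_length, embed_dim))

tokens = np.array([4, 26, 472])       # token IDs from the example above
positions = np.arange(len(tokens))    # [0, 1, 2]

token_embeddings = token_table[tokens]           # shape (3, 2)
position_embeddings = position_table[positions]  # shape (3, 2)

# Each token vector is displaced by the vector for its position
embedded = token_embeddings + position_embeddings
print(embedded.shape)  # (3, 2)
```

Because the position vector differs per index, the same token ends up at a different point in latent space depending on where it appears in the sequence.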

With KerasNLP, the TokenAndPositionEmbedding layer combines regular token embedding (Embedding) with positional embedding (PositionEmbedding).


Let's take a look at PositionEmbedding first. It accepts tensors and ragged tensors, and assumes that the final dimension represents the features, while the second-to-last dimension represents the sequence.

(5, 10)
# 5  -> sequence (second-to-last dimension)
# 10 -> features (final dimension)

The layer accepts a sequence_length argument, denoting, well, the length of the input and output sequence. Let's go ahead and positionally embed a random uniform tensor:

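import tensorflow as tf
import keras_nlp
from tensorflow import keras

seq_length = 5
input_data = tf.random.uniform(shape=[1, 5, 10])  # (batch, sequence, features)

input_tensor = keras.Input(shape=(5, 10))  # shape excludes the batch dimension
output = keras_nlp.layers.PositionEmbedding(sequence_length=seq_length)(input_tensor)
model = keras.Model(inputs=input_tensor, outputs=output)
model(input_data)[0]  # drop the batch dimension for display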

This results in:

<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[ 0.23758471, -0.16798696, -0.15070847,  0.208067  , -0.5123104 ,
        -0.36670157,  0.27487397,  0.14939266,  0.23843127, -0.23328197],
       [-0.51353353, -0.4293166 , -0.30189738, -0.140344  , -0.15444171,
        -0.27691704,  0.14078277, -0.22552207, -0.5952263 , -0.5982155 ],
       [-0.265581  , -0.12168896,  0.46075982,  0.61768025, -0.36352775,
        -0.14212841, -0.26831496, -0.34448475,  0.4418767 ,  0.05758983],
       [-0.46500492, -0.19256318, -0.23447984,  0.17891657, -0.01812166,
        -0.58293337, -0.36404118,  0.54269964,  0.3727749 ,  0.33238482],
       [-0.2965023 , -0.3390794 ,  0.4949159 ,  0.32005525,  0.02882379,
        -0.15913549,  0.27996767,  0.4387421 , -0.09119213,  0.1294356 ]], dtype=float32)>
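
It's worth noting where these values come from. As a rough NumPy sketch, assuming the layer stores one learned vector per position and uses only the shape of its input (the position_embedding function and the random table below are hypothetical stand-ins, not the KerasNLP implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

seq_length, feature_dim = 5, 10

# Stand-in for the learned weights: one vector per position
position_table = rng.uniform(-0.5, 0.5, size=(seq_length, feature_dim))


def position_embedding(inputs):
    """Return position vectors broadcast to the shape of `inputs`."""
    length = inputs.shape[-2]  # second-to-last axis is the sequence
    return np.broadcast_to(position_table[:length], inputs.shape)


a = position_embedding(rng.uniform(size=(5, 10)))
b = position_embedding(rng.uniform(size=(5, 10)))

print(a.shape)               # (5, 10)
print(np.array_equal(a, b))  # True
```

Under that assumption, two different inputs of the same shape produce the same output, which is why the positional embedding is added to the token embedding rather than used on its own.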


Token and position embedding boils down to using Embedding on the input sequence, PositionEmbedding on the embedded tokens, and then adding these two results together, effectively displacing the token embeddings in space to encode their relative meaningful relationships.

This can technically be done as:

seq_length = 10
vocab_size = 25
embed_dim = 10

# Token IDs must be integers in [0, vocab_size): 5 sequences of 10 tokens each
input_data = tf.random.uniform(shape=[5, seq_length], maxval=vocab_size, dtype=tf.int32)

# keras.Input's shape excludes the batch dimension
input_tensor = keras.Input(shape=(seq_length,), dtype="int32")
embedding = keras.layers.Embedding(vocab_size, embed_dim)(input_tensor)
position = keras_nlp.layers.PositionEmbedding(seq_length)(embedding)
output = keras.layers.add([embedding, position])
model = keras.Model(inputs=input_tensor, outputs=output)
model(input_data).shape # ([5, 10, 10])

The inputs are embedded, and then positionally embedded, after which the two results are added together, producing positionally embedded tokens. Alternatively, you can leverage the TokenAndPositionEmbedding layer, which does this under the hood:

# ...
def call(self, inputs):
    embedded_tokens = self.token_embedding(inputs)
    embedded_positions = self.position_embedding(embedded_tokens)
    outputs = embedded_tokens + embedded_positions
    return outputs

This makes it much cleaner to perform TokenAndPositionEmbedding:

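seq_length = 10
vocab_size = 25
embed_dim = 10

# Token IDs must be integers in [0, vocab_size): 5 sequences of 10 tokens each
input_data = tf.random.uniform(shape=[5, seq_length], maxval=vocab_size, dtype=tf.int32)

# keras.Input's shape excludes the batch dimension
input_tensor = keras.Input(shape=(seq_length,), dtype="int32")
output = keras_nlp.layers.TokenAndPositionEmbedding(vocabulary_size=vocab_size,
                                                     sequence_length=seq_length,
                                                     embedding_dim=embed_dim)(input_tensor)
model = keras.Model(inputs=input_tensor, outputs=output)
model(input_data).shape # ([5, 10, 10])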

The data we've passed into the layer is now positionally embedded in a latent space of 10 dimensions:

<tf.Tensor: shape=(5, 10, 10), dtype=float32, numpy=
array([[[-0.01695484,  0.7656435 , -0.84340465,  0.50211895,
         -0.3162892 ,  0.16375223, -0.3774369 , -0.10028353,
         -0.00136751, -0.14690581],
        [-0.05646318,  0.00225556, -0.7745967 ,  0.5233861 ,
         -0.22601983,  0.07024342,  0.0905793 , -0.46133494,
         -0.30130145,  0.451248  ],


Transformers have made a large wave since 2017, and many great guides offer insight into how they work, yet they remained elusive to many due to the overhead of custom implementations. KerasNLP addresses this problem, providing building blocks that let you build flexible, powerful NLP systems, rather than providing pre-packaged solutions.

In this guide, we've taken a look at token and position embedding with Keras and KerasNLP.

David Landup, Editor

© 2013-2022 Stack Abuse. All rights reserved.