Building a Transformer-Based, CNN-Powered Image Captioning Model
David Landup
With our datasets primed and ready to go - we can define the model. Using KerasNLP, we can fairly easily implement a transformer from scratch:
import keras
import keras_nlp

# Encoder
encoder_inputs = keras.Input(shape=(None,))
# Embed the tokens and their positions in the input sequence
x = keras_nlp.layers.TokenAndPositionEmbedding(...)(encoder_inputs)
encoder_outputs = keras_nlp.layers.TransformerEncoder(...)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

# Decoder
decoder_inputs = keras.Input(shape=(None,))
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM))
x = keras_nlp.layers.TokenAndPositionEmbedding(...)(decoder_inputs)
# The decoder attends to its own embedded inputs and to the encoder's output sequence
x = keras_nlp.layers.TransformerDecoder(...)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
# Map each position to per-token output predictions
decoder_outputs = keras.layers.Dense(...)(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

# The output of the transformer is the output of the decoder
transformer_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model([encoder_inputs, decoder_inputs], transformer_outputs)
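The arguments elided with ... above are the layer hyperparameters - a vocabulary size, sequence length, and embedding dimension for the embedding layers, an intermediate (feed-forward) dimension and head count for the encoder/decoder blocks, and a vocabulary-sized output projection. The values below are illustrative assumptions, not the exact configuration we'll settle on:
# Illustrative hyperparameters - assumptions, not fixed values
VOCAB_SIZE = 10000
SEQ_LEN = 25
EMBED_DIM = 512
INTERMEDIATE_DIM = 1024
NUM_HEADS = 8

keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM)
keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS)
keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS)
keras.layers.Dense(VOCAB_SIZE, activation="softmax")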
Each input is followed by a TokenAndPositionEmbedding() layer and then either a TransformerEncoder or a TransformerDecoder. The model is put together by feeding the encoder's output into the decoder, alongside the decoder's own input, which is token-and-position embedded. This architecture is pretty much a direct reflection of the diagram from the original paper. Now, we'll have to make a few tweaks here - transformers operate on sequences, and our images aren't sequences, nor are the feature maps output by a ConvNet.
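One common way to bridge that gap is to flatten the ConvNet's spatial grid, so a (height, width, channels) feature map becomes a (height * width, channels) sequence in which each spatial location acts as one "token". Below is a minimal sketch of that idea; the ResNet50 backbone, the image size, and the Dense projection to EMBED_DIM are illustrative assumptions rather than the exact setup we'll end up using:
import keras
from keras import layers

IMG_SIZE = 224     # assumed input resolution
EMBED_DIM = 512    # must match the transformer's embedding dimension

# Any convolutional backbone works; ResNet50 is used here purely for illustration
backbone = keras.applications.ResNet50(
    include_top=False, input_shape=(IMG_SIZE, IMG_SIZE, 3))

image_inputs = keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
feature_maps = backbone(image_inputs)                              # (batch, 7, 7, 2048)
# Flatten the 7x7 spatial grid into a sequence of 49 "tokens"
seq = layers.Reshape((-1, feature_maps.shape[-1]))(feature_maps)   # (batch, 49, 2048)
# Project each token to the embedding size the transformer expects
seq = layers.Dense(EMBED_DIM)(seq)                                 # (batch, 49, EMBED_DIM)
A sequence shaped like this can be fed to a TransformerEncoder, or passed to the decoder as its encoder_sequence, since it already has the (batch, sequence_length, embedding_dim) layout the transformer layers expect.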