Image Captioning with CNNs and Transformers with Keras

David Landup

Overview

In 1974, Ray Kurzweil's company developed the "Kurzweil Reading Machine" - an omni-font OCR machine used to read text out loud. This machine was meant for the blind, who couldn't read visually, but who could now enjoy entire books being read to them without laborious conversion to braille. It opened doors that had been closed to many for a long time. But what about images?

While giving a diagnosis from X-ray images, doctors also typically document findings such as:

"The lungs are clear. The heart and pulmonary are normal. Mediastinal contours are normal. Pleural spaces are clear. No acute cardiopulmonary disease."

Websites that catalog images and offer search capabilities can benefit from extracting captions of images and comparing their similarity to the search query. Virtual assistants could parse images as additional input to understand a user's intentions before providing an answer.
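As a rough illustration of the search use case, the sketch below ranks images by how similar their captions are to a query. It is a minimal example, not a production approach: the captions, file names and the helpers `tokenize`, `cosine_similarity` and `rank_images` are all hypothetical, and a plain bag-of-words cosine similarity stands in for whatever similarity measure a real search engine would use.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace and strip basic punctuation.
    words = (word.strip(".,!?\"'") for word in text.lower().split())
    return [word for word in words if word]


def cosine_similarity(a, b):
    # Cosine similarity between two bag-of-words count vectors.
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(query, captions):
    # captions: dict mapping an image file name to its generated caption.
    scored = [(cosine_similarity(query, caption), name)
              for name, caption in captions.items()]
    # Highest similarity first.
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical captions, as an image captioner might produce them.
captions = {
    "dog.jpg": "a dog runs across a grassy field",
    "beach.jpg": "two children play on a sandy beach",
    "city.jpg": "cars drive down a busy city street",
}
ranking = rank_images("dog running in a field", captions)
print(ranking[0])
```

Word counts ignore synonyms and word order, so in practice you'd likely compare learned text embeddings instead. The ranking logic stays the same.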

In a sense - Image Captioning can be used to explain vision models and their findings.

The major hurdle is that you need caption data. For highly-specialized use cases, you probably won't have access to this data. For instance, in our Breast Cancer project, there were no comments associated with a diagnosis, and we're not particularly qualified to write captions ourselves. Captioning images takes time. Lots of it. Many big datasets that have captions have crowdsourced them, and in most cases, multiple captions are applied to a single image, since different people describe the same image in different ways. As the use cases of image captioning and descriptions become clearer, more datasets are springing up, but this is still a relatively young field, with more datasets yet to come.

Even today, there are great, large-scale datasets that you can train image captioners on. Some of them include Flickr's compilations, known as Flickr8K and Flickr30K, and MS COCO.

MS COCO is large - and contains other metadata that allows us to create object recognition systems with bounding boxes. We'll be using MS COCO in a later project on object recognition and will opt for a different dataset for this one because of that.

MS COCO is standardized and won't require many preprocessing steps to get the caption-image relationships down. We'll purposefully work with a dataset that will require a bit more preprocessing to practice handling different formats and combining multi-file data (text in one file and images in a folder).
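To make the "text in one file and images in a folder" setup more concrete, here is a minimal sketch of pairing captions with image paths. The file layout is an assumption made for illustration: each line holds a file name and a caption separated by a tab, and the `images` folder name and the `load_captions` helper are hypothetical. The actual dataset's format may differ.

```python
import os
from collections import defaultdict


def load_captions(lines, image_dir):
    # Map each image path to the list of captions written for it.
    captions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed format: "<file name>\t<caption>"
        file_name, caption = line.split("\t", 1)
        captions[os.path.join(image_dir, file_name)].append(caption.strip())
    return dict(captions)


# Hypothetical contents of a captions file, one entry per line.
sample_lines = [
    "001.jpg\tA dog runs across a field.",
    "001.jpg\tA brown dog is running on grass.",
    "002.jpg\tTwo children play on a beach.",
    "",
]
pairs = load_captions(sample_lines, "images")
for path, image_captions in pairs.items():
    print(path, len(image_captions))
```

Grouping by image path keeps all captions for an image together, which matters when splitting into training and validation sets: you don't want the same image to end up on both sides.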

So, how do we frame image captioning? Most consider it an example of generative deep learning, because we're teaching a network to generate descriptions. However, I like to look at it as an instance of neural machine translation - we're translating the visual features of an image into words. Through translation, we're generating a new representation of that image, rather than just generating new meaning. Viewing it as translation, and only by extension generation, casts the task in a different light, and makes it a bit more intuitive. When you experience something visually - it's hard to really put it into words, and a lot of the "magic" of the moment is lost. We translate our experience into a different format that can be conveyed to someone else, and they generate a sort of experience based on our prompts. This is actually the other side of the coin - image generation from textual prompts! Recently, projects like DALL·E have been making waves by creating amazing visual representations from textual prompts.

Recently, a Twitter user shared a generated image of Master Yoda, robbing a store, caught on a CCTV camera:

Similar examples include Gandalf wrestling John Cena and Peppa the Pig boxing professional athletes. This is also, in a way, translation of an input prompt into visual features, and only by extension is a form of generation.

While it's plain funny to see a character in a situation you wouldn't expect them to be in - prompt-to-image translation can actually have a lot of implications for the way we communicate.

"Nevermind, you had to be there."

- You, after talking about something funny that happened, which didn't end up being so funny when explained through words.

We experience something and lose much in translation into words. Some are exceptional in their ability to stoke your imagination with words, and poets and other authors have been rightfully regarded as artists because of this ability. Since image captioning and prompt-to-image generation are two ends of the same translation process - could we train a network to turn images to text and then that text back into images?

If the mapping can be fairly similar - you could share your experiences and memories more vividly than ever before. You could not only read about the fantastic adventures of Bilbo Baggins, but also experience them visually. While the images generated from your explanations would fall short of your subjective experience, they could usher in a new age of digital communication.

Both of these tasks are at the intersection of Computer Vision and Natural Language Processing - both being analogous to important faculties of our own perception.

Framing the problem as one of translation makes it easier to figure out which architecture we'll want to use. Encoder-only Transformers are great at understanding text (sentiment analysis, classification, etc.) because encoders encode meaningful representations. Decoder-only models (such as GPT-3) are great for generation, since decoders turn a meaningful representation into another sequence with the same meaning. Translation is typically done by an encoder-decoder architecture, where the encoder encodes a meaningful representation of a sentence (or image, in our case) and the decoder learns to turn that representation into another one that's more interpretable for us (such as a sentence).
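The data flow of an encoder-decoder captioner can be sketched without any deep learning framework. The toy below is not the model built in this project: `encode` and `decode_step` are hypothetical stand-ins for a CNN encoder and a Transformer decoder, and the "decoder" simply follows a fixed script. What it does show is the shape of the loop: encode the image once, then generate the caption one token at a time until an end token appears.

```python
START, END = "<start>", "<end>"


def encode(image):
    # Stand-in for a CNN encoder: reduce the image to a feature vector.
    # Here, that's just the mean of each row of pixels.
    return [sum(row) / len(row) for row in image]


def decode_step(features, tokens):
    # Stand-in for a Transformer decoder: choose the next token given the
    # encoded image and the tokens generated so far. A trained decoder would
    # output a probability distribution over a vocabulary; this toy version
    # follows a fixed script picked by the average brightness.
    brightness = sum(features) / len(features)
    middle = "bright" if brightness > 0.5 else "dark"
    script = ["a", middle, "image", END]
    return script[len(tokens) - 1]


def caption(image, max_length=10):
    features = encode(image)  # the image is encoded only once
    tokens = [START]
    while len(tokens) < max_length:
        next_token = decode_step(features, tokens)
        if next_token == END:
            break  # the decoder signalled the end of the caption
        tokens.append(next_token)
    return " ".join(tokens[1:])  # drop the start token


print(caption([[0.9, 0.8], [0.7, 1.0]]))
print(caption([[0.1, 0.2], [0.0, 0.3]]))
```

Always picking the single most likely next token is known as greedy decoding. It's the simplest decoding strategy, and real captioners often use alternatives such as beam search.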

In this guided project - you'll learn how to build an image captioning model, which accepts an image as input and produces a textual caption as the output.

Note: This Guided Project is part of our in-depth course on Practical Deep Learning for Computer Vision and assumes that you've read the previous lessons or have that prerequisite knowledge from before.

What is a Guided Project?

Turn Theory Into Practice

All great learning resources, books and courses teach you the holistic basics, or even intermediate concepts, and advise you to practice after that. As soon as you boot up your own project - the environment suddenly isn't as pristine as in the courses and books! Things go wrong, and it's often hard to pinpoint why.

StackAbuse Guided Projects are there to bridge the gap between theory and actual work. We'll respect your knowledge and intelligence, and assume you know the theory. Time to put it into practice.

When applicable, Guided Projects come with downloadable, reusable scripts that you can refer back to whenever required in your new day-to-day work.

Last Updated: Jul 2022

© 2013-2025 Stack Abuse. All rights reserved.
