In 1974, Ray Kurzweil founded Kurzweil Computer Products, which developed the "Kurzweil Reading Machine" - an omni-font OCR machine that read printed text out loud. The machine was built for the blind, who couldn't read visually but could now enjoy entire books being read to them without laborious conversion to braille. It opened doors that had long been closed for many. But what about images?
When making a diagnosis from X-ray images, doctors typically also document their findings, for example:
"The lungs are clear. The heart and pulmonary are normal. Mediastinal contours are normal. Pleural spaces are clear. No acute cardiopulmonary disease."
Websites that catalog images and offer search capabilities can benefit from generating captions for images and comparing their similarity to the search query. Virtual assistants could parse images as additional input to understand a user's intent before providing an answer.
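As a minimal sketch of that search scenario, the snippet below ranks images by comparing their (hypothetical) generated captions to a query using simple token-overlap (Jaccard) similarity. The captions and filenames here are invented for illustration; a real system would typically compare learned embeddings instead of raw tokens:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical captions produced by an image-captioning model.
captions = {
    "img1.jpg": "a black dog running on the beach",
    "img2.jpg": "a plate of pasta with tomato sauce",
    "img3.jpg": "two dogs playing in the sand",
}

def search(query: str, captions: dict) -> list:
    """Rank image filenames by caption similarity to the query."""
    return sorted(captions, key=lambda k: jaccard(captions[k], query), reverse=True)

print(search("dog on the beach", captions))
# → ['img1.jpg', 'img3.jpg', 'img2.jpg']
```

Token overlap is crude (it misses "dog" vs. "dogs", for instance), which is why production search systems compare captions and queries in a shared embedding space rather than by exact words.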
In a sense, image captioning can be used to explain vision models and their findings.
The major hurdle is that you need caption data. For highly specialized use cases, you probably won't have access to it. For instance, in our Breast Cancer project, there were no comments associated with a diagnosis, and we're not particularly qualified to write captions ourselves. Captioning images takes time - lots of it. Many large captioned datasets have crowdsourced their captions, and in most cases multiple captions are collected per image, since different people describe the same image in different ways. As the use cases for image captioning and description become clearer, more datasets are springing up, but this is still a relatively young field, with more datasets yet to come.