Introduction to Computer Vision
The Role of Vision
Take a moment to look around you. There's stuff all around you! If you're sitting at a desk, you're likely in front of a monitor and a keyboard - and maybe a hot (or cold) beverage of choice in a mug. If you're on public transport, there are pipes, shoes, people, seats.
How long did it take for you to infer what these objects are and how laborious was the process?
When you look down at your coffee mug, you're not at all surprised by the object in front of you. You've been able to do that since you were much younger than you are now. If you can read this - you've already trained yourself in visual pattern recognition and context recognition, stringing together words in a sentence to form a coherent message. Way before you could read, even as a baby, you could distinguish between a cat and a dog, as well as between a bird and a whale. Whether you're aware of the labels we've assigned to them ("dog", "chien", "собака", "Hund", etc.) doesn't change the fact that the features that make up a dog are different from the features that make up a whale.
Vision is one of the dominant senses in humans.
Many of our other senses, and more importantly our perception of our surroundings, depend on vision. "Seeing is believing" is a very common quote, implying that something doesn't exist if we can't see it. Naturally, we can't deny the existence of gravity or electrical forces just because we can't see them, so the quote is outdated. Additionally, when you see something while under the influence of psychedelics - it doesn't necessarily mean that it exists outside of, and regardless of, your own subjective experience. While arguments could be made against the semantics of the quote, such as that we see the effects of gravity and therefore need to see to believe, the role of sight is integral to our perception of the world around us.
How Do We See?
Note: This section is meant for a wider public. It condenses some neuroscience and computer science research and concepts into a digestible format, omitting some technical details and simplifying these vast fields, in order to paint a picture that you can turn into a foundation. The point is to give you a framework of thought that's useful when thinking about vision and perception, and to help scope what exactly it is that we're trying to solve with computer vision.
It's fairly well known how our eyes work and how the signal gets to our visual cortex, which makes sense of it. Modeling the eye, we've made extremely powerful and complex cameras, which work in much the same way.
While eyes have cones and rods that react to the different frequencies of visible light, cameras have arrays of light-responsive sensors (known as photodiodes) that do the same. Cameras are, quite literally, mechanical eyes and work using the same principles as our organs!
Where things become somewhat unclear is what happens after that. How are these "images" represented in our brains, and how exactly do we make sense of them? In recent years, amazing advancements have been made in the theory of the visual cortex (and the neocortex in general), but there's still a lot left to be discovered. In computers, images are represented as sets of binary digits, which can be decoded into a set of RGB values to be displayed on a screen. This is partly due to the way sensors and monitors work - recording the spatial intensity of light on the sensors and displaying that same information on a monitor.
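Concretely, here's a minimal sketch of that decoding step - grouping a flat byte buffer into (R, G, B) pixels. The 2x2 image and its byte values are made up for illustration:

```python
def decode_rgb(buffer, width, height):
    """Group a flat byte buffer into rows of (R, G, B) pixel tuples."""
    assert len(buffer) == width * height * 3, "buffer size must match dimensions"
    pixels = []
    for y in range(height):
        row = []
        for x in range(width):
            i = (y * width + x) * 3  # 3 bytes per pixel
            row.append((buffer[i], buffer[i + 1], buffer[i + 2]))
        pixels.append(row)
    return pixels

# A 2x2 image: red, green, blue, and white pixels.
raw = bytes([255, 0, 0,   0, 255, 0,   0, 0, 255,   255, 255, 255])
image = decode_rgb(raw, width=2, height=2)
print(image[0][0])  # (255, 0, 0) - the top-left pixel is pure red
```

Real image formats add compression, color spaces and metadata on top of this, but at the bottom it's the same idea: bits in, pixel intensities out.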
There is no solidified scientific consensus on perception and consciousness, yet, there are useful frameworks of thought.
There are many promising theories and advances are being made with each passing year, but we still lack a unified theory of how information entering your eyes gives rise to your experience. However, tremendous strides have been made and some mechanisms of the brain have been understood well. How we perceive information can be decoupled from how we process information, and we know a thing or two about how the visual cortex uses the signals that come from our eyes, and some guesses as to how it can be decoded into our perception.
To solve computer vision, we need to solve vision (information processing and perception). That's no small task. While simplified, it helps to think of vision as a two-stage process - encoding and decoding. This concept is also one of the key concepts in computer science, and a driving factor behind various neural network architectures and systems, some of which will be covered later in the course.
Encoding is useful for understanding, as it captures the salient information in a signal and reshapes it into something more useful (whatever useful may mean for a certain goal).
Decoding is useful for explaining, as it takes the salient encoded information and turns it either into an approximation of the original signal, or any reshaped form of it.
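The two-stage framing above can be illustrated with a toy, hand-rolled encode/decode pair (not a neural network): encoding keeps the salient part of a signal in a compact form, and decoding reconstructs an approximation of the original.

```python
def encode(signal, factor=2):
    """Compress by averaging each group of neighbouring samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

def decode(code, factor=2):
    """Reconstruct an approximation by repeating each encoded value."""
    return [v for v in code for _ in range(factor)]

signal = [1.0, 1.0, 4.0, 4.0, 2.0, 2.0]
code = encode(signal)    # [1.0, 4.0, 2.0] - half the size, salient info kept
approx = decode(code)    # [1.0, 1.0, 4.0, 4.0, 2.0, 2.0] - lossless here
```

Neural encoder-decoder architectures do the same thing in spirit, except the compact representation and the reconstruction are learned rather than hand-written.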
In the context of deep learning, this concept is applicable in various fields. Some of these (that concern us for this course) are:
- Natural Language Processing (NLP) - encoder networks are great at understanding sentences, while decoder networks are great at generation.
- Computer Vision (CV) - encoding visual information is usually done via neural networks we call Convolutional Neural Networks (CNNs or ConvNets for short). Encoder-decoder architectures can be used for generative neural networks that project meaningful signals into images. We'll create an encoder-decoder network to describe images using text later in the course.
In the context of computer science, the concept is applicable to such a wide variety of applications that you're relying on encoding-decoding systems probably every second of your life.
The information we capture (encode) is different from what we perceive (decode).
If you choose to represent the state of the world around you as information - there's a slew of it. Electromagnetic waves of various frequencies fly around (and through) you at any given moment. Barring all other properties and constituent elements of physics - your eyes filter out only a small portion of the EMS (electromagnetic spectrum), known to us as visible light. The photon that bumps into a cat and enters your eye isn't qualitatively different from the photon that hit a wall, or a dog. They're informationally decoupled from the object they bumped into. Yet, when a photon enters your eye, it activates the optic nerve (which, again, transfers information in the form of an electric signal, not the photon itself), leading into the visual area of the cerebral cortex (commonly shortened to the visual cortex) - where meaningful representations of that information are encoded, even through several layers of abstraction in serialization.
What happens after this is debatable, but an increasing number of neuroscientists believe that the brain makes a Bayesian best guess at what explains the incoming signal, and projects that prediction into the "3D world" that we see. This inside-out, rather than outside-in, model might be unintuitive for some, but it explains many phenomena we couldn't explain otherwise. Some, such as Anil Seth, a professor of Cognitive and Computational Neuroscience at the University of Sussex, go as far as to call our perception a controlled hallucination. While the name might make it sound more esoteric than it really is - the idea is that what we perceive is just what we project as a prediction of what the world must be to explain the inputs, in a controlled manner. This would be useful evolutionarily, since it allows us to escape predators, understand scenes around us, coordinate ourselves, etc.
The inside-out model also allows us to frame computer vision in a different light. Those only concerned with encoding intuitively think of vision as an outside-in process since you obtain information from the 'outside' and make conclusions 'inside'. Vision is more than that.
Whether this theory is right or not doesn't change the fact that thinking in this framework helps scope computer vision. We typically use images obtained through cameras (mechanical eyes), use a neural network to encode information (one part of the visual cortex), and another neural network to decode that information - whether into a class for image classification, a caption for the image, a segmentation mask, bounding boxes for objects, or back into another representation of the image.
We'll be building deep learning systems in this course that classify images, feed into decoders to generate text, detect objects and their locations, semantically segment images, and so on. Throughout all of these, the concept of encoding information will always be present, while the way we decode it differs based on the task:
- Object detection
- Caption generation
- Pose estimation
- Image generation
We'll form a more in-depth view into what these entail later in this lesson.
So far, we've figured out the encoding part better than the decoding part, both in deep learning and in neuroscience. Some of the latest emerging advancements are concerned in large part with decoding (VAEs, GANs, diffusion models, etc.).
The first big advancement in utilizing deep learning for computer vision came when Kunihiko Fukushima computationally modeled the way (the encoding part of) the visual cortex works, back in 1979, dubbed the Neocognitron. We're still riding the avalanche started by that paper, but we've also (recently) started exploring some other models of feature extraction.
We'll come back to the Neocognitron later.
Computer Vision and Deep Learning
This segues us into Computer Vision! Computer Vision is a very large field, with a plethora of sub-fields. It was a notoriously difficult field to make advances in, especially before deep learning. Computer vision is difficult, and exciting.
How do you automate the process of object detection, classification, segmentation, etc.?
Humans are exceedingly good at spatial awareness, even when standing still. You're aware of whether an object is closer or further away from you, as well as where it's located with respect to other objects. How exactly do you code this? Even if you had all the time in the world to make a rule-based system in which a straight line, with another straight line, that has a curvature on the side is a "mug" - you'd have a hard time thinking of all the possibilities and edge cases. In computer vision, rule-based systems were ruled out fairly quickly as a concept.
As is often the case, we reasoned by analogy. We found what works well and tried modeling it.
From 1958 to 1968, David H. Hubel and Torsten N. Wiesel performed experiments with cats and monkeys, examining how their visual cortex works. Amongst other findings, their experiments reported that for both cats and monkeys, neurons in the visual cortex have a small local receptive field.
This underlies how our neurons encode visual input. Some neurons fired only some of the time, on certain small features. Some neurons fired only at straight lines, while others fired only at curved lines. Throughout the experiments, they noticed that some neurons have a larger receptive field, and that these neurons typically reacted to more complex patterns - patterns built up from the simpler patterns that other neurons fired for. These were, at the time, dubbed "simple cells" and "complex cells" in the absence of a better term.
While the visual cortex isn't as simple and clear cut as "Neuron 1 fires only on horizontal lines, and should thus be named the horizontal neuron" - this discovery was crucial.
The visual cortex, like much of the rest of the neocortex, works with hierarchies: low-level features (straight or curved lines) feed into higher-level features (edges), which feed into yet higher-level features, until you know how to discern a cat from a dog or a whale, or the letter "A" from the digit "4", even though their constituent features are shared.
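You can think of a "simple cell" as something like a small filter slid across the visual field - which is exactly what a convolution kernel is. Here's a toy sketch (hand-coded values, whereas real CNNs learn their kernels) of a kernel that responds strongly to horizontal lines:

```python
def correlate(image, kernel):
    """Valid 2-D cross-correlation of a small grayscale image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A kernel that "fires" on horizontal lines: bright center row, dark neighbors.
horizontal = [[-1, -1, -1],
              [ 2,  2,  2],
              [-1, -1, -1]]

# A 5x5 image with a single bright horizontal line in the middle.
img = [[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]]

response = correlate(img, horizontal)
print(response[1])  # [6, 6, 6] - strong response where the line crosses the center
print(response[0])  # [-3, -3, -3] - suppressed elsewhere
```

Stacking layers of such filters, where the outputs of simple detectors feed into detectors of more complex patterns, is the hierarchy idea in miniature.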
This was something to go from! And in 1979 Kunihiko Fukushima published the "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position". The architecture built on top of the idea of S-Cells (simple cells) and C-Cells (complex cells), where simple cells feed into complex cells, creating a hierarchy. It was a self-organizing system, into which you'd feed representative images. This was a precursor to modern Convolutional Neural Networks (CNNs), which have proliferated and advanced the field from recognizing basic shapes and patterns to superhuman ability in recognizing objects and localizing them in images and even live videos.
What was started by Fukushima was extended by Yann LeCun, one of the "godfathers" of AI, in 1998 via "Gradient-Based Learning Applied to Document Recognition", in which raw input is passed through a feature extractor and the resulting feature vector is passed through a trainable classifier module that determines the class of the input.
If you've worked with CNNs before - this will sound very familiar.
At this point, it's worth taking a step back and realizing that CNNs, like all models, are just models of how the visual cortex works. Just as a digital calculator and an analog calculator can give the same results while working towards that goal in inherently different ways - CNNs are not a miniature visual cortex. They're a miniature visual-cortex-inspired architecture that yields similar performance.
Alternatives to CNNs?
In general, there are two schools of thought:
- We should reason by analogy
- We should reason from first principles
Both methods work, most of the time. When it's unclear which first principles to start from, we might as well reason by analogy to establish a field, and then re-think the process from the ground up with newfound first principles. This is exactly what happened with computer vision.
Our first attempts at solving computer vision were trying to directly model the visual cortex. Recently, a totally different architecture was proposed, built on the attention mechanism, and known as the Transformer. It was proposed in 2017 by the Google Brain and Google Research teams in "Attention Is All You Need". A lot of attention is being paid to the promise of the new architecture, and the field is divided - it's very unclear which type of architecture holds the most potential down the line.
Ironically - Vision Transformers took cues from the hierarchical architectures of CNNs, which propelled them upwards in leaderboards! Before that - they weren't very flexible.
For a while, it seemed that transformers were taking the lead. On a website such as PapersWithCode, you can see an up-to-date leaderboard of models and the architectures they're built on. In 2022, at a point when transformers sat at the top of the leaderboard, ConvNeXt, a "ConvNet for the 2020s", was released, humbly reminding everyone that there's still much to discover - such as porting ideas from transformers to CNNs (closing the loop of 'inspiration').
Recently - people have been making combinations of CNNs and Transformers! Nobody knows what the future holds, but it's exciting. It's a great idea to keep track of the leaderboard from time to time - it changes very frequently!
Computer Vision Tasks and Applications
Computer Vision is a vast field and can be applied to a wide variety of problems. We can automate a lot of the actions we take, such as pulling a lever, clicking a button, sounding an alarm, etc., but it used to be very hard to automate the decision process that leads to those actions.
Clicking a button is simple! Recognizing a bear near the campsite, for which you'll want to sound an alarm, didn't use to be simple for computers at all. While many "box" computer vision into mainly image classification (which is, in fact, a big part of it, in different flavors), it's much more than just that. A general rule of thumb is:
Can you see that something is different from something else?
If yes - you can apply computer vision to it. If not - you can still probably apply computer vision to it. A popular example is sound! If you join a competition to recognize the sounds of animals, such as birds - you'll find that it's easier to convert the sound into a waveform plot or, better yet, a spectrogram, and classify those using computer vision, instead of analyzing patterns in the raw audio sequences themselves.
Here's an example from a Kaggle Notebook written by Shreya, for the BirdCLEF competition:
Shreya's Visualization of bird sounds as a spectrogram
This is what an Ash-throated Flycatcher's ("astfly") sounds look like! Depending on the type of spectrogram you use, the images can very clearly delineate one species of bird from another. A similar technique was used by Ray Kurzweil's company, which tried to tackle understanding human speech back in the 1980s. They realized that it's easier to plot the frequencies and classify phonemes based on them, rather than analyzing sequences with the manually-coded expert systems of that time. In "How to Create a Mind", Kurzweil reflects on the technique:
A spectrogram of a person saying the word 'hide', as taken from Ray Kurzweil's book - "How to Create a Mind"
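The waveform-to-spectrogram conversion can be sketched with just the standard library - slice the waveform into frames and take DFT magnitudes per frame. This is a naive sketch; real pipelines add windowing, overlapping frames and mel scaling, but the idea of turning sound into an image a vision model can classify is the same:

```python
import cmath
import math

def spectrogram(samples, frame_size=64):
    """Slice a waveform into frames and take DFT magnitudes per frame."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    spec = []
    for frame in frames:
        # Naive DFT; keep only the first half of the bins
        # (a real-valued signal's spectrum is symmetric).
        mags = [abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame)))
                for k in range(frame_size // 2)]
        spec.append(mags)
    return spec  # one row of frequency magnitudes per time frame

# A pure tone at 8 cycles per frame: energy concentrates in frequency bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(128)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin)  # 8
```

Plot `spec` as a grayscale grid (time on one axis, frequency on the other) and you have an image ready for an ordinary image classifier.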
Computer vision can also be applied to unlikely domains such as malware detection, as demonstrated by a number of researchers, such as Mahmoud Kalash, Mrigank Rochan et al. in their paper titled "Malware Classification with Deep Convolutional Neural Networks", Joshua Saxe et al. in their paper titled "Deep neural network based malware detection using two dimensional binary program features", etc. If you turn a malware binary into 8-bit vectors, and then turn those vectors into pixel intensities - you can plot images of the binary. Now, these images don't really mean much to us, but distinct patterns occur in malware that don't appear in regular software.
Figure from "Malware classification with deep convolutional neural networks" by Mahmoud Kalash, Mrigank Rochan, et al.
Kalash and Rochan report 98.52% and 99.97% accuracy, while many of the other papers report 95%+ accuracy rates on the relevant malware detection benchmarks.
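The bytes-to-image trick can be sketched in a few lines: treat each byte as a grayscale pixel intensity and lay the stream out as a fixed-width grid. The "binary" here is a made-up byte string for illustration, not real malware:

```python
def bytes_to_image(blob, width=8):
    """Pad the byte stream and reshape it into rows of `width` pixels."""
    padded = blob + bytes((-len(blob)) % width)  # zero-pad the last row
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

blob = bytes(range(20))          # stand-in for a program's raw bytes
img = bytes_to_image(blob, width=8)
# 20 bytes at width 8 -> 3 rows, the last one zero-padded:
print(img[2])  # [16, 17, 18, 19, 0, 0, 0, 0]
```

The resulting grid can be saved as a grayscale image and fed to any image classifier, exactly as if it were a photo.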
Data visualization is an art in and of itself - and you can plot just about anything you have data for. If you can plot it, you can probably apply computer vision to it. In a sense, computer vision can be applied to practically all data - not just images. Images are just pixel intensity data anyway, so applying computer vision to images is itself a special case of applying it to data in general. Let's take a look at some popular tasks and applications of computer vision!
Optical Character Recognition
Optical character recognition was one of the first applications of computer vision! It held the promise of digitizing old issues of newspapers, books, and other literature in the process of switching from physical and analog media to digital. Additionally, a good omni-font optical character recognition tool could help automate reading.
While Dr. Edmund Fournier d'Albe attempted to create a reading machine as early as 1913 - the machine was ultimately only able to read a word a minute, and could only recognize a few letters in a very specific font. In 1974, Ray Kurzweil's company developed an omni-font OCR machine that read text out loud, dubbed the "Kurzweil Reading Machine". It read 150 words per minute, while the average human reads around 250.
At the time, this was a revolutionary invention, the size of an entire table. Now, you can quickly download OCR applications that decode images into text in practically all of the world's languages, and offer translation services on the spot. Google Translate can even read text from your camera in real time and replace the text on the screen with an augmented, translated version. This works best on street signs, which have a solid background (which can easily be extended over the original text), by placing the recognized and translated text over it:
Real-Time OCR via Google Translate, translating from English to Japanese
It worked! "こんにちは世界" ("Hello, World") is correct, and it fully understood what I wrote! Applications like these have major implications for travel, human communication and understanding. While this is a hybrid application between computer vision and natural language processing - you have to understand what's in front of you to be able to translate it.
Image Classification
Image classification is the staple task and a very wide blanket. Image classification can further be applied to various domains:
- Medical Diagnosis
- Manufacturing defect detection
- Alarm systems (burglar detection)
- Malware detection
In medical diagnosis, classifiers can learn to distinguish between images containing cancerous and non-cancerous cells, detect diabetic retinopathy from retina scans, pneumonia from X-rays, etc. In manufacturing, classifiers can identify defective products, such as bottles without caps, broken toys, health code violations, etc. Coupled with CCTV cameras, alarm systems can recognize the presence of a human after working hours, or even distinguish a burglar from an employee and sound an alarm.
One major limitation of image classification is the implication that an image belongs to a class:
In each of these images, there's more going on than a single label! Typically, we intuitively know that something in an image belongs to a class, not the entirety of the image. Although an image classifier can be very powerful and find patterns that divide classes, for more realistic applications, we'll want to localize objects, and perhaps even predict multiple labels for these objects. For example, a shirt can be blue, yellow, green, black, or any of various other colors. It can also be a long-sleeve shirt or a short-sleeve shirt. It can have X and Y or M and N. Predicting multiple applicable (non-exclusive) labels is known as multi-label image classification (as opposed to single-label classification), and the technique broadened the applicability of these systems.
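The single-label/multi-label distinction comes down to how the network's output scores are interpreted. A softmax forces labels to compete (one winner), while independent sigmoids let every applicable label fire. The label names and raw scores below are made up for illustration:

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (labels compete)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(s):
    """Independent yes/no probability per label."""
    return 1 / (1 + math.exp(-s))

labels = ["shirt", "long-sleeve", "blue"]
scores = [3.0, 2.5, 2.8]  # hypothetical raw network outputs (logits)

# Single-label: exactly one winner, even though several labels apply.
single = labels[max(range(len(scores)), key=lambda i: scores[i])]

# Multi-label: each label is an independent yes/no decision.
multi = [l for l, s in zip(labels, scores) if sigmoid(s) > 0.5]

print(single)  # shirt
print(multi)   # ['shirt', 'long-sleeve', 'blue']
```

Swapping the output activation (and the matching loss function) is essentially all it takes to move a classifier from single-label to multi-label.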
Additionally, you can classify an image to a label, and then localize what made the prediction go off. This is known as "classification + localization". However - in more advanced systems, we perform object recognition/detection, not classification.
Object and Face Detection
Object detection includes recognizing and classifying an object in an image, regardless of the rest of the image. For example, you can classify an entire image as a "mug" if there's a mug in it. Or, you can classify a mug as a "mug" within an image. The latter is much more accurate and realistic!
Now, this may sound like classification + localization with extra steps, but it isn't. We're not classifying an image and localizing what made the prediction go off. We're detecting an object within an image.
Object detection, Lesson 9
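To get a feel for how detections are evaluated: a predicted bounding box is scored against the ground truth with Intersection over Union (IoU). The coordinates below are arbitrary example values:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area (0 if disjoint)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

predicted = (0, 0, 10, 10)
truth = (5, 5, 15, 15)
print(iou(predicted, truth))  # 25 / 175 ≈ 0.143 - a poor detection
```

A detection typically counts as correct when its IoU with the ground-truth box exceeds some threshold (0.5 is a common choice).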
We'll cover YOLOv5, Detectron2, etc. in more depth in a later lesson, and train our own object detection network!
More and more, object detection is being applied to recognize objects from input. This can be corals at the bottom of the sea from a live feed camera (helping marine biologists preserve endangered species), pothole detection (helping humans and cars avoid them), weapon detection (helping keep venues safe), pedestrian detection (helping keep urban planning and traffic safe), etc.
Image classification is not being replaced by object detection - they're used for different tasks. Image classification used to be used for tasks that object detection is now better at, but this is a branching of tasks - you'll train both classifiers and detectors as a computer vision engineer. Image classification is easier to explain and implement, so it's typically the first technique covered in most learning resources.
In a similar vein - face recognition technology is essentially object detection! First, the system has to detect a face and its position in the image, and then compare the embeddings of that face to a recorded set of embeddings belonging to individuals. Meta's face recognition algorithms used to offer to tag people automatically in images, but the feature was removed due to privacy concerns over a single company having access to billions of annotated images that can be quickly and easily linked to individuals. Naturally, privacy concerns haven't only been raised against Meta - any form of identifying information is generally subject to privacy concerns.
Gallery applications on phones can detect and recognize selfies of you and your friends, creating folders that you can easily access. Face recognition can be used to perform visual identification (say, an employee's access to a certain part of a building), for law enforcement (could be a bit Orwellian, perhaps), and a plethora of other applications.
Image Segmentation
When you look in front of yourself and see an object - you're also aware of where it starts and where it ends. You're aware of the boundaries that define an instance of some object, and that when a "mug" ends, there might be some empty space until a "notebook" object, and that they're both resting on a "table".
This is segmentation! You're segmenting a mug from a notebook. Segmentation comes in a couple of flavors:
- Semantic segmentation
- Instance segmentation
- Panoptic segmentation (combination of Semantic and Instance segmentation)
Semantic segmentation involves segmenting which pixels belong to which class in an image. Instance segmentation involves segmenting which instances belong to which class in an image. The difference is similar to the difference between image classification and object detection. With semantic segmentation - it doesn't matter which chair is which, all that's important is that something belongs to the "chair" class. Instance segmentation will note the qualitative difference between each instance of a "chair". "This chair" and "that chair" aren't the same chair, even though they're next to each other.
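The distinction can be made concrete on a toy 3x4 pixel grid. In the instance mask, the two chairs get distinct ids; collapsing ids back to classes yields the semantic mask. The labels and layout here are made up:

```python
# Map each instance id to its class.
CLASS_OF = {0: "background", 1: "chair", 2: "chair"}

# Each pixel holds an instance id: two separate chairs on a background.
instance_mask = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]

# Semantic segmentation only keeps the class - "this is chair-stuff".
semantic_mask = [[CLASS_OF[i] for i in row] for row in instance_mask]

print(semantic_mask[0])  # ['chair', 'chair', 'background', 'chair']
# Instance segmentation still knows these are two different chairs:
print(instance_mask[0][0] != instance_mask[0][3])  # True
```

Going from the instance mask to the semantic mask loses information; you can't go the other way, which is why instance segmentation is the harder task.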
You can think of image classification, classification + localization, object detection, semantic segmentation and instance segmentation as levels of granularity you can apply to an image. Segmentation is, in a sense, pixel-level classification, while image classification classifies an entire image (all of its pixels) based on something (potentially) in the center of the image.
Later in the course, we'll be building an aerial drone semantic segmentation model.
Pose Estimation
People move in predictable ways. If something's predictable, you can bet that someone's trying to make a model that predicts or estimates some value regarding it. Whether you want to predict someone's pose slightly into the future, or quantify their presence - pose estimation is the process of estimating the spatial locations of key body joints, such as elbows, heads, knees, feet and hands.
Xbox and PlayStation used pose estimation for their Kinect and EyeToy products (both of which are now, sadly, retired) which allowed the consoles to track the movement of players and augment them into games. If you've had the chance to play with some of these, it's probably an experience you'll remember - at least I know I'll remember mine.
Pose estimation can be used to entirely remove expensive tracking equipment from film-making and 3D movement tracking (although it's not quite there yet). If you no longer need a tracking suit to create CGI movies - recreating scenes in 3D will be significantly easier! Metaverse-type worlds, where your movement is translated directly into the Euclidean space of another world, will become more accessible (requiring just a simple camera worth a few dollars). In recent years, a whole new genre of online entertainment arose - VTubers (Virtual YouTubers), who use pose estimation from webcams to translate their movement onto virtual bodies (typically 2D) as a replacement for their own presence on streams. Personal identity and expression, while retaining privacy, can be redefined with such technology.
Finally, pose estimation can be transferred to robotics. It's conceivable that we could attempt training robots to perform human-like movements using training data of our own movement (quantified through pose estimation), or manually control them through our own movement, similar to how we'd control 3D models in a virtually augmented world.
If you own the physical copy of this course, you can view the Gif here.
Motion Analysis
Motion analysis builds upon object detection, motion detection, tracking and pose estimation. In human brains, the cerebellum performs adaptive prediction of movement, and it's pretty effortless for us. A ball falling downward will reach our hand in an approximately predictable manner. A soccer player uses this knowledge to kick a ball into the goal, and a goalkeeper uses this same knowledge to predict where the ball will be in a short while and try to intercept it.
It's oftentimes understated how important this predictive ability is! Motion analysis is, in a sense, a blanket term for the multiple abilities that make such predictions possible.
Image Restoration and De-noising
Physical images suffer more than digital ones through time, but both physical images and digital ones can degrade if not protected properly. Hard drives can become faulty, electromagnetic waves can introduce noise, and physical images can oxidize, fade out, and get exposed to environments unfriendly to the printing paper.
Computer vision can be applied to restore, colorize and de-noise images! Typically, motion blur, noise, issues with camera focus, and physical damage to scanned images can be quite successfully removed. There are many papers dealing with image restoration (see the PapersWithCode task page), and the results are getting better by the day - here are the results of "Learning Deep CNN Denoiser Prior for Image Restoration" by Kai Zhang et al.:
"Learning Deep CNN Denoiser Prior for Image Restoration" by Kai Zhang, Wangment Zuo, Shuhang Gu and Lei Zhang
Soon enough, the movie scenes in which the secret intelligence agents "pause, rewind, enhance" a blurry image to a full HD one might not be too far away from reality!
Scene Reconstruction
Scene reconstruction is a fairly new and extremely exciting application of computer vision! We can construct a scene in our minds, given an image. When talking with someone, you know that they're not a flat, 2D piece of paper, and that if you were to circle around them, you'd find the back of their head. You're effectively constructing a 3D model of the world around you all the time - so, can we do this with code?
Turns out, we can - at least to a degree, as of writing. Typically, this involves taking images from a couple of angles, from which a 3D structure of an object can be inferred. Then, a 3D mapping can be created. Some of the most amazing visuals were created by the Google Research team, and can be found at nerf-w.
Research has also been conducted in limiting the input to a single angle (single viewpoint), and while the methods are still new, they're promising.
Image Captioning, Image Generation, Visual Questions and Generative Models
Increasingly, vision is being combined with language. Another task that's relatively easy for us but (until recently) hard for computers is image captioning. Given an image, you could describe what's going on if the image has enough context - can computers? Other than captioning, given a description, we can imagine an image - can computers? If I asked you to describe a certain part of an image, or answer a question regarding its contents, you could - can computers?
This involves more than just vision - it involves a lot of context from natural language processing, as well as deep visual understanding. Up until recently, this was a far-fetched task, but with each passing year there's rapid progress in the field, with some amazing results being released in 2022, as of writing.
For instance, DAMO Academy and the Alibaba Group released "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework", in which their model (OFA) handles these exact tasks! It's got a ResNet backbone for computer vision (a popular CNN architecture, covered in detail and used later in the course) and unifies it with other deep learning architectures to generate text and images based on the task at hand:
"Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework"
Needless to say - this is an exceedingly difficult goal, and reviewing the work in a compact manner doesn't do justice to its effort and scope. Text-to-image conversion (which is really a sequence-to-sequence conversion) has gotten a lot of attention in recent years and months! A great example is the viral "Dream" application, developed by WOMBO. It allowed users to input seed text, from which lucid images were created. A user could pick between several "styles" to choose from, such as "vibrant", "fantasy", "steampunk", "psychic", etc. Within seconds, with a real-time view of the patterns being brushed onto the screen, the user would be presented with a synthetically generated image, created from the text they'd input.
The code isn't open source, so we can only speculate how the team achieved this, but it appears as if they've performed Neural Style Transfer (transferring the style of an image to another one, typically used to convert images to another style, such as turning a regular photo into a Klimt-esque image), mixed with a text-to-image generator. What gave the application popularity is how mysterious and deeply beautiful the generated images were. They stirred emotions - fear, happiness, inquisitiveness, wonder. Truly, that's what art is about, isn't it?
Here are a few images, generated from the same prompt - "robot dreaming of electric sheep". While none of these feature a robot with a dream bubble showing electric sheep, you can clearly see that both the "robot" and "sheep" patterns appear throughout several styles:
Images generated by the Dream Application
There's something deeply beautiful about these images, even though they could be touted as simple pattern recall with a bit of style. There's beauty in patterns, and pattern recall. We'll take a look at a great example of another type of image generation later, known as the Deep Dream algorithm, which embeds prior belief of patterns into an image, resulting in hallucinogenic views, and implement it.
Aleksa Gordić, a DeepMind research engineer, implemented the Deep Dream algorithm and has produced some of the most beautiful images I've ever seen with it:
Results of Aleksa Gordić's Implementation of Deep Dream
We'll be doing our own implementation as well:
We won't dwell too much on the details of the implementation or how it works now - that's for a later lesson! If you're new to this - there's a bit of walking to do before running. Don't worry though, the road is pretty well-made and maintained!
Recently, DALL·E 2, created by OpenAI, has been making waves. It was released alongside "Hierarchical Text-Conditional Image Generation with CLIP Latents" and can create extremely well-crafted images from textual descriptions. On their website (openai.com/dall-e-2), you can use their interactive demonstration to generate a few images by simply pointing and clicking:
Under the hood, it encodes the prompt you supply, maps that encoding to an image encoding, and adds it to an image of noise (random pixels), which is then used by a decoder to generate an image. One of the most interesting things about the generated images is how plausible they are! While you probably won't ever see an astronaut riding a horse in space, it places the astronaut in the position they would be in if they were riding a horse. If you request an image of fruit with hands - the hands are positioned in such a way as to resemble what most would imagine fruit with hands to look like. This doesn't always work, though.
Update 1: Since writing this lesson initially, Google released Imagen and Parti - two state-of-the-art caption-to-image models - just weeks apart. Shortly after that, StabilityAI released Stable Diffusion - a caption-to-image model that leverages the idea of diffusion models, like DALL·E 2. Diffusion models, put shortly, learn to de-noise input step by step, going from Gaussian noise to an image. Stable Diffusion, like DALL·E 2, uses a Contrastive Language–Image Pre-training (CLIP) network as the text encoder, which outputs a latent representation of the input prompt (Imagen, by contrast, uses a frozen text-only encoder).
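To make the de-noising idea concrete, here's a toy sketch of the reverse (de-noising) loop in plain NumPy. Everything here is illustrative - a real diffusion model replaces the stand-in noise predictor with a trained neural network conditioned on a text embedding, and uses a carefully derived noise schedule rather than a fixed step size:

```python
import numpy as np

rng = np.random.default_rng(42)

def predict_noise(noisy_image, step):
    # Stand-in for the trained de-noiser. Here we pretend the "clean"
    # image is all zeros, so the predicted noise is the current content.
    return noisy_image

def denoise(shape=(8, 8), steps=50):
    x = rng.normal(size=shape)        # start from pure Gaussian noise
    for step in range(steps):
        eps = predict_noise(x, step)  # the model predicts the noise...
        x = x - 0.1 * eps             # ...and we remove a little of it
    return x

image = denoise()
# After 50 small de-noising steps, the result is close to the "clean" image
```

The key intuition survives the simplification: generation is a loop that repeatedly subtracts predicted noise, not a single forward pass.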
Since the release, Twitter feeds are full of images generated by DALL·E 2, Stable Diffusion and comparisons of the results for the same prompts.
Update 2: In August of 2022, StabilityAI released the source code for Stable Diffusion, and hosted both an inference API and a diffusers library on HuggingFace - a machine learning platform, community, model zoo and organization. The diffusers API is beautifully simple, in true HuggingFace fashion, and you can download and run it with as little as:
$ pip install diffusers==0.2.4 transformers scipy ftfy
from diffusers import StableDiffusionPipeline

# Get your token at https://huggingface.co/settings/tokens
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=YOUR_TOKEN)

prompt = "a photograph of an astronaut riding a horse"
# The pipeline returns a dict - "sample" holds the list of generated images
image = pipe(prompt)["sample"][0]
Due to the obvious potential misuse of high-resolution, photorealistic image generation for arbitrary prompts - you'll have to sign up on HuggingFace and generate an access token to use the StableDiffusionPipeline. A great notebook to start with is hosted on Google Colab by HuggingFace!
Text-to-image generation is pretty "hot" right now, and the second half of 2022 is defined by new research that boggles the mind. It's exciting to see what will be released by the end of 2022, and what the upcoming years will bring!
Note: Whenever a model is released with closed access, either to protect IP for a commercial product or for safety reasons - someone, somewhere, will implement an approximation of it and release it into the wild. An instance of this can be seen with DALL·E 2 and Imagen, both of which are closed to the wider public. Craiyon, formerly known as DALL-E mini, was quick to provide a mini-DALL·E implementation for the masses. Similarly, the diffusers library at HuggingFace is meant to democratize diffusion models for the public. Open source work is underappreciated, but propels the world forward by distributing knowledge, and thus the power to act, to everyone. You can learn more about diffusion models and the diffusers library in another HuggingFace Google Colab. If someone from HuggingFace is reading this - you rock! <3
Update 3: Merely two weeks after this course was initially published in September 2022 - KerasCV, covered in more detail later, has incorporated Stable Diffusion. SD was implemented in TensorFlow/Keras by Divam Gupta, a research engineer at Meta VR labs and the creator of Liner.AI. Just days later, it was incorporated into KerasCV, which as of September 2022 provides the fastest, most efficient and simplest pipeline for Stable Diffusion:
import keras_cv
from tensorflow import keras

# AMP (mixed precision) for faster inference
keras.mixed_precision.set_global_policy("mixed_float16")

# XLA on, via jit_compile=True
model = keras_cv.models.StableDiffusion(jit_compile=True, img_width=512, img_height=512)

# Generate images
images = model.text_to_image("photorealistic cute bunny teaching deep learning for computer vision", batch_size=3)
The only thing left is to plot them:
import matplotlib.pyplot as plt

for i in range(len(images)):
    ax = plt.subplot(1, len(images), i + 1)
    ax.imshow(images[i])
    ax.axis("off")
The ported implementation itself is only a few hundred lines long - 350 for the model and 100 for generation. It might be a bit difficult to follow if you're not versed in CV yet, though. Since a good portion of the required knowledge will be covered - you should be able to read it without much issue by the end of the course.
If you're running the code on Google Colab, remember to install KerasCV and update TensorFlow/cuDNN:
! pip install keras_cv
! pip install --upgrade tensorflow-gpu
! apt install --allow-change-held-packages libcudnn8=126.96.36.199-1+cuda11.2
It's hardly possible to encapsulate the entire landscape of research and development in a single lesson, especially for a very volatile field like this. Yet - the list of tasks above is a fairly comprehensive list of some of the main applications of computer vision that you'll be encountering.
Classic Computer Vision Datasets
There are various datasets you could start playing with! Some are standardized as classic benchmarks, while some are simply popular with practitioners. There are too many to count - counting datasets would be like counting people, since a 2-image dataset is still a dataset! This isn't meant to be a comprehensive "list of computer vision datasets" - it simply includes some of the noteworthy or fun ones often used online.
Dogs vs. Cats
Dogs vs. Cats is often used for teaching purposes, and can typically be fit very easily, to practically 100% accuracy.
Hot Dog - Not Hot Dog
Hot Dog - Not Hot Dog is a joke binary classification dataset, inspired by a TV show, that lets you train a classifier to classify hot dogs and not hot dogs (everything else). Everything is either a hot dog, or isn't!
CIFAR10 and CIFAR100
CIFAR10 and CIFAR100 are two datasets created by researchers from the Canadian Institute For Advanced Research. Both datasets have 60k images, 32x32 in size (fairly small). CIFAR10 has ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), while CIFAR100 has 100 classes. The 100 classes are organized into 20 coarse labels such as "food containers" each of which contains fine classes such as "bottles", "bowls", "cans", "cups" and "plates".
These datasets serve as a great starting point! CIFAR10 isn't too hard to fit, since the features that make up a horse are fairly different from the ones that make up a frog. It's not too hard to get a decent accuracy on it, even for beginners who are just starting out. CIFAR100 is much harder to fit (making you rethink your architecture and look for ways to understand it better), in part because it has 10 times the classes of CIFAR10 with the same number of images, dropping from 6k images per class to only 600 per class. Additionally, if you're doing fine-label prediction, the distinction between a cup and a can isn't too obvious. We'll talk more about these two datasets and train a classifier for each in the next lesson.
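To put that difficulty jump into concrete numbers - both datasets share the same total size, so the per-class counts follow directly:

```python
# Same 60k total images, spread over 10 vs. 100 classes
total_images = 60_000

cifar10_per_class = total_images // 10    # 6000 images per class
cifar100_per_class = total_images // 100  # 600 images per class

print(cifar10_per_class, cifar100_per_class)  # 6000 600
```

Ten times fewer examples per class, with classes that are visually much closer to one another - that's the whole story of why CIFAR100 is harder.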
ImageNet

ImageNet is the current flagship dataset for computer vision training and evaluation. It used to be the largest and most realistic dataset we had, but much larger datasets exist today, including datasets with billions of images. Some datasets are locked to company employees, while some are fully public. ImageNet is one of the largest public datasets, with 1000 classes spanning over 14.2M images in their original sizes - which, additionally, vary from image to image; they're not all uniformly sized.
Most models are trained and evaluated on and for ImageNet, and most research papers use it as the benchmark. If you make a breakthrough model - this is where you'll test it! It's worth noting just how huge of a dataset it is. 14.2M images might not sound like a lot, but as soon as you start working on computer vision problems, you'll realize that you're extremely lucky if you have a hundred thousand images (and that it's also already a difficult number to deal with computationally), and just lucky to have 10k. Whenever you're dealing with a pre-trained network in computer vision, you'll typically be able to load in ImageNet weights and transfer this knowledge to other datasets. We'll dive into transfer learning soon enough, in a later lesson, so we won't focus on that now. It's worth noting upfront that transfer learning is another large part of why computer vision has advanced so much in recent years, and some luminaries consider it to be one of (if not) the most important topics to cover in educational resources.
Imagenette

Imagenette is a subset of ImageNet. Jeremy Howard, the author of Fast.ai, created Imagenette so that he could test out prototypes of new networks faster. Training on ImageNet takes a lot of time, and his personal advice - making sure that tests can run within a couple of minutes for quick validation - couldn't be applied to such a huge dataset. Making experimentation and validation faster, and democratizing datasets like this, is a huge part of why AI researchers have been able to advance the field so rapidly in the years leading up to now.
MS COCO: Common Objects in Context

MS COCO (Common Objects in Context) is a fairly large dataset of, well, objects in context. Context is really important for any machine learning system, and for computer vision in particular. COCO has 330k images, 200k of which are labeled (labeling is extremely expensive). Labeled images have bounding boxes for one or multiple objects, as well as 5 captions per image, and the dataset features 1.5M object instances in total. While CIFAR10, CIFAR100 and ImageNet are aimed at classification (whether an image belongs to a class) - COCO can be used for various other applications such as object detection, caption generation and instance segmentation.
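To make "labeled" a bit more concrete - COCO ships its annotations as JSON, with images, annotations and categories cross-referenced by ID. Here's a minimal, hand-written excerpt in that format (the IDs and values are made up for illustration):

```python
# A tiny, illustrative excerpt in the COCO annotation format.
# Real annotation files follow this same structure, with many more entries.
coco_sample = {
    "images": [
        {"id": 1, "file_name": "000000001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,                       # points back to "images"
            "category_id": 18,                   # e.g. "dog" in the category list
            "bbox": [73.5, 42.0, 225.0, 310.0],  # [x, y, width, height] in pixels
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 18, "name": "dog", "supercategory": "animal"},
    ],
}

# Bounding boxes are [x, y, width, height], so the bottom-right corner is:
x, y, w, h = coco_sample["annotations"][0]["bbox"]
print(x + w, y + h)  # 298.5 352.0
```

One image can have many annotation entries - that's how COCO packs 1.5M object instances into 200k labeled images.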
The landscape changes through time. The best way to stay on top of the datasets you can use to train your models is to search for them. There's enough to go around, and it's oftentimes best to find a dataset in a specific niche you're interested in. For instance, you might be interested in applying computer vision to medical diagnosis (we'll cover an entire end-to-end project for breast cancer diagnosis later), self-driving cars, pose estimation for translating human movement into 3D, art generation, etc.
Depending on the specific thing you want to build - there's likely a dataset out there, ready for you to tackle. Some of them won't be pretty - they'll be full of misshapen data, broken or too few images, or, most importantly - missing or even wrong labels. This is currently one of the biggest hurdles to get over. Finding a good way to get labeled data is hard, and some are switching to unsupervised methods in an attempt to circumvent the issue. As of writing, MS COCO takes up ~50GB of storage space.
MNIST and Fashion-MNIST

The MNIST hand-written digits and Fashion-MNIST datasets are very well-known, simple datasets. They're mainly used for educational purposes - MNIST was created back when CNNs were still in their infancy, while Fashion-MNIST is a more recent drop-in replacement for it.
Google's Open Images
The Open Images dataset is a 9M image dataset with almost 16M bounding boxes on 600 categories, 2.8M instance segmentation masks on 350 categories, 3.3M relationship annotations for 1.5K relationships, 700k localized narratives and 60M image-level labels on 20k categories. This amount of meta-data allows us to create a wide variety of computer vision applications!
In sheer scale, it's comparable to ImageNet, but like MS COCO it provides bounding boxes, relationship annotations, instance segmentation, etc. Oh, and it's 565GB in size if you decide to download it locally.
Searching for Datasets
Having a list of some popular datasets is great. But - how do you search for them? You'll be doing a lot of searching, curation and creation in the future, so having at least a broad idea of places that offer high quality datasets is a great starting place!
Other than the platforms highlighted in this section - Google is your best friend.
Kaggle

Kaggle is one of the world's largest Data Science/ML platforms, with a thriving community. It offers over 50k datasets (in all domains of Data Science), typically created by regular users, but also by research teams, companies and institutes.
Kaggle is also known for holding competitions with very decent prizes, depending on the budgets of the companies and teams that approach them. This way - eager data scientists can work on the burning unsolved problems in the world, without strings attached, gain rewards for their work, and companies/research teams can crowd-source solutions to problems that can help the world (or increase profits).
At any given point, you'll find several competitions on Kaggle that last for a few months with prize pools reaching $50-75k (though, most give you "knowledge" and "swag"), and thousands upon thousands of teams enrolling and competing to produce the best models from medicine and medical diagnosis, to stock exchange predictions, image matching, identifying near-extinct species of animals and preserving them to identifying and localizing plants and animals from satellite images.
Kaggle has a CLI that allows you to programmatically download datasets (helping automate Data Science pipelines) and provides users with an environment in which to run notebooks free of charge, with a weekly quota for free GPU usage (the number of hours depends on availability). It's more than safe to say that Kaggle plays an important part in the proliferation, democratization and advancement of Data Science all around the world.
We'll be working with Kaggle CLI and Kaggle datasets later in the course.
HuggingFace

HuggingFace is a primarily NLP-based community, with some computer vision datasets. However, as noted in earlier sections, computer vision is being combined with NLP at an increasing rate. For visual question answering, image captioning, and similar tasks - you'll probably want to at least peruse HuggingFace.
While offering a "modest" 4.5K datasets as of writing, HuggingFace is gaining more and more traction and attention from the community, and it's worth having it on your radar for the days to come.
A large part of HuggingFace's philosophy is the democratization of knowledge, and it's a lovely community to be in, with open source implementations and models based on cutting-edge research, trained for weeks or months on clusters of GPUs, for the public.
TensorFlow Datasets

TensorFlow Datasets is a collection of datasets, curated and ready for training. All of the datasets from the module are standardized, so you don't have to bother with different preprocessing steps for every single dataset you're testing your models on. While it may sound like a simple convenience rather than a game-changer - if you train a lot of models, the time it takes to do overhead work gets beyond annoying. The library provides access to datasets from MNIST to Google Open Images (11MB - 565GB), spanning several categories such as Audio, D4rl, Graphs, Image, Image Classification, Object Detection, Question Answering, Ranking, Rlds, Robomimic, Robotics, Text, Time Series, Text Simplification, Vision Language, Video, Translate, etc.
As of 2022, 278 datasets are available and community datasets are supported, with over 700 HuggingFace datasets and the Kubric dataset generator. If you're building a general intelligent system, there's a very good chance there's a public dataset there. For all other purposes - you can download public datasets and work with them, with custom pre-processing steps. Kaggle, HuggingFace and academic repositories are popular choices.
Another amazing feature is that datasets coming from TensorFlow Datasets are optimized. They're packed into a
tf.data.Dataset object, with which you can maximize the performance of your network through pre-fetching, automated optimization (on the back of TensorFlow), easy transformations on the entirety of the dataset, etc. and you can "peel away" the TensorFlow-specific functionality to expose the underlying NumPy arrays which can generically be applied to other frameworks as well.
We'll be working with TensorFlow datasets as well later in the course.
Google Research and Cloud Datasets
Google's own datasets can be found at research.google/tools/datasets, alongside other tools and services. There are "only" slightly over 100 datasets as of writing, but these aren't small datasets, and they include behemoths such as Google Open Images, Covid-19 Open Data, YouTube-8M, Google Landmarks, etc.
Google Dataset Search

Quite literally the "Google" of datasets, created by Google and accessible at datasets.research.google.com! It searches for datasets from a wide variety of repositories, including Kaggle and academic institutions, and even finds the associated scholarly articles published about a found dataset.
This isn't a curated list - it's a search engine for datasets mentioned in other places with extra useful metadata such as the license, authors, context, content explanation, date of upload, website it's hosted on, etc.
Computer Vision Tools

When just starting out, you don't really need a lot of tools. When you're starting out with pottery - having some clay, water and a flat surface is quite enough. Though, as you practice, you'll likely naturally want to get some tools - a wire to separate your creation from the surface, carving tools for details, a spinning wheel, etc.
For beginners, using a lot of tools can be overwhelming, and most skip them in lieu of trying things out. This is fine, but it's worth keeping some tools in mind for later use when you feel the need for them. Some of these are fairly mandatory, like the use of OpenCV or Pillow, though, you'll only really need a few methods, and anything above that is great but not necessary.
Note: This section will be updated through time.
KerasCV

Currently, KerasCV is under construction. It's a horizontal add-on to Keras, specifically meant to make building industry-grade computer vision applications easier.
It'll feature new layers, metrics, losses and data augmentation building blocks that are too specialized for general Keras, but broadly applicable to computer vision tasks. While it's still under construction - it's on the radar of many, including myself. When it gets released, this course will be updated. In the meantime, the course dedicates a lesson to KerasCV and the new layers currently built into the beta version.
OpenCV and Pillow
Originally created by Intel, OpenCV is a real-time computer vision library with support for a wide variety of computer vision tasks. While it's an entire self-contained ecosystem - practitioners can decide to use it just for image loading and processing, before feeding images into their own applications, frameworks and models. OpenCV is highly performant and well-established in the computer vision community, with ports to multiple languages. It contains many modules that span from core functionality, such as reading, writing and processing images, to clustering and search in multi-dimensional spaces, a deep learning module, segmentation methods, feature detection, specialized matrix operations, etc. You can do computer vision in OpenCV exclusively if you want to.
On the other hand, you have Pillow! Pillow is a fork of PIL (Python Image Library) and is used for image processing. It's not a computer vision library - it's an image processing library. The API is simple and expressive, and it's a very widely used library in the community.
There's no competition between OpenCV and Pillow - they're different libraries used for different tasks. A portion of OpenCV overlaps with Pillow (image processing), but that's about it. Choosing between OpenCV and Pillow is more akin to choosing between a foldable knife and a swiss army knife. Both of them can cut stuff, but one of them also has a bottle opener, a can opener, and might even have a fork hidden inside! If you're just cutting, both will do the job just fine.
Throughout the course, we'll mainly be performing just image processing, so going with either Pillow or OpenCV makes sense. I personally prefer using OpenCV because of its more low-level API, but if you're new to this, Pillow has a more forgiving learning curve (and is less complicated), and the results you'll get are pretty much the same anyway.
There are a couple of small implementation differences to note, such as OpenCV natively using the BGR format, rather than the RGB format (which most other libraries use). Most libraries will detect this and load images in just fine, so to the eye there's no difference. Though, when you try to infer a class from an image loaded in the wrong format, the prediction will most likely be totally wrong, since the different channel order produces different input values.
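The channel-order difference is easy to see with plain NumPy - the BGR↔RGB conversion is just a reversal of the channel axis (in OpenCV, you'd use cv2.cvtColor(img, cv2.COLOR_BGR2RGB) rather than indexing manually):

```python
import numpy as np

# A 1x1 "image" that's pure red in RGB: (R, G, B) = (255, 0, 0)
rgb = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the last (channel) axis converts between RGB and BGR
bgr = rgb[..., ::-1]

print(bgr[0, 0].tolist())  # [0, 0, 255] - the same pixel, in BGR order
```

A model trained on RGB images that receives this BGR pixel would "see" pure blue instead of pure red - which is exactly why mixing up the formats tanks predictions.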
Both APIs are simple and similar, but Pillow's API is simpler and less verbose. In most cases, the OpenCV API calls the central module to which you provide objects for processing:
img = cv2.resize(img, (width, height))
While for Pillow, you call the methods on the objects themselves:
img = img.resize((width, height))
Both libraries can read resources, transform images, change formats, translate, flip, rotate and do all the other good stuff you'd like to do. A good analogy is that OpenCV is to Computer Vision what Scikit-Learn is to Machine Learning: a well-rounded, broadly applicable library with strong traditional foundations and wrappers/support for newer approaches, such as neural networks.
TensorFlow Debugger (tfdbg)
Debugging TensorFlow models isn't fun. Debugging itself is never fun - but there's a special place in my heart for debugging TensorFlow models. It can't be overstated how much high-level APIs (such as Keras) with selective customization of the underlying components made development easier, more accessible and more elegant.
You'll naturally be much less likely to introduce bugs into your code with Keras, since a lot of the not-easy-to-write implementations are optimized and bug-free through it, but you'll eventually probably work with the lower-level API either in a search of more control, or out of necessity. At that point - using the TensorFlow Debugger (tfdbg) will help you keep your peripherals safe from yourself in a gust of frustration. Patience is a virtue for debugging.
Learning Rate Finder

A learning rate finder, popularized by Jeremy Howard of Fast.ai, is a nifty tool for finding an optimal starting learning rate! We'll cover the relevant portion of the original research paper, the concept behind it, and an implementation of the tool with Keras later in the course.
TensorFlow Datasets (tfds)
tfds is a module that allows you to access, download and extract data from the TensorFlow Datasets repository. We'll work with it later in the course.
Know-Your-Data (TF KYD)
TensorFlow's GUI tool - Know Your Data, which is still in beta (as of writing), aims to answer important questions on data corruption (broken images, bad labels, etc.), data sensitivity (does your data contain sensitive content), data gaps (obvious lack of samples), data balance, etc.
A lot of these can help with avoiding bias and data skew - arguably one of the most important things to do when working on projects that can have an impact on other humans.
EthicalML

Not a tool, but a GitHub repository - EthicalML is a curated list of open source tools and libraries that can map out the landscape for you. It's run by the Institute for Ethical AI and Machine Learning.
From explainability tools to data pipelines and ETL, commercial platforms, serialization and versioning, etc. - you can really get a good feel for production ML and MLOps (Machine Learning + DevOps), and find links to sign up for some great newsletters, such as the Machine Learning Engineer Newsletter, led by Alejandro Saucedo. They send out some really good stuff!