Practical Deep Learning for Computer Vision with Python - Introduction to Computer Vision

Introduction to Computer Vision

David Landup

The Role of Vision

Take a moment to look around you. There's stuff all around you! If you're sitting at a desk, you're likely in front of a monitor and a keyboard - and maybe a hot (or cold) beverage of choice in a mug. If you're on public transport, there are pipes, shoes, people, seats.

How long did it take for you to infer what these objects are and how laborious was the process?

When you look down at your coffee mug, you're not at all surprised by the object in front of you. You've been able to do that since you were much younger than you are now. If you can read this - you've already trained yourself in visual pattern recognition and context recognition, stringing together words in a sentence to form a coherent message. Way before you could read, even as a baby, you could distinguish between a cat and a dog, as well as between a bird and a whale. Whether you're aware of the labels we've assigned to them ("dog", "chien", "собака", "Hund", etc.) doesn't change the fact that the features that make up a dog are different from the features that make up a whale.

Vision is one of the dominant senses in humans.

Other senses - and, more particularly, our perception of our surroundings - depend on vision. "Seeing is believing" is a very common quote, implying that something doesn't exist if we can't see it. Naturally, we can't deny the existence of gravity or electrical forces just because we can't see them, so the quote is outdated. Additionally, seeing something while under the influence of psychedelics doesn't necessarily mean that it exists outside of, and regardless of, your own subjective experience. While arguments could be made against the semantics of the quote, such as that we see the effects of gravity and therefore need to see to believe, the role of sight is integral to our perception of the world around us.

How Do We See?

It's fairly well known how our eyes work and how the signal gets to our visual cortex, which makes sense of it. By modeling the eye, we've made extremely powerful and complex cameras, which work in much the same way.

While eyes have cones and rods that react to the different frequencies of visible light, cameras have arrays of light-responsive sensors (known as photodiodes). Cameras are, quite literally, mechanical eyes and work using the same principles as our organs!

Where things become somewhat unclear is what happens after that. How are these "images" represented in our brains, and how exactly do we make sense of them? In recent years, amazing advancements have been made in the theory of the visual cortex (and the neocortex in general), but there's still a lot left to be discovered. In computers, images are represented as sets of binary digits, which can be decoded into a set of RGB values to be displayed on a screen. This is partly due to the way sensors and monitors work - recording the spatial intensity of light on the sensors and displaying that same information on a monitor.
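To make that concrete, here's a minimal sketch in plain Python of decoding a flat byte stream into RGB pixel values - the tiny 2x2 "image" is made up purely for illustration:

```python
# A hypothetical 2x2 "image" as raw bytes - 3 bytes (R, G, B) per pixel.
raw = bytes([255, 0, 0,   0, 255, 0,      # red pixel, green pixel
             0, 0, 255,   255, 255, 255])  # blue pixel, white pixel

width, height = 2, 2

# Decode the flat byte stream into rows of (R, G, B) tuples -
# the same structure a screen ultimately renders.
pixels = [
    [tuple(raw[(y * width + x) * 3:(y * width + x) * 3 + 3]) for x in range(width)]
    for y in range(height)
]

print(pixels[0][0])  # (255, 0, 0) - the top-left pixel is pure red
```

Real image formats add compression and metadata on top, but once decoded, every image boils down to a grid of numbers like this one.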

There is no solidified scientific consensus on perception and consciousness.

There are many promising theories and advances are being made with each passing year, but we still lack a unified theory. However, tremendous strides have been made and some mechanisms of the brain have been decoded extremely well. How we perceive information can be decoupled from how we decode information, and we know a thing or two about how the visual cortex decodes the signals that come from our eyes.

The first big advancement in utilizing deep learning for computer vision came when Kunihiko Fukushima modeled the way (part of) the visual cortex works computationally, back in 1979. We're still riding on the avalanche started by that paper but we've also (recently) started exploring some other models on feature extraction. We'll come back to the Neocognitron later.

Computer Vision and Deep Learning

This segues us into Computer Vision! Computer Vision is a very large field, with a plethora of sub-fields. It was a notoriously difficult field to make advances in, especially before deep learning. Computer vision is difficult, and exciting.

How do you automate the process of object recognition, classification, segmentation and localization?

Humans are exceedingly good at spatial awareness, even when standing still. You're aware of whether an object is closer or further away from you, as well as where it's located with respect to other objects. How exactly do you code this? Even if you had all the time in the world to make a rule-based system in which a straight line, with another straight line, that has a curvature on the side is a "mug" - you'd have a hard time thinking of all the possibilities and edge cases. In computer vision, rule-based systems were ruled out fairly quickly as a concept.

As we often do, we reasoned by analogy. We found what works well and tried modeling it.

From 1958 to 1968, David H. Hubel and Torsten N. Wiesel performed experiments with cats and monkeys, examining how their visual cortex works. Amongst other findings, their experiments reported that for both cats and monkeys, neurons in the visual cortex have a small local receptive field.

This underlies how our neurons decode visual input. Some neurons fire only some of the time, on certain small features. Some neurons fired only at straight lines, while some fired only at curved lines. Throughout the experiments, they noticed that some neurons have a larger receptive field, and that these neurons typically reacted to more complex patterns - patterns created by the simpler ones that other neurons fire for. These were, at the time, dubbed "simple cells" and "complex cells", in absence of a better term.

While the visual cortex isn't as simple and clear cut as "Neuron 1 fires only on horizontal lines, and should thus be named the horizontal neuron" - this discovery was crucial.

The visual cortex, like much of the rest of the neocortex, works with hierarchies: low-level features (straight or curved lines) feed into higher-level features (edges), which feed into even higher-level features, until you know how to discern a cat from a dog or a whale, or the letter "A" from the digit "4", even though their constituent features are shared.

This was something to go from! And in 1979 Kunihiko Fukushima published the "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position". The architecture built on top of the idea of S-Cells (simple cells) and C-Cells (complex cells), where simple cells feed into complex cells, creating a hierarchy. It was a self-organizing system, into which you'd feed representative images. This was a precursor to modern Convolutional Neural Networks (CNNs), which have proliferated and advanced the field from recognizing basic shapes and patterns to superhuman ability in recognizing objects and localizing them in images and even live videos.

What was started by Fukushima was extended by Yann LeCun, one of the "godfathers" of AI, in 1998 via "Gradient-Based Learning Applied to Document Recognition", in which raw input is passed through a feature extractor and the resulting feature vector is passed through a trainable classifier module that determines the class of the input.

If you've worked with CNNs before - this will sound very familiar.

At this point, it's worth taking a step back and realizing that CNNs, like all models, are just models of how the visual cortex works. Just as a digital calculator and an analog calculator can both give the same results - the way they work towards that goal is inherently different. CNNs are not a miniature visual cortex. They're a miniature visual-cortex-inspired architecture that yields similar performance.

Alternatives to CNNs?

In general, there are two schools of thought:

  • We should reason by analogy
  • We should reason from first principles

Both methods work, most of the time. When it's unclear which first principles to start from, we might as well reason by analogy to establish a field, and then re-think the process from the ground up with newfound first principles. This is exactly what happened with computer vision.

Before we had much of an idea of how things work at all - we tried modeling the visual cortex and translating the concept into something we could build as well. Recently, a totally different architecture has been proposed, built on the Attention Mechanism, known as the Transformer. It was proposed in 2017 by the Google Brain and Google Research teams in "Attention Is All You Need". A lot of attention is being paid to the promise of the new architecture, and the field is divided - it's very unclear which type of architecture holds the most potential down the line.

For a while, it seemed that transformers were taking the lead - on a website such as PapersWithCode, you can see an up-to-date leaderboard of models and the architectures they're built on. In 2022, at a point when transformers were at the top of the leaderboard, ConvNeXt, a "ConvNet for the 2020s", was released, humbly reminding everyone that there's still much to discover.

Recently - people have been making combinations of CNNs and Transformers! Nobody knows what the future holds, but it's exciting. It's a great idea to keep track of the leaderboard from time to time - it changes very frequently!

Computer Vision Tasks and Applications

Computer Vision is a vast field and can be applied to a wide variety of problems. We can automate a lot of actions we take, such as pulling a lever, clicking a button, sounding an alarm, etc., but it used to be very hard to automate the decision process that leads to those actions.

Clicking a button is simple! Recognizing a bear near the campsite, for which you'll want to sound an alarm, used to be anything but simple for computers. While many "box" computer vision into mainly image classification (which admittedly is a big part of it, in different flavors), it's much, much more than just that. A general rule of thumb is:

Can you see that something is different from something else?

If yes - you can apply computer vision to it. If not - you can still probably apply computer vision to it. A popular example is sound! If you join a competition to recognize the sounds of animals, such as birds, you'll find that it's easier to convert the sound into a waveform or, better yet, a spectrogram, and classify those using computer vision, instead of analyzing patterns in the sound sequences themselves.
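A spectrogram is just short-time frequency content arranged as a 2D array - i.e. an image. Here's a dependency-free sketch of the idea; in practice you'd use a library (and a real recording), so the synthetic sine wave below just stands in for a bird call:

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Naive short-time DFT: one column of frequency-bin magnitudes per frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Magnitude of each frequency bin (only the first half is unique for real input)
        mags = [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size)))
                for k in range(frame_size // 2)]
        frames.append(mags)
    return frames  # a 2D grid of intensities - an image a CNN can classify

# A synthetic 440 Hz tone sampled at 8 kHz stands in for a bird call
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(1024)]
spec = spectrogram(tone)
print(len(spec), len(spec[0]))  # number of time frames x frequency bins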

Here's an example from a Kaggle Notebook written by Shreya, for the BirdCLEF competition:

Shreya's Visualization of bird sounds as a spectrogram

This is what an Ash-throated Flycatcher's ("astfly") sounds look like! Depending on the type of spectrogram you use, the images can very clearly delineate one species of bird from another. A similar technique was used by Ray Kurzweil's company, which tried to tackle understanding human speech back in the 1980s. They realized that it's easier to plot the frequencies and classify phonemes based on them, rather than analyzing sequences with the manually-coded expert systems of that time. In "How to Create a Mind", Kurzweil reflects on the technique:

A spectrogram of a person saying the word 'hide', as taken from Ray Kurzweil's book - "How to Create a Mind"

Computer vision can also be applied to unlikely domains such as malware detection, as demonstrated by a number of researchers - Mahmoud Kalash, Mrigank Rochan et al. in their paper "Malware Classification with Deep Convolutional Neural Networks", Joshua Saxe et al. in their paper "Deep neural network based malware detection using two dimensional binary program features", etc. If you turn a malware binary into 8-bit vectors, and then turn those vectors into pixel values - you can plot images of the binary. Now, these images don't really mean much to us, but distinct patterns occur in malware that don't appear in regular software.

Figure from "Malware classification with deep convolutional neural networks" by Mahmoud Kalash, Mrigank Rochan, et al.

Kalash and Rochan report a 98.52% and 99.97% accuracy, while many of the other papers report 95%+ accuracy rates on the relevant malware detection benchmarks.
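The bytes-to-image trick itself can be sketched in a few lines of plain Python - the `payload` below is made up, while real work would use actual executables and larger, fixed image widths:

```python
def bytes_to_image(data, width=8):
    """Interpret each byte (0-255) as a pixel intensity and wrap into rows."""
    # Pad with zero bytes so the last row is complete
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# A made-up "binary" - in practice this would be a real executable's contents
payload = bytes(range(20))
image = bytes_to_image(payload, width=8)

for row in image:
    print(row)
# Each row is one scanline of the grayscale "picture" of the binary;
# distinct byte patterns in malware show up as distinct visual textures.
```

The resulting 2D grid can then be fed to an ordinary image classifier, exactly as if it were a photograph.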

Data visualization is an art in and of itself - and you can plot just about anything you have data for. If you can plot it, you can probably apply computer vision to it. In a sense, computer vision can be applied to practically all data - not just images. Images are just pixel intensity data anyway, so the fact that we can apply computer vision to them already implies that we can apply it to other data. Let's take a look at some popular tasks and applications of computer vision!

Optical Character Recognition

Optical character recognition was one of the first applications of computer vision! It held the promise of digitizing old issues of newspapers, books, and other literature in the process of switching from physical and analog media to digital. Additionally, a good omni-font optical character recognition tool could help automate reading.

While there were attempts as early as 1913 to create a reading machine, by Dr. Edmund Fournier d'Albe, the machine was ultimately only able to read a word a minute and could only recognize a few letters in a very specific font. In 1974, Ray Kurzweil's company developed an omni-font OCR machine that read text out loud, dubbed the "Kurzweil Reading Machine". It read 150 words per minute, while the average human could read 250.

At the time, this was a revolutionary invention, the size of an entire table. Now, you can quickly download OCR applications that decode images to text in practically all languages in the world and offer translation services on the spot. Google Translate can even read text from your camera in real-time and replace the text on the screen with an augmented, translated version. This works best on street signs, which have a solid background (which can easily be extended over the original text), by placing the recognized and translated text over it:

Real-Time OCR via Google Translate, translating from English to Japanese

It worked! "こんにちは世界" is correct, and it fully understood what I wrote! Applications like these have major implications for travel, human communication and understanding. While this is a hybrid application between computer vision and natural language processing, you have to understand what's in front of you to be able to translate it.

Image Classification

Image classification is the staple task, and a very wide blanket term. Image classification can be further applied to various domains:

  • Medical Diagnosis
  • Manufacturing defect detection
  • Alarm systems (burglar detection)
  • Malware detection
  • etc.

In medical diagnosis, classifiers can learn to distinguish between images containing cancerous and non-cancerous cells, diabetes from retinopathy images, pneumonia from X-rays, etc. In manufacturing, classifiers can identify defective products, such as bottles without caps, broken toys, health convention violations, etc. Coupled with CCTV cameras, alarm systems can recognize the presence of a human after working hours, or even to distinguish a burglar from an employee and sound an alarm.

One major limitation of image classification is the implication that an image belongs to a class:

In each of these images, there's more going on than a single label! Typically, we intuitively know that something in an image belongs to a class, not the entirety of the image. Although an image classifier can be very powerful and find patterns that divide classes, for more realistic applications we'll want to localize objects, and perhaps even predict multiple labels for these objects. For example, a shirt can be blue, yellow, green, black, or any of various other colors. It can also be a long-sleeve shirt or a short-sleeve shirt. It can have X and Y or M and N. Predicting multiple applicable (non-exclusive) labels is known as multi-label image classification (as opposed to single-label classification), and the technique broadened the application of this type of system.
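In practice, the difference comes down to the output layer: single-label classifiers use a softmax (class probabilities sum to 1), while multi-label classifiers use an independent sigmoid per label, each thresholded separately. A minimal sketch with made-up labels and scores:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical labels and raw model scores (logits) for one shirt image
labels = ["shirt", "blue", "long-sleeve", "striped"]
logits = [3.1, 2.4, -1.7, 0.2]

# Each label gets its own independent probability...
probs = [sigmoid(z) for z in logits]
# ...and every label above the threshold applies - they're not mutually exclusive
predicted = [label for label, p in zip(labels, probs) if p > 0.5]

print(predicted)  # the image can be both a "shirt" and "blue" at once
```

With a softmax instead, picking "blue" would necessarily push down the probability of "shirt" - which is exactly the single-label assumption multi-label classification drops.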

Additionally, you can classify an image to a label, and then localize what made the prediction go off. This is known as "classification + localization". However - in more advanced systems, we perform object recognition/detection, not classification.

Object and Face Recognition

Object recognition includes recognizing and classifying an object in an image, regardless of the rest of the image. For example, you can classify an entire image as a "mug" if there's a mug in it. Or, you can classify a mug as a "mug" within an image. The latter is much more accurate and realistic!

Now, this may sound like classification + localization with extra steps, but it isn't. We're not classifying an image and localizing what made the prediction go off. We're detecting an object within an image.

MTheiler's Image from Wikipedia, highlighting object recognition capabilities of a YOLOv3 Network, Creative Commons

We'll cover YOLOv3 in depth in a later lesson, and train our own object recognition network!

More and more, object recognition is being applied to recognize objects from input. This can be corals at the bottom of the sea from a live feed camera (helping marine biologists preserve endangered species), pothole detection (helping humans and cars avoid them), weapon detection (helping keep venues safe), pedestrian detection (helping keep urban planning and traffic safe), etc.

Many equate "Image Classification" and "Object Recognition", but that simply isn't true, and it doesn't help that "image classification" is oftentimes called "image recognition". Image classification is not being replaced by object recognition - they're used for different things, and image classification used to be used for the tasks object recognition is now better at. Image classification is easier to explain and implement, so it's typically the first technique to be covered in most learning resources. We'll cover single-label and multi-label classification in the upcoming lessons, before covering object recognition both from images and video.

In a similar vein - face recognition technology is essentially object recognition! First, the system has to detect a face, its position in the image, and then compare the embeddings of the image to a recorded set of embeddings belonging to an individual. Meta's face recognition algorithms used to offer to tag people automatically in images, but the feature was removed due to privacy concerns of a single company having access to billions of annotated images that can be quickly and easily linked to individuals. Naturally, the privacy concerns haven't only been raised against Meta - any form of identifying information is generally subject to privacy concerns.

Gallery applications on phones can detect and recognize selfies of you and your friends, creating folders that you can easily access. Face recognition can be used to perform visual identification (say, an employee's access to a certain part of a building), for law enforcement (could be a bit Orwellian, perhaps), and a plethora of other applications.

Image Segmentation

When you take a look in front of yourself and see an object - you're also aware of where it starts and where it ends. You're aware of the boundaries that define an instance of some object, and that when a "mug" ends, there might be some empty space until a "notebook" object, and that they're both resting on a "table".

This is segmentation! You're segmenting a mug from a notebook. Segmentation comes in a couple of flavors:

  • Semantic segmentation
  • Instance segmentation

Semantic segmentation involves segmenting which pixels belong to which class in an image. Instance segmentation involves segmenting which instances belong to which class in an image. The difference is similar to the difference between image classification and object detection. With semantic segmentation - it doesn't matter which chair is which, all that's important is that something belongs to the "chair" class. Instance segmentation will note the difference between each instance of a "chair".
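The distinction is easy to see on a toy label mask: the semantic mask below only says which pixels are "chair", and a connected-components pass (a simple flood fill - purely an illustration, not how real instance segmentation networks work) splits it into separate instances:

```python
def instances_from_semantic(mask):
    """Label each 4-connected blob of 1s with its own instance id (1, 2, ...)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] == 1 and out[sy][sx] == 0:
                next_id += 1
                stack = [(sy, sx)]  # flood fill from this seed pixel
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y][x] == 1 and out[y][x] == 0:
                        out[y][x] = next_id
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return out

# Semantic mask: 1 = "chair" pixel, 0 = background. Two separate chairs.
semantic = [[1, 1, 0, 0, 1],
            [1, 1, 0, 0, 1],
            [0, 0, 0, 0, 0]]

for row in instances_from_semantic(semantic):
    print(row)  # the two blobs get distinct ids - instances, not just a class
```

The semantic mask answers "is this pixel a chair?", while the instance labels additionally answer "which chair?".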

You can think of image classification, classification + localization, object detection, semantic segmentation and instance segmentation as levels of microscopy applied to an image. Segmentation is, in a sense, pixel-level classification, while image classification assigns a label to an entire image (and all of its pixels) based on something (potentially) in the center of the image.

Pose Estimation

People move in predictable ways. If something's predictable, you can bet that someone's trying to make a model that predicts or estimates some value regarding it. Whether you want to predict someone's pose slightly in the future, or whether you want to quantify their presence - pose estimation is the process of estimating spatial locations of key body joints such as elbows, heads, knees, feet and hands.

Xbox and PlayStation used pose estimation for their Kinect and EyeToy products (both of which are now, sadly, retired) which allowed the consoles to track the movement of players and augment them into games. If you've had the chance to play with some of these, it's probably an experience you'll remember - at least I know I'll remember mine.

Pose estimation can be used to totally remove expensive tracking equipment for film-making and 3D movement tracking (although it's not quite there yet). If you no longer need to have a tracking suit to create CGI movies - recreating scenes in 3D will be significantly easier! Metaverse-type worlds where your movement is translated directly into the Euclidean space of another world will become more accessible (requiring just a simple camera worth a few dollars). In recent years, a whole new genre of entertainment online arose - VTubers (Virtual YouTubers) who use pose estimation from webcams to translate their movement onto virtual bodies (typically 2D) as a replacement for their own presence on streams. Personal identity and expression while retaining privacy can be redefined with such technology.

Finally, pose estimation can be transferred to robotics. It's conceivable that we could attempt training robots to perform human-like movements through training data of our own movement (quantified through pose estimation), or manually control them through our own movement, similar to how we'd control 3D models in a virtual augmented world.

If you own the physical copy of this course, you can view the Gif here.

Motion Analysis

Motion analysis builds upon object detection, motion detection, tracking and pose estimation. In human brains, the cerebellum performs adaptive prediction of movement, and it's pretty effortless for us. A ball falling downward will reach our hand in an approximately predictable manner. A soccer player uses this knowledge to kick a ball into the goal, and a goalkeeper uses this same knowledge to predict where the ball will be in a short while and will try to intercept it.

It's oftentimes understated how important this prediction ability is! Motion analysis is, in a sense, a blanket term for the multiple abilities that make such predictions possible.

Image Restoration and De-noising

Physical images suffer more than digital ones through time, but both physical images and digital ones can degrade if not protected properly. Hard drives can become faulty, electromagnetic waves can introduce noise, and physical images can oxidize, fade out, and get exposed to environments unfriendly to the printing paper.

Computer vision can be applied to restore, colorize and de-noise images! Typically, motion blur, noise, issues with camera focus, and physical damage to scanned images can be fairly successfully removed. There are many papers dealing with image restoration (PapersWithCode task page), and the results are getting better by the day - here are the results of "Learning Deep CNN Denoiser Prior for Image Restoration" by Kai Zhang et al.:

"Learning Deep CNN Denoiser Prior for Image Restoration" by Kai Zhang, Wangment Zuo, Shuhang Gu and Lei Zhang

Soon enough, the movie scenes in which the secret intelligence agents "pause, rewind, enhance" a blurry image to a full HD one might not be too far away from reality!
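Deep denoisers like the one above learn their filters from data; the classical baseline they improve upon can be as simple as a mean filter. A stdlib sketch over a made-up grayscale patch:

```python
def mean_filter(img):
    """Replace each pixel with the average of its 3x3 neighborhood (edges clamped)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sum(neigh) // len(neigh)
    return out

# A flat gray patch with one bright "noise" pixel in the middle
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 255

denoised = mean_filter(noisy)
print(denoised[2][2])  # the spike is pulled back toward its neighbors
```

Averaging removes the spike but also blurs real edges - which is precisely the trade-off that learned denoisers handle so much better.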

Scene Reconstruction

Scene reconstruction is a fairly new and extremely exciting application of computer vision! We can construct a scene in our minds, given an image. When talking with someone, you know that they're not a flat, 2D piece of paper, and that if you were to circle around them, there's the back of their head. You're effectively constructing a 3D model of the world around you all the time - so, can we do this with code?

Turns out, we can - at least to a degree, as of writing. Typically, this involves taking images from a couple of angles, from which the 3D structure of an object can be inferred. Then, a 3D mapping can be created. Some of the most amazing visuals were created by the Google Research team, and can be found at nerf-w.

Research has also been conducted in limiting the input to a single angle (single viewpoint), and while the methods are still new, they're promising.

Image Captioning, Image Generation, Visual Questions

Increasingly, vision is being combined with language. Another task that's relatively easy for us, but which used to be hard for computers, is image captioning. Given an image, you could describe what's going on if the image has enough context - can computers? Other than image captioning, given a description, we can imagine an image - can computers? If I asked you to describe a certain part of an image, or answer a question regarding its contents, you could - can computers?

This involves more than just vision - it involves a lot of context from natural language processing, as well as a deep visual understanding. Up until recently, this was a far-fetched task, but with each passing year there's rapid progress in this field, with some amazing results being released in 2022, as of writing.

For instance, DAMO Academy and the Alibaba Group released "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework", in which their model (OFA) handles these exact tasks! It's got a ResNet backbone for computer vision (popular CNN architecture, covered in detail and used later in the course) and unifies it with other deep learning architectures to generate text and images based on the task at hand:

"Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-To-Sequence Learning Framework"

Needless to say - this is an exceedingly difficult goal, and reviewing the work in a compact manner doesn't do justice to the effort and scope of their work. Text-to-image conversion (which is, really, sequence-to-sequence conversion) has gotten a lot of attention in recent years and months! A great example is the viral "Dream" application, developed by WOMBO. It allowed users to input seed text, from which lucid images were created. A user could pick between several "styles", such as "vibrant", "fantasy", "steampunk", "psychic", etc. Within seconds, with a real-time view of the patterns being brushed on the screen, the user would be presented with a synthetically generated image based on the text they'd input.

The code isn't open source, so we can only speculate how the team achieved this, but it appears as if they've performed Neural Style Transfer (transferring the style of an image to another one, typically used to convert images to another style, such as turning a regular photo into a Klimt-esque image), mixed with a text-to-image generator. What gave the application popularity is how mysterious and deeply beautiful the generated images were. They stirred emotions - fear, happiness, inquisitiveness, wonder. Truly, that's what art is about, isn't it?

Here are a few images, generated from the same prompt - "robot dreaming of electric sheep". While none of these feature a robot with a thought bubble full of electric sheep, you can clearly see that both the "robot" and "sheep" patterns appear throughout several styles:

Images generated by the Dream Application

There's something deeply beautiful about these images, even though they could be touted as simple pattern recall with a bit of style. There's beauty in patterns, and pattern recall. We'll take a look at a great example of another type of image generation later, known as the Deep Dream algorithm, which embeds prior belief of patterns into an image, resulting in hallucinogenic views, and implement it.

Aleksa Gordić, a DeepMind research engineer, implemented the Deep Dream algorithm and has produced some of the most beautiful images I've ever seen with it:

Results of Aleksa Gordić's Implementation of Deep Dream

We won't dwell too much on the details of the implementation or how it works now - that's for a later lesson! If you're new to this, there's a bit of walking to do before running. Don't worry though - the road is pretty well-made and maintained!

One last paper we'll take a look at is DALL·E 2, created by OpenAI. It was released under "Hierarchical Text-Conditional Image Generation with CLIP Latents" and can create extremely well-crafted images from textual descriptions. On their website, you can use their interactive demonstration to generate a few images by simply pointing and clicking:

I couldn't hope to encapsulate more than a sliver of active research in a section like this, but I hope that it painted a good picture of where the field is headed, and what it includes.

Computer Vision Datasets

There are various datasets you could start playing with! Some are standardized as classic benchmarks, while some are simply popular with practitioners. There are too many to count, and we'll even create our own computer vision dataset using the Bing API in a project later, but some of the noteworthy ones include:

Dogs vs. Cats

Dogs vs. Cats: The Dogs vs. Cats dataset is oftentimes used for teaching purposes, and can typically be fit very easily, to practically 100% accuracy.

Hot Dog - Not Hot Dog

Hot Dog - Not Hot Dog: A joke binary classification dataset, inspired by a TV show, that lets you train a classifier to classify hot dogs and not hot dogs (everything else).

CIFAR10 and CIFAR100

CIFAR10 and CIFAR100: Two datasets created by researchers from the Canadian Institute For Advanced Research. Both datasets have 60k images, 32x32 in size (fairly small). CIFAR10 has ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), while CIFAR100 has 100 classes. The 100 classes are organized into 20 coarse labels, such as "food containers", each of which contains fine classes such as "bottles", "bowls", "cans", "cups" and "plates". The datasets serve as a great starting point! CIFAR10 is not too hard to fit, since the features that make up a horse are fairly different from the ones that make up a frog. It's not too hard to get a decent accuracy on it, even for beginners who are just starting out. CIFAR100 is much harder to fit (making you rethink your architecture and look for ways to understand it better), in part because it has 10 times the classes of CIFAR10 with the same number of images - making the leap from 6k images per class to only 600 per class. Additionally, if you're doing fine-label prediction, the distinction between a cup and a can isn't too obvious. We'll talk more about these two datasets and train a classifier for each in the next lesson.


ImageNet

ImageNet: ImageNet is the current flagship for computer vision training and evaluation. It's the largest and most realistic dataset we have to date, with 1000 classes, spanning 14.2M images at real-world sizes, which vary between images - they're not all uniformly sized. Most models are trained and evaluated on and for ImageNet, and most research papers use it as the benchmark. If you make a breakthrough model - this is where you'll test it! It's worth noting just how huge a dataset it is. 14.2M images might not sound like a lot, but as soon as you start working on computer vision problems, you'll realize that you're lucky if you have a hundred thousand images (and that even that number is already difficult to deal with computationally). Whenever you're dealing with a pre-trained network in computer vision, you'll typically be able to load in ImageNet weights and transfer this knowledge to other datasets. We'll dive into transfer learning in a later lesson, so we won't focus on it now. It's worth noting upfront that transfer learning is another large part of why computer vision has advanced so much in the past years, and some luminaries consider transfer learning to be one of (if not the) most important topics to cover in educational resources.


Imagenette: Imagenette is a subset of ImageNet. Jeremy Howard, one of the leading voices in AI, created Imagenette so that he could test out prototypes of new networks faster. Training on ImageNet takes a lot of time, and his personal advice - making sure that tests can be run within a couple of minutes for quick validation - couldn't be applied to such a huge dataset. Making experimentation and validation faster, and democratizing datasets like this, is a huge part of why AI researchers have been able to advance the field so rapidly in recent years.

MS COCO

MS COCO: Common Objects in Context is a fairly large dataset of, well, objects in context. Context is really important for any machine learning system, and for computer vision in particular. COCO has 330k images, 200k of which are labeled (labeling is extremely expensive). Labeled images have bounding boxes for one or multiple objects and 5 captions per image, and the dataset contains 1.5M object instances in total. While CIFAR10, CIFAR100 and ImageNet are aimed at classification (whether an image belongs to a class) - COCO can be used for various other applications such as object detection (locating objects within images), caption generation (explaining what's in an image) and object segmentation (figuring out where an object starts and ends in an image).
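COCO's labels ship as one big JSON file. As a minimal sketch of that structure - the file names and values below are invented for illustration, but the "images"/"annotations"/"categories" layout and the [x, y, width, height] bounding-box convention are COCO's - here's how you might group object instances per image:

```python
# A tiny, hand-made annotation dict in the COCO JSON layout (real files
# are far larger; these particular entries are made up for illustration).
coco = {
    "images": [{"id": 1, "file_name": "street.jpg"}],
    "categories": [{"id": 3, "name": "car"}, {"id": 1, "name": "person"}],
    "annotations": [
        {"image_id": 1, "category_id": 3, "bbox": [10, 20, 100, 50]},
        {"image_id": 1, "category_id": 1, "bbox": [130, 15, 40, 90]},
    ],
}

# Resolve category ids to names, then group instances by image.
names = {c["id"]: c["name"] for c in coco["categories"]}
objects_per_image = {}
for ann in coco["annotations"]:
    objects_per_image.setdefault(ann["image_id"], []).append(
        (names[ann["category_id"]], ann["bbox"])  # bbox is [x, y, width, height]
    )

print(objects_per_image[1])
# [('car', [10, 20, 100, 50]), ('person', [130, 15, 40, 90])]
```

This per-image grouping is essentially what object detection pipelines do before drawing boxes or building training targets.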

The landscape changes through time. While it's not likely that ImageNet will be dethroned tomorrow, since labelling 14.2M images takes time and vast resources - the best way to stay on top of the datasets you can use to train your models is to search for them. There's enough to go around, and it's oftentimes best to find a dataset in a specific niche you're interested in. For instance, you might be interested in applying computer vision to medical diagnosis (we'll cover an entire end-to-end project for breast cancer diagnosis later), self-driving cars, pose estimation for translating human movement into 3D, art generation, etc.

Depending on the specific thing you want to build - there's likely a dataset out there, ready for you to tackle it. Some of them won't be pretty - they'll be full of misshapen data, broken or too few images, and most importantly - they might lack labels, or even have wrong labels. This is currently one of the biggest hurdles to get over. Finding a good way to get labeled data is hard, and some are switching to unsupervised methods in an attempt to circumvent the issue.


MNIST and Fashion-MNIST

The MNIST hand-written digits and Fashion-MNIST datasets are very well-known, simple datasets. They're mainly used for educational purposes, and were originally created while CNNs were still in their infancy.

Google's Open Images

The Open Images dataset is a 9M image dataset with almost 16M bounding boxes on 600 categories, 2.8M instance segmentation masks on 350 categories, 3.3M relationship annotations for 1.5K relationships, 700k localized narratives and 60M image-level labels on 20k categories. This amount of meta-data allows us to create a wide variety of computer vision applications!

In sheer scale, it's comparable to ImageNet, but like MS COCO it provides bounding boxes, relationship annotations, instance segmentation, etc. Oh, and it's 565GB in size if you decide to download it locally.

Searching for Datasets

Having a list of some popular datasets is great. But - how do you search for them? You'll be doing a lot of searching, curation and creation in the future, so having at least a broad idea of places that offer high-quality datasets is a great start!

Other than the platforms highlighted in this section - Google is your best friend.


Kaggle

Kaggle is one of the world's largest Data Science/ML platforms, with a thriving community. It offers over 50k datasets (in all domains of Data Science), typically created by regular users, but also by research teams, companies and institutes.

Kaggle is also known for holding competitions with very decent prizes, depending on the budgets of the companies and teams that approach them. This way - eager data scientists can work on the burning unsolved problems in the world, without strings attached, gain rewards for their work, and companies/research teams can crowd-source solutions to problems that can help the world (or increase profits).

At any given point, you'll find several competitions on Kaggle that last for a few months, with prize pools reaching $50-75k (though most only award "knowledge" and "swag"), and thousands upon thousands of teams enrolling and competing to produce the best models - from medical diagnosis, to stock exchange prediction, to image matching, to identifying and preserving near-extinct animal species, to localizing plants and animals in satellite images.

Kaggle has a CLI that allows you to programmatically download datasets (helping automate Data Science pipelines) and provides users with an environment in which to run notebooks free of charge, with a weekly quota of free GPU usage (the number of hours depends on availability). It's more than safe to say that Kaggle plays an important part in the proliferation, democratization and advancement of Data Science all around the world.

We'll be working with Kaggle CLI and Kaggle datasets later in the course.


HuggingFace

HuggingFace is a primarily NLP-focused community, with some computer vision datasets. However, as noted in earlier sections, computer vision is being combined with NLP at an increasing rate. For visual question answering, image captioning, and similar tasks - you'll probably want to at least peruse HuggingFace.

While offering a "modest" 4.5K datasets as of writing, HuggingFace is gaining more and more traction and attention from the community, and it's worth having it on your radar for the days to come.

TensorFlow Datasets

TensorFlow Datasets is a collection of datasets and corpora, curated and ready for training. All of the datasets from the module are standardized, so you don't have to bother with different preprocessing steps for every single dataset you're testing your models on. While it may sound like a simple convenience rather than a game-changer - if you train a lot of models, the time spent on overhead work gets beyond annoying. The library provides access to datasets from MNIST to Google Open Images (11MB - 565GB), spanning categories such as Audio, D4RL, Graphs, Image, Image Classification, Object Detection, Question Answering, Ranking, RLDS, Robomimic, Robotics, Text, Time Series, Text Simplification, Vision Language, Video, Translate, etc.

As of 2022, 278 datasets are available, and community datasets are supported, with over 700 HuggingFace datasets and the Kubric dataset generator. If you're building a general intelligent system, there's a very good chance a public dataset exists there. For all other purposes - you can download public datasets and work with them, applying custom preprocessing steps. Kaggle, HuggingFace and academic repositories are popular choices.

Another amazing feature is that datasets coming from TensorFlow Datasets are optimized. They're packed into a tf.data.Dataset object, with which you can maximize the performance of your network through prefetching, automated optimization (on the back of TensorFlow), easy transformations on the entirety of the dataset, etc., and you can "peel away" the TensorFlow-specific functionality to expose the underlying NumPy arrays, which can generically be applied to other frameworks as well.
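The core idea behind prefetching is overlapping data preparation with consumption: while your model chews on one batch, a background worker readies the next. Here's a minimal pure-Python sketch of that idea (not TensorFlow's actual implementation - tf.data does this natively via Dataset.prefetch):

```python
import queue
import threading
import time

def prefetch(generator, buffer_size=2):
    """Run a generator in a background thread so that upcoming items
    are prepared while the consumer is busy - the concept behind
    tf.data's Dataset.prefetch, sketched with stdlib tools."""
    q = queue.Queue(maxsize=buffer_size)
    _END = object()  # sentinel marking the end of the stream

    def producer():
        for item in generator:
            q.put(item)  # blocks when the buffer is full
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _END:
            return
        yield item

# Toy "dataset": each batch takes ~10 ms to prepare.
def batches():
    for i in range(5):
        time.sleep(0.01)  # simulated preprocessing cost
        yield i

result = list(prefetch(batches()))
print(result)  # batches arrive in order: [0, 1, 2, 3, 4]
```

With real tf.data pipelines you'd simply chain `.prefetch(tf.data.AUTOTUNE)` onto the dataset instead of writing any of this yourself.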

We'll be working with TensorFlow datasets as well later in the course.

Google Research and Cloud Datasets

Google's own datasets can be found on the Google Research and Google Cloud dataset pages, alongside other tools and services. There are "only" slightly above 100 datasets as of writing, but these aren't small datasets, and they include behemoths such as Google Open Images, Covid-19 Open Data, YouTube-8M, Google Landmarks, etc.


Google Dataset Search

Quite literally the "Google" of datasets, created by Google! It searches for datasets from a wide variety of repositories, including Kaggle and academic institutions, and even finds the associated scholarly articles published about a found dataset.

This isn't a curated list - it's a search engine for datasets mentioned in other places with extra useful metadata such as the license, authors, context, content explanation, date of upload, website it's hosted on, etc.

Useful Tools

When just starting out, you don't really need a lot of tools. When you're starting out with pottery - having some clay, water and a flat surface is quite enough. Though, as you practice, you'll likely naturally want to get some tools - a wire to separate your creation from the surface, carving tools for details, a spinning wheel, etc.

For beginners, using a lot of tools can be overwhelming, and most skip them in favor of just trying things out. This is fine, but it's worth keeping some tools in mind for later, when you feel the need for them. Some of these are fairly mandatory, like OpenCV or Pillow - though you'll only really need a few of their methods, and anything beyond that is great but not necessary.

Note: This section will be updated through time.


Keras-CV

Currently, Keras-CV is under construction. It's a horizontal add-on to Keras, specifically meant to make building industry-grade computer vision applications easier.

It'll feature new layers, metrics, losses and data augmentation building blocks that are too specialized for general Keras, but very applicable and can be broadly used in Computer Vision tasks. While it's still only under construction - it's on the radar of many, including myself. When it gets released, this course will be updated.

OpenCV and Pillow

Originally created by Intel, OpenCV is a real-time computer vision library with support for a wide variety of computer vision tasks. While it is an entire self-contained ecosystem - practitioners can decide to use it just for image loading and processing, before feeding images into their own applications, frameworks and models. OpenCV is highly performant and well-established in the computer vision community, with ports to multiple languages. It contains many modules, spanning from core functionality such as reading, writing and processing images, to clustering and search in multi-dimensional spaces, a deep learning module, segmentation methods, feature detection, specialized matrix operations, etc.

On the other hand, you have Pillow! Pillow is a fork of PIL (Python Image Library) and is used for image processing. It's not a computer vision library - it's an image processing library. The API is simple and expressive, and it's a very widely used library in the community.

There's no competition between OpenCV and Pillow - they're different libraries used for different tasks. A portion of OpenCV overlaps with Pillow (image processing), but that's about it. Choosing between OpenCV and Pillow is more akin to choosing between a foldable knife and a Swiss Army knife. Both of them can cut stuff, but one of them also has a bottle opener, a can opener, and might even have a fork hidden inside! If you're just cutting, both will do the job just fine.

Throughout the course, we'll mainly be performing just image processing, so going with either Pillow or OpenCV makes sense. I personally prefer using OpenCV because I've worked with it before, but if you're new to this, Pillow has a much more forgiving learning curve (and is less complicated) and the results you'll get are pretty much the exact same anyway.

There are a couple of small implementation details to note, such as that OpenCV natively uses the BGR channel order, not the RGB order most other libraries use. To the eye, an image loaded and displayed with OpenCV itself looks fine - but if you feed a BGR image into a model trained on RGB images, the inferred class will most likely be totally wrong, since the swapped channels produce different inputs.
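The BGR-vs-RGB mixup is easy to demonstrate. A minimal sketch using NumPy - OpenCV itself isn't required here, since the conversion is just a reversal of the channel axis (with OpenCV you'd call cv2.cvtColor(img, cv2.COLOR_BGR2RGB) instead):

```python
import numpy as np

# A tiny 1x2 "image": one pure-red and one pure-blue pixel, stored in
# BGR channel order (the format OpenCV's imread returns).
img_bgr = np.array([[[0, 0, 255],    # red pixel:  B=0,   G=0, R=255
                     [255, 0, 0]]],  # blue pixel: B=255, G=0, R=0
                   dtype=np.uint8)

# Converting to RGB is just reversing the last (channel) axis;
# cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB) does the same thing.
img_rgb = img_bgr[..., ::-1]

print(img_rgb[0, 0].tolist())  # [255, 0, 0] - red now leads
```

If you skip this conversion, a model sees your red objects as blue and vice versa - which explains the wildly wrong predictions.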

Both APIs are simple and similar, but Pillow's API is simpler and less verbose. In most cases, with OpenCV, you call functions on the central cv2 module and pass in the objects to process:

img = cv2.resize(img, (width, height))

While for Pillow, you call the methods on the objects themselves:

img = img.resize((width, height))

Both libraries can read resources from URLs, transform images, change formats, translate, flip, rotate and do all the other good stuff you'd like to do.
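As a runnable end-to-end version of the Pillow call above - the image here is generated in memory rather than loaded from disk, so no file path is assumed:

```python
from PIL import Image

# Create a solid-color 100x50 RGB image in memory; Image.resize takes
# the target size as a (width, height) tuple and returns a new image.
img = Image.new("RGB", (100, 50), color=(30, 120, 200))
small = img.resize((64, 32))

print(small.size)  # (64, 32)
```

The OpenCV equivalent of the resize step would be `small = cv2.resize(img, (64, 32))` on a NumPy array - same (width, height) ordering for the target size, different object model.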

TensorFlow Debugger (tfdbg)

Debugging TensorFlow models isn't fun. Debugging itself is never fun - but there's a special place in my heart for debugging TensorFlow models. It can't be overstated how much high-level APIs (such as Keras) with selective customization of the underlying components made development easier, more accessible and more elegant.

You'll naturally be much less likely to introduce bugs into your code with Keras, since a lot of the not-easy-to-write implementations are optimized and bug-free through it, but you'll eventually probably work with the lower-level API either in a search of more control, or out of necessity. At that point - using the debugger will help you keep your peripherals safe from yourself.


Learning Rate Finder

A learning rate finder, popularized by Jeremy Howard, is a nifty tool for finding an optimal starting learning rate! We'll cover the relevant portion of the original research paper, the concept behind it, and an implementation of the tool with Keras later in the course.
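The core of the idea is an exponential sweep: train for a handful of mini-batches while increasing the learning rate each step, then pick a rate just before the loss starts to diverge. A simplified sketch of that schedule - the function name and parameters are illustrative, not fastai's or Keras's actual implementation:

```python
def lr_sweep(lr_min, lr_max, num_steps):
    """Exponentially spaced learning rates for an LR range test:
    step i gets lr_min * (lr_max / lr_min) ** (i / (num_steps - 1)),
    so the sweep covers several orders of magnitude evenly in log space."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

# Sweep from 1e-6 to 1e-1 over 6 steps - each step multiplies the LR by 10.
lrs = lr_sweep(1e-6, 1e-1, 6)
print(lrs)
```

In a real finder, you'd record the training loss at each of these rates and plot loss against (log) learning rate; the steepest downward slope marks a good starting LR.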

TensorFlow Datasets (tfds)

tfds is the Python module that allows you to access, download and extract data from the TensorFlow Datasets repository. We'll work with it later in the course.

Know-Your-Data (TF KYD)

TensorFlow's GUI tool - Know Your Data, which is still in beta (as of writing), aims to answer important questions on data corruption (broken images, bad labels, etc.), data sensitivity (does your data contain sensitive content), data gaps (obvious lack of samples), data balance, etc.

A lot of these can help with avoiding bias and data skew - arguably one of the most important things to do when working on projects that can have an impact on other humans.

Do I Need Expensive Equipment?

No. It's great if you have it, but it's not necessary. Having a tower built with 4 graphics cards won't make you a good deep learning engineer or researcher - it'll just make the algorithms run faster.

Some datasets, to be fair, are possible but simply impractical to run on slower systems, and computer vision is generally best done with a GPU. If you don't have access to one - you can always use cloud-based providers, free of charge. Platforms like Kaggle and Google Colab, at the time of writing, provide you with a weekly quota (in hours) of free GPU usage. You just connect to their cloud-based service and run your notebooks. Even if you have a GPU, chances are that theirs are going to be better than yours. The selection of GPUs and access changes through time, so to stay up to date with their offerings, it's best to visit the websites yourself.

Other providers do exist as well - and they typically offer a subscription that nets you access to better resources and/or have a payment model where you pay for each minute/hour you use their resources for. I purposefully won't mention or explicitly endorse any paid product for obvious reasons in the course, though, a quick Google search can find the competitive services.

Without a doubt - services like these substantially help democratize knowledge and access to resources, making research possible from any part of the world, from the comfort of your home. With these services, you can train models with cutting-edge performance within a reasonable timeframe, on many of the tasks you decide to dedicate your time to.

How the Course is Structured

The course is structured through Guides and Guided Projects.

Guides serve as an introduction to a topic, such as the following introduction and guide to Convolutional Neural Networks, and assume no prior knowledge in the narrow field.

Guided Projects are self-contained and serve to bridge the gap between cleanly formatted theory and practice, putting you knee-deep into the burning problems and questions in the field. With Guided Projects, we presume only the knowledge of the narrower field that you could gain from following the lessons in the course. You can also enroll in Guided Projects as individual mini-courses, though you gain access to all relevant Guided Projects by enrolling in this course.

Theory is theory and practice is practice. Any theory will necessarily be a bit behind the curve - it takes time to produce resources like books and courses, and it's not easy to "just update them".

Guided Projects are our attempt at making our courses stay relevant through the years of research and advancement. Theory doesn't change as fast. The application of that theory does.

In the following lesson, we'll jump into Convolutional Neural Networks - how they work, what they're made of and how to build them, followed by an overview of some of the modern architectures. This is quickly followed by a real project with imperfect data, a lesson on critical thinking, important techniques and further projects.

Lesson 2/13

© 2013-2022 Stack Abuse. All rights reserved.