ConvNeXt - A ConvNet for the 2020s (2022)
In recent years, Transformers have been "stealing the show" in NLP, but also notably in computer vision. It turns out that so-called Vision Transformers (ViTs) can be used for computer vision, and they perform quite well, at least for image classification. So well, in fact, that for some time they've held most of the top spots on ImageNet leaderboards, and still do. This is a new and still-young paradigm in computer vision, with much promise for the future. Plain ViTs haven't yet been applied as successfully to other computer vision tasks, such as segmentation or object detection, where CNN-based models are clearly in the lead. However, one family of Transformers, known as Swin Transformers, which are hierarchical, has been successfully applied as a generic vision backbone.

While some practitioners consider themselves as "belonging" to a camp - CNN or Transformer - many are exploring the idea of combining them, and as of writing, many top-performing architectures are based on a combination of CNNs and Transformers. A great example of a combination network is CoAtNet (a portmanteau of Convolution and Attention), which stacks convolutional layers and attention layers, unifying the two paradigms.
ConvNeXt came out in January of 2022 - just as the year started. It is a pure-convolutional network inspired by some of the recent advances with ViTs, appropriating a few concepts that helped it make a leap in CNN-based accuracy. Since then, within only 4 months, another 35 models have outperformed it, according to PapersWithCode. These Top-1 Accuracy reports are worth putting in perspective, though. The difference between ConvNeXt (87.8%) and CoCa (91%) isn't a huge one. ConvNeXt was trained on 14M ImageNet images, while CoCa was trained on the 3B images of JFT-3B (an internal Google dataset). And CoCa has a staggering 2.1B parameters (some other Transformer-based architectures have 7.2B), while ConvNeXt has a "mere" 350M - at least ConvNeXt-XL does; one of the smaller variants has 22M.