
# Introduction

David Landup

### Convolutional Neural Networks - Beyond Basic Architectures

So far, we've been working with a very distinctive, very exemplary architecture. I've noted that it's fairly similar to the VGG architecture, which reigned supreme for a short while but is slowly being phased out.

This sort of network is easy to understand because it's practically a 1-to-1 mapping to the most intuitive explanation of how CNNs work - through convolutional layers, pooling layers, flattening layers and a fully-connected layer. It's also the most intuitive to understand with a limited understanding of how the visual cortex works. If you're a neuroscientist - you've likely aggressively cringed at the simplification of the inner-workings of the visual cortex from earlier lessons. The concept of hierarchical representations is there - but that's where our implementation and the cortex part ways.

The architecture used so far is, in a sense, the most natural and gentle introduction to CNNs - conceptually and implementation-wise. It provides fairly decent performance (in terms of accuracy) but bottlenecks with the number of parameters. At this point, you've built multiple classifiers, all of them very capable. You've been introduced to the inner workings of the classifiers, been exposed to latent space visualization and biases, challenged the notion that overfitting is bad, explored the implications of data augmentation and context, implemented a custom loss function and metric, explored class imbalance, and even written a research-grade classifier for Invasive Ductal Carcinoma!

In my quest to demystify deep learning for computer vision - one hurdle still remains a black box: the architectures themselves. I've mentioned various other architectures so far, with a promise that they'll be covered later. It might seem late to introduce them now since we've used them both for transfer learning and the breast cancer classification project - but it's exceedingly difficult to really appreciate some of the advancements made with these architectures without going through a real project and evaluating the performance of the different architectures.

You don't need to deeply understand an architecture to use it effectively in a product. You can drive a car without knowing whether the engine has 4 or 8 cylinders and what the placement of the valves within the engine is. However - if you want to design and appreciate an engine (computer vision model), you'll probably want to go a bit deeper. Even if you don't want to spend time designing architectures and want to build products instead, which is what most want to do - you'll still find interesting information in this lesson. If nothing else - you'll learn why using outdated architectures like VGGNet will hurt your product and performance, why you should skip them if you're building anything modern, which architectures you can go to for solving practical problems, and what the pros and cons are for each. If you're looking to apply computer vision to your field, using the resources from this lesson - you'll be able to find the newest models, understand how they work, and know by which criteria to compare them and decide which to use.

Now, it's time to peel back the layers and take a look at how they really work. In this lesson, I'll take you on a bit of time travel - going from 1998 to 2022, highlighting the defining architectures developed throughout the years, what made them unique and what their drawbacks are, and implementing the notable ones from scratch. There's nothing better than having some dirt on your hands when it comes to these, and you've come to a point where you can really appreciate these additions and the benefits they give. Some architectures are more complex than others, and it would take a fair bit of theoretical underpinning to properly implement them with all the quirks. So, to stay true to the practical nature of the course, we won't implement all of them from scratch, but will linger enough to highlight their contributions.

Important: This lesson will serve as a guide through ideas and the progression of "common wisdom". Some architectures are factually more relevant and have left a more significant legacy than others, and I'll spend more time on those. For instance, plain ResNets don't offer cutting-edge performance anymore, but new tweaks, variants and combinations are still relevant today in 2022. Thus - investing more time into ResNets will lower your barrier to entry to newer architectures that leverage the concept of residual learning.

I've gone through several dozen research papers to write this lesson, and many of them will be referenced throughout it. This is also a great time to go through them yourself and practice paper-reading! Reading papers is a skill in and of itself - they can be complex, full of technical lingo and otherwise hard to follow if you don't have extensive experience. Throughout the lesson, I'll break down some of the information from them into actionable, easy tasks that we can implement right away. You don't have to Google for architectures and their implementations - they're typically very clearly explained in the papers, and frameworks like Keras make these implementations easier than ever.

The key goal of this lesson is to teach you how to find, read, implement and understand architectures and papers. No resource in the world will be able to keep up with all of the newest developments. I've included the newest papers here - but in a few months, new ones will pop up, and that's inevitable. Knowing where to find credible implementations, compare them to papers and tweak them can give you the competitive edge required for many computer vision products you may want to build.

We won't be implementing all of these in the lesson, since some implementations are fairly long and would necessitate a large time investment in each and every architecture, which would defeat the point. For all implementations - please inspect and play around with the associated Jupyter Notebook.

Besides the notebook - a great place to view concrete implementations is in Keras' applications! Through the official GitHub page, you can access all of the implemented architectures. There's a necessary amount of software engineering overhead in each class - such as for defining different versions, encapsulating logic, and using optional features such as naming blocks that might make these classes seem super complex. Though, when you focus on the essence - none of these are inherently harder than what we've built so far. These will all use the Functional API from a recent TensorFlow version, since it's much more expressive and most of these architectures aren't sequential, but it's amazing practice to copy the code and "skim off" the non-essential parts to get a good feel for them.

For example, this piece of code might seem much more complex than it is, largely because of formatting:

if input_tensor is None:
    img_input = layers.Input(shape=input_shape)
else:
    if not backend.is_keras_tensor(input_tensor):
        img_input = layers.Input(tensor=input_tensor, shape=input_shape)
    else:
        img_input = input_tensor
# Block 1
x = layers.Conv2D(
    64, (3, 3), activation="relu", padding="same", name="block1_conv1"
)(img_input)
x = layers.Conv2D(
    64, (3, 3), activation="relu", padding="same", name="block1_conv2"
)(x)
x = layers.MaxPooling2D((2, 2), strides=(2, 2), name="block1_pool")(x)
...


You can simplify it down to:

img_input = layers.Input(shape=input_shape)

x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(img_input)
x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), strides=(2, 2))(x)
...


Now - the original wrapper checks for input types and adjusts the code to be compatible with different types. This is the software engineering aspect of model development. In this lesson - we'll mainly be interested in the essence of these models - not the required but oftentimes obfuscating software engineering aspect. Cross-checking these implementations with the papers will allow you to truly grasp how an architecture works. Once you've intuitively grasped these - go ahead and add a check for input types and expand the functionality of your model!
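To get a feel for what the simplified block costs in parameters, the counts can be worked out by hand. This is a minimal sketch, assuming a 3-channel (RGB) input as in the stock VGG16 configuration; `conv2d_params` is a hypothetical helper written for illustration, not part of Keras:

```python
def conv2d_params(kernel_size, in_channels, filters, use_bias=True):
    """Trainable parameters in a standard 2D convolution layer."""
    kernel_height, kernel_width = kernel_size
    weights = kernel_height * kernel_width * in_channels * filters
    biases = filters if use_bias else 0
    return weights + biases


# Assumption: the input image has 3 channels (RGB)
block1_conv1 = conv2d_params((3, 3), in_channels=3, filters=64)
# The second convolution sees the 64 feature maps produced by the first
block1_conv2 = conv2d_params((3, 3), in_channels=64, filters=64)
# Max pooling only downsamples, so it has no trainable parameters
block1_pool = 0

print(block1_conv1)  # 1792
print(block1_conv2)  # 36928
print(block1_conv1 + block1_conv2 + block1_pool)  # 38720
```

These are the same numbers you should see in the `Param #` column of `summary()` for the first block, which makes this a handy sanity check when cross-checking an implementation against a paper.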

Note: As of writing, KerasCV is in the making, and it has a lot of overlap with Keras Applications. As per the roadmap - once KerasCV applications have matured enough (unknown when), a deprecation warning will be added to Keras Applications. This doesn't change the fact that Keras Applications are a really good repository for reading and cross-checking for architectures, but depending on when you're reading this - KerasCV might've already overtaken. You can stay updated by following the documentation and/or the KerasCV GitHub repository.

Here's a timeline of notable developments, as per PapersWithCode:

This leaderboard is measured by Top-1 Accuracy - though, most benchmarks for ImageNet use Top-5 Error Rate to gauge network performance instead. The leaderboard is different depending on which metric you look at, and we'll be using the latter when referring to a network outperforming another.
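To make the difference between the two metrics concrete, here's a minimal pure-Python sketch. The scores and labels are made up for illustration, and `top_k_accuracy` is a hypothetical helper rather than the Keras metric of a similar name:

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from the highest score to the lowest
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda i: sample_scores[i],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 4 samples over 4 classes
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1: correct at top-1
    [0.50, 0.30, 0.15, 0.05],  # true class 1: missed at top-1, found at top-2
    [0.05, 0.15, 0.30, 0.50],  # true class 3: correct at top-1
    [0.40, 0.30, 0.20, 0.10],  # true class 3: missed at top-1 and top-2
]
labels = [1, 1, 3, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
top2_error = 1 - top2  # the error rate is simply the complement of the accuracy

print(top1, top2, top2_error)  # 0.5 0.75 0.25
```

With 1000 ImageNet classes you'd typically use `k=5`; a smaller `k` is used here only because the toy example has 4 classes.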

This is the only lesson in which I'll include the outputs of some of the summary() calls, since the focus of the lesson is on the architectures themselves, and inspecting the summary helps a lot here. For those that are unwieldy for showing in the lesson for formatting reasons - inspect them on your local machine.

Note: Leaderboards change all the time. The "best" image model is "best" only for a short time and many comparisons are outdated. A recent notebook by Jeremy Howard compares various image models with an interactive visualization.

### Datasets for Rapid Testing

All modern architectures are tested on ImageNet, amongst other benchmarks like CIFAR10 and CIFAR100. Most models that are ported and built into libraries like Keras are pre-trained on ImageNet as well. From 2010 to 2017, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was held annually, with teams racing to compare their shiniest new models. The tests on ImageNet are so popular that a paper was written about them, called "Do ImageNet Classifiers Generalize to ImageNet?". Spoiler alert - they do (but, they're still sensitive to data cleaning and the way the datasets are prepared, which can lead to drops in accuracy).

The main problem is - ImageNet is large. Very large. 14M images to be more precise. The most widely used subset of ImageNet, known as ImageNet-1K (1000 classes, 1.2M images for training, 50k for validation, 100k for testing), takes up roughly 150GB of HDD space. The larger subset, ImageNet-21K contains over a TB of data.

While the average cost per gigabyte has been falling over the years, down to around $0.04/GB today (making 150GB of HDD storage cost around $6) - the issue isn't in storing the dataset. The main hurdle remains processing it all. It's a hassle to train networks on so many images, especially if you're trying things out, tweaking them a bit here and there and training again while learning. So, what other options do you have?

There are various benchmark datasets you can go with, but they're generally in a different domain than ImageNet, and performance on them doesn't have to translate to ImageNet. While your end goal shouldn't be ImageNet (but generic vision), if a network performs well there, there's a higher probability it will perform well on other generic datasets. For rapid testing though - you'll want something smaller. CIFAR10 and CIFAR100 have really small images, which isn't realistic, and they're somewhat synthetic.

#### Tiny ImageNet-200

A few years ago, Fei-Fei Li, Andrej Karpathy, and Justin Johnson compiled Tiny ImageNet-200! A 200-class version of ImageNet, with downsized images (64x64), where each class has 500 samples for training and 50 samples for testing - so in total, 100k training images, and 10k test images. It's available on Kaggle and Stanford's servers at: http://cs231n.stanford.edu/.

The dataset is a valid alternative to ImageNet, but can still take a fair bit of time to train on if you're training many models while just learning, like we'll be training in this lesson.

#### ImageNet Resized

ImageNet Resized is similar to Tiny ImageNet-200 in that the images are resized down to 8x8, 16x16, 32x32 and 64x64 - but it contains 1.2M images as per ImageNet-1K! While the labels are jumbled up (you won't be able to use the decode_predictions() function to get accurate labels), you can write your own wrapper function for them. Alas - the main downside to this dataset, like the previous one, is that the images are unrealistically small, and it takes even longer to train on since it contains around 12x as much data as Tiny ImageNet-200. You're free to follow this lesson with either of these datasets, if you're ready to wait a few hundred hours for all of the models to train on them.

ImageNet Resized is available via TensorFlow Datasets as well, making it a super simple way to load and work with it:

import tensorflow_datasets as tfds

# Note the order of the splits: the official validation split serves as the
# test set, and the last 20% of the training split serves as the validation set
(train_set, test_set, valid_set), info = tfds.load(
    "imagenet_resized/64x64",
    split=["train[:80%]", "validation", "train[80%:]"],
    as_supervised=True,
    with_info=True,
)

class_names = info.features["label"].names
n_classes = info.features["label"].num_classes
print(f"Class names: {class_names}")  # ['n02119789', 'n02100735', 'n02110185', ...]
print("Num of classes:", n_classes)  # Num of classes: 1000

print("Train set size:", len(train_set))  # Train set size: 1024934
print("Test set size:", len(test_set))  # Test set size: 50000
print("Valid set size:", len(valid_set))  # Valid set size: 256233
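The split sizes printed above can be sanity-checked with a bit of arithmetic. This is a rough sketch that assumes the split boundary is rounded to the nearest example, which reproduces the numbers shown but isn't taken from the TensorFlow Datasets source. The total of 1,281,167 training images is the sum of the two printed train-derived sizes, and `percent_split_sizes` is a hypothetical helper:

```python
def percent_split_sizes(total, percent):
    """Sizes of the '[:percent%]' and '[percent%:]' slices of a split.

    Assumption: the boundary is rounded to the nearest example.
    """
    boundary = round(total * percent / 100)
    return boundary, total - boundary


# 1,024,934 + 256,233 = 1,281,167 images in the full training split
train_size, valid_size = percent_split_sizes(1_281_167, 80)
print(train_size, valid_size)  # 1024934 256233
```

The same helper is handy for estimating how many steps an epoch will take before you commit to downloading a dataset.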


The 64x64 version takes up around 13GB of space, 32x32 takes up around 3.5GB, 16x16 takes up a bit less than a GB and 8x8 takes up around 300MB.

fig = tfds.visualization.show_examples(train_set, info)
plt.show()


#### Imagenette

Prototyping is really important while designing networks, and while learning, you'll benefit from fast training. For this exact reason - Jeremy Howard, the author of fast.ai, created Imagenette! A small version of ImageNet, just 3GB in size (7K for training, 2.3K for testing/validation), with 10 classes. The major upside of this dataset is that it uses native image sizes! They're not awkwardly resized to small unrealistic images.

Note: It's worth noting that in the last couple of years, major strides have been made in speeding up training. It used to take 14 days to train a ResNet50 (which we'll be implementing from scratch shortly) on an M40 GPU, for 90 epochs. Notably, Jeremy Howard, Andrew Shaw and Yaroslav Bulatov managed to train an ImageNet classifier to 93% top-5 accuracy in 18 minutes using 128 NVIDIA V100 GPUs, spread across 16 AWS cloud instances, which cost them only about $40. This is possible because we aren't as limited by hardware anymore - we're more limited algorithmically. The team used several techniques, such as progressive resizing, rectangular image validation, Tencent's weight decay tuning, Google Brain's dynamic batch sizes and gradual learning rate warm-up to make this possible. For more information - you can read Jeremy's blogpost here. We'll cover some of these techniques in the lesson on optimizing model training later on.

If you don't feel like paying or investing a lot of time in optimization from the get-go (don't optimize too early) - you'll probably want to try things out on Imagenette, and only move on to the more time-expensive and resource-expensive endeavor when you think you've got a hit! The point is to minimize the time between testing architectures out and getting to the stage where you'll want to apply several optimization techniques and pay for an instance in the cloud to train your model on.


Let's load Imagenette in through TensorFlow Datasets (makes it so much easier and allows us to prefetch the data seamlessly):

(train_set, test_set, valid_set), info = tfds.load(
    "imagenette",
    split=["train[:70%]", "train[70%:]", "validation"],
    as_supervised=True,
    with_info=True,
)

class_names = info.features["label"].names
n_classes = info.features["label"].num_classes
print(f"Class names: {class_names}")  # ['n01440764', 'n02102040', 'n02979186', 'n03000684', 'n03028079', 'n03394916', 'n03417042', 'n03425413', 'n03445777', 'n03888257']
print("Num of classes:", n_classes)  # Num of classes: 10

print("Train set size:", len(train_set))  # Train set size: 7102
print("Test set size:", len(test_set))  # Test set size: 947
print("Valid set size:", len(valid_set))  # Valid set size: 1420

Since the labels are inherited from ImageNet and aren't very readable - let's create a dictionary of labels to readable names, and a function to fetch the name based on the label:

labels = {
    'n01440764': 'tench',
    'n02102040': 'English springer',
    'n02979186': 'cassette player',
    'n03000684': 'chain saw',
    'n03028079': 'church',
    'n03394916': 'French horn',
    'n03417042': 'garbage truck',
    'n03425413': 'gas pump',
    'n03445777': 'golf ball',
    'n03888257': 'parachute'
}

def label_to_classname(label):
    return labels[label]

label_to_classname('n03425413')  # 'gas pump'

Let's visualize a batch of 25 images:

fig = plt.figure(figsize=(10, 10))

# Each entry is indexed with [0], so the set is expected to be batched already
# (see the preprocessing step); this plots the first image of 25 batches
for index, entry in enumerate(test_set.take(25), start=1):
    sample_image = np.squeeze(entry[0].numpy()[0])
    sample_label = label_to_classname(class_names[entry[1].numpy()[0]])
    ax = fig.add_subplot(5, 5, index)
    ax.imshow(np.array(sample_image, np.int32))
    ax.set_title(f"Class: {sample_label}")
    ax.axis('off')

plt.tight_layout()
plt.show()

Note: There's one downside to using Imagenette. These are easily identifiable classes! There's a pretty big difference between a parachute and a garbage truck.
Any network that can at least kind of get the difference between these will get a similar score to a network that would be able to also discern between a garbage truck and a cement mixer truck, which the first wouldn't be able to do. Because of this - there's a serious effect of diminishing returns when it comes to accuracy metrics and it can feel like the newer architectures aren't really all that better than older ones.

You can try this same lesson out on the "food101" dataset, which contains a much larger set of images (101K) across 101 food categories (some of which are fairly similar), or on "imagenet_resized/nxn". Though, be wary that this'll probably take much longer than most are willing to invest in training time while learning. Being able to quickly prototype, test things out and observe results in this case trumps a linear sense of progress between implementations.

#### Keras' Preprocessing Layers

For the preprocessing step - we'll only resize the images (adding a crop first is optional) since they're not uniform in size. Due to the lack of support for batching and otherwise working with datasets of varying image sizes with TensorFlow, you'll generally want to resize them to the same shape:

def preprocess_image(image, label):
    # Resize every image to 224x224 so the dataset can be batched;
    # tf.image.resize() doesn't preserve the aspect ratio by default
    resized_image = tf.image.resize(image, [224, 224])
    return resized_image, label

train_set = train_set.map(preprocess_image).batch(16).prefetch(tf.data.AUTOTUNE)
test_set = test_set.map(preprocess_image).batch(16).prefetch(tf.data.AUTOTUNE)
valid_set = valid_set.map(preprocess_image).batch(16).prefetch(tf.data.AUTOTUNE)

For the augmentation itself - naturally, it'll help, so we'll want to use it. Though, since the tf.image API is still fairly limited (fixed angles, for instance), we'll use Keras' augmentation capabilities instead. This marks the third TF-Keras option for data augmentation, and that's using Keras Preprocessing Layers:

- Resizing
- Rescaling
- RandomFlip
- RandomRotation
- etc.


They'll be accessible via keras.layers.LayerName or keras.layers.experimental.preprocessing.LayerName, depending on the version of TensorFlow/Keras you're using. The amazing thing about them is - you can use them through a standalone preprocessing step or you can make them part of the model itself! That technically makes the latter the fourth way to perform data augmentation.

So, which one should you use anyway? Any of them that make sense for you. Seriously, there's no "right" way between these - only more or less versatile ways.

The tf.image API is hands down the least expressive of the bunch. Augmentation through the ImageDataGenerator works best when loading from directories, so if you don't have directories of images, you're not really likely to convert a tf.data.Dataset into NumPy arrays to feed them into an ImageDataGenerator. In our case, when working with a tf.data.Dataset object, using the preprocessing layers makes a lot of sense - they're expressive (like the ImageDataGenerator augmentation options, and more expressive than tf.image), and can be used both outside and inside of the model!

If you use them as a distinct part of the model, they'll be saved within the architecture, so you can make it much more robust and general in terms of input. Instead of processing it before feeding into the model for inference, you can simply feed the raw image and have the model work out the rest. It may seem "odd" to have the model "see" the image, and then rotate it before passing it through other layers, but really, the net result is quite literally the same as if you've passed it through a preprocessing step before that.

I personally like including layers within the models themselves, because it makes them more robust to input. Additionally, you can tweak the layers instead of the data when trying to optimize the augmentation parameters.
You'll have to rebuild the model anyway to not carry over previous training, and this way - the rebuilding is contained in the model - no need to reload the training data.

Additionally - having external preprocessing functions, while technically more flexible, is cumbersome. There shouldn't be a need for users to pass input through dedicated, different functions before using models. This makes it especially annoying if you're serving models in non-Python environments. By bundling as much as you can within a model, more can be done with less code. To create true end-to-end models, that genuinely accept raw images (without a bunch of preprocessing on the server before it gets passed to the model), you'll want to use preprocessing layers.

This builds a strong case for using Keras preprocessing layers as a preferred format for data preprocessing and augmentation. If you have huge datasets, you'll benefit from using a data generator, such as ImageDataGenerator, but you don't have to apply any augmentations there - you can still use preprocessing layers!

In KerasCV, over 28 new layers are being added, as of writing, and this list can also expand in the future! With so many options, it's reasonable to expect Keras preprocessing layers to become much more common in standard pipelines and architectures in the future.

Note: Before applying augmentation - try training without it for a baseline.

#### KerasCV Preprocessing Layers

KerasCV introduces new preprocessing layers and metrics (amongst other planned additions)! You can find the expanding list of features in the official documentation, and naturally, KerasCV layers work just like regular Keras layers. They're separated out to a different package because they were a bit too CV-specific to be included in the main Keras package, but were important enough to warrant being implemented officially as well.
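Since preprocessing layers behave like any other layer, bundling them into a model might look something like the sketch below. It assumes a TensorFlow version where the layers live directly under `keras.layers` (2.6 or newer, as far as I'm aware), and the tiny convolutional body, the 10-class head and the `raw_batch` input are placeholders for illustration only:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Hypothetical end-to-end model: resizing, rescaling and augmentation
# are part of the model, so it accepts raw images of any size
inputs = keras.Input(shape=(None, None, 3))
x = keras.layers.Resizing(224, 224)(inputs)   # uniform spatial size
x = keras.layers.Rescaling(1.0 / 255)(x)      # [0, 255] -> [0, 1]
x = keras.layers.RandomFlip("horizontal")(x)  # only active during training
x = keras.layers.RandomRotation(0.1)(x)       # only active during training

# Placeholder body and classification head (assumed, not from the lesson)
x = keras.layers.Conv2D(16, 3, activation="relu")(x)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# A fake batch of two raw 300x400 RGB images with pixel values in [0, 255]
raw_batch = np.random.randint(0, 256, size=(2, 300, 400, 3)).astype("float32")
predictions = model(raw_batch, training=False)
print(predictions.shape)  # (2, 10)
```

The random layers are skipped when `training=False`, so inference on raw images only goes through the resizing and rescaling steps.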
As research progresses, so do training techniques, and KerasCV implements some of the "hot new" techniques in terms of preprocessing and augmentation as well. Instead of manually implementing these - just plug and play a layer such as:

keras_cv.layers.RandAugment(value_range, augmentations_per_image=3, magnitude=0.5, ...)

RandAugment is meant to be the "holy grail" of augmentation layers. It applies randomly chosen augmentations to an input image, drawing on other augmentation/preprocessing layers. CutMix, MixUp and RandAugment have recently been gaining traction, and you might want to try them out in some of the code samples from this lesson as well.

### Optimizers, Learning Rate and Batch Size?

Datasets affect learning - nothing new there. All hyperparameters are in for some tuning whenever you change a dataset, and what worked the best in a paper probably won't reflect in a 1:1 mapping to your own dataset. In this lesson, we'll be referencing many papers and their hyperparameters. They probably won't work for you straight out of the box. There are, thankfully, rules of thumb to follow when translating findings to your own local environment.

First off - optimizers. In earlier lessons, I've noted that Adam is a pretty solid default optimizer. Most papers you'll see in this lesson use SGD, with a momentum of 0.9 (and Nesterov acceleration). Some of these papers were released before Adam, so that makes sense, but some were released after Adam and still didn't use it for training. In some cases, you can replace the SGD with Adam and it'll work great. In some cases, you won't be able to, such as with AlexNet and VGG, due to their large parameter counts. Many modern networks have far fewer parameters, so you'll probably be able to use Adam with them.

The question is - should you? The answer isn't clear cut if you read papers. There's a place for both Adam and SGD... and RMSprop, and other optimizers, even today.
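For reference, the "SGD with a momentum of 0.9 and Nesterov acceleration" setup that keeps showing up in papers boils down to a small update rule. The sketch below follows the rule as I understand it from the Keras SGD documentation, applied to a single scalar weight on a made-up toy problem; `sgd_momentum_step` is a hypothetical helper, not a library function:

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9, nesterov=True):
    """One SGD-with-momentum update for a single scalar weight."""
    # The velocity accumulates an exponentially decaying sum of past gradients
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        # Nesterov acceleration: step using the "looked-ahead" velocity
        w = w + momentum * velocity - learning_rate * grad
    else:
        w = w + velocity
    return w, velocity


# Toy problem (assumed for illustration): minimize f(w) = w^2, gradient 2w
w, velocity = 5.0, 0.0
for _ in range(300):
    w, velocity = sgd_momentum_step(w, velocity, grad=2 * w)

print(abs(w) < 1e-3)  # True
```

In practice you'd simply pass `momentum=0.9, nesterov=True` to the optimizer; the point here is only to show what those two arguments change.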
SGD has been observed to generalize better than Adam by the end of training in studies such as "The Marginal Value of Adaptive Gradient Methods in Machine Learning". Adam trains and converges faster than SGD, especially in the initial stages of training, but generalizes worse according to the authors.

In another camp, in "On Empirical Comparisons of Optimizers for Deep Learning", the authors note that there is currently no theory that explains which optimizer you'll want to choose, and they've empirically tested widely used optimizers. The takeaway of their research is that Adam never underperforms SGD. How can someone make such a confident statement? Adam and RMSprop can simulate SGD with momentum. Thus, SGD with momentum can be seen as a special case of Adam. A general optimizer, according to the authors, can perform at least as well as one of its special cases, when tuned to simulate it. The issue is - tuning Adam to simulate SGD can be even more finicky than tuning SGD to perform well, so in those cases, you might as well just use SGD.

In conclusion - who should you trust? Yourself. Try Adam. Try SGD. Try RMSprop. Compare the results, tune them, then compare the results again. Do this over time, as your data changes and shifts. What's stable once might not be as stable later.

Next - learning rates! When translating learning rates from research papers to your own code, you can follow the linear scaling rule, for non-adaptive optimizers. If someone uses a learning rate of 0.1 on a batch size of 256 - and you use a batch size of 128, your learning rate should be set to 0.05.

$$LR_2 = LR_1 \cdot \frac{b_2}{b_1}$$

Where your learning rate ($LR_2$) is linearly increased or decreased compared to the original learning rate ($LR_1$) based on the ratio of your batch size to the original batch size ($b_2 / b_1$).
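The linear scaling rule above is a one-liner in code. Here's a minimal sketch, where `scale_learning_rate` and its argument names are made up for illustration:

```python
def scale_learning_rate(reference_lr, reference_batch_size, batch_size):
    """Linear scaling rule: LR2 = LR1 * (b2 / b1)."""
    return reference_lr * (batch_size / reference_batch_size)


# The example from the text: a paper uses LR=0.1 with a batch size of 256
print(scale_learning_rate(0.1, 256, 128))  # 0.05
# Scaling the same reference down to a batch size of 32
print(scale_learning_rate(0.1, 256, 32))  # 0.0125
```

Treat the result as a starting point for tuning rather than a guaranteed optimum, since the rule is a heuristic for non-adaptive optimizers.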
For Adam, this isn't as important, and you're quite likely going to do just fine with the default learning rate of 1e-3 for most tasks, though, you might want to try values between 1e-3 and 1e-5 as well.

Finally, batch sizes! I've noted before that you probably don't want to go above 32. Most papers here will use a batch size of 256. Why? Larger batch sizes allow for better parallelization and lead to faster training. But, they also lead to worse generalization and sharper minima, according to "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima". Flat minima are preferable to sharp minima because they're more robust, and a model can potentially be tuned further in that valley, rather than "hitting the walls" around it in a sharp minimum.

In "Practical recommendations for gradient-based training of deep architectures", Yoshua Bengio notes that any batch size above 10 utilizes at least some of the matrix-matrix product optimizations, and that 32 is a good default batch size.

For all practical purposes - a larger batch size means stronger hardware. You probably won't be able to run large batch sizes on your home computer or even on many cloud-based providers. Using a smaller batch size, between 8 on the lower end and 32 on the higher end, is a pretty safe bet. If your computer can't handle a batch size of 32 - most cloud-based providers can, even for larger images and more expensive architectures.

### Performance Metrics

When talking about the performance of a network, the term "performance" is tied to a metric. Depending on which metric you tie it to - one network "outperforms" the other.
It's worth noting what metrics people oftentimes take into consideration, since if your conception of the metric is misaligned with the author's, you might be disappointed:

- Computational efficiency
- Parameter efficiency
- Top-K Accuracy Rate
- Top-K Error Rate
- Convergence Speed
- Training Speed
- Inference Speed
- FLOPs

Oftentimes, you'll find "better efficiency" or "faster" attached to an architecture, as compared to another. Are they talking about parameter efficiency? Computational efficiency? Sometimes, it's unclear. Additionally, does better efficiency mean faster training? Not necessarily. It's mainly tied to how efficiently an operation can be done, but many efficient operations might run slower or cost more than a single inefficient one. Finally - efficiency doesn't necessarily mean that a network is lightweight. Even more efficient networks might need more VRAM to work than their inefficient counterparts.

Computational efficiency is concerned with the efficiency with which computation is performed.

Parameter efficiency is concerned with how efficiently parameters are being utilized. If many parameters are near-0, they aren't adding much to the network, and could technically be pruned away. As a rule of thumb - the lower the parameter count, and the higher the accuracy is, the better the parameter efficiency.

Top-K accuracy and Top-K error rate are two sides of the same coin. Top-1 Accuracy leaderboards are typically different from Top-5 Accuracy leaderboards. One network might be better at approximately guessing the class (getting it somewhere in its top guesses), while another confidently names the right class more often but is further off whenever it misses. Pretty commonly, Top-5 Error Rate is used instead of Top-5 Accuracy. A jump from 94% to 95% in accuracy intuitively feels smaller than a drop from 6% to 5%, and it's easier to quantify the progress that way.
We're more concerned with what the network got wrong, rather than the bulk that it got right, since networks are already pretty good at getting things right.

Convergence speed is concerned with how many epochs it takes to converge and find an acceptable minimum. Even if it takes longer to train an epoch than with another network - if it converges in fewer epochs, it has better convergence speed.

Training speed is concerned with how long training takes per epoch/sample.

Inference speed is concerned with how fast a network can perform inference in production (some are too slow to be practical for real-time usage or for mobile devices).

Finally - FLOPs (Floating Point Operations - a count of operations, not to be confused with FLOPS, Floating Point Operations Per Second, which measures hardware throughput). The lower you have, the less compute you're using! It is desirable to have fewer FLOPs because why do something in 10 steps if you can do it in 1?

That being said - having a more efficient or faster network can mean various things, depending on which metric you're tying the comparison to.

### Where to Find Models?

Other than the official implementations - where can you find models? So-called "model zoos" are a good place to take a look at, and TensorFlow Hub is one of the largest hubs to find pre-trained models that you can download and deploy with ease. You can search for models, collections and/or publishers and some collections contain dozens of models ready for you to download and plug in.

For instance, we'll be covering ConvNeXt later in the chapter, and due to it being fairly new, it's not as widely adopted officially as it will be some time in the near future. When a new model gets released - you don't have to wait for an official implementation, nor implement and train it yourself. Hop onto TensorFlow Hub and find it there.

If GUIs aren't your thing - you can download models straight into your project via an internet connection and the tensorflow_hub tool:

pip install --upgrade tensorflow_hub


import tensorflow_hub as hub