ResNet - The Start of a Lineage (2015)
If you ran the VGG model, either on your own local setup or in the cloud, you've definitely noticed how much longer it took to train than AlexNet. VGGNet also took more deliberate design - fairly precise weight decay on the convolutional layers and careful testing of various SGD hyperparameters. Combined with the longer training times, this made it harder to iteratively improve the model's architecture. Additionally, the architecture suffered from depth degradation: deeper networks performed worse than their shallower counterparts. Just as vanishing gradients posed a hurdle to increasing network complexity (including depth), degradation posed a hurdle to scaling networks in depth. This was a pretty big hurdle, since with the advent of VGGNet and Inception networks, depth seemed to be a very desirable property of a network.
Vanishing gradients were relatively easy to fix - normalization between layers and an effective activation function such as ReLU were enough to keep the gradients stable during training, even in more complex networks. On the other hand - what do we do about degradation? The first line of reasoning was that deeper networks, with more layers, overfit more. However, this was proven to be false, because the training error went up with network depth as well! An overfit network would show a lower training error, not a higher one.
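As a brief aside on the vanishing gradient half of this comparison, here is a rough sketch of why the choice of activation function matters so much as depth grows. It is a deliberately simplified toy, not how a real network behaves: it ignores weights and normalization entirely and assumes every layer sees the same pre-activation value. The helper names (`sigmoid_grad`, `relu_grad`, `chained_grad`) are made up for this illustration.

```python
import math


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid at x. Its maximum is 0.25, at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU at x (taken as 0 at x == 0)."""
    return 1.0 if x > 0 else 0.0


def chained_grad(activation_grad, pre_activation, depth):
    """Product of `depth` activation derivatives via the chain rule.

    Simplifying assumption: every layer has the same pre-activation value,
    and the weights are ignored, so this only isolates the activation's effect.
    """
    return activation_grad(pre_activation) ** depth


# Best case for sigmoid (x = 0) versus an active ReLU unit (x > 0).
for depth in (5, 20, 50):
    sig = chained_grad(sigmoid_grad, 0.0, depth)
    rel = chained_grad(relu_grad, 1.0, depth)
    print(f"depth={depth:>2}  sigmoid factor={sig:.3e}  relu factor={rel:.1f}")
```

Even in the sigmoid's best case, the factor shrinks as 0.25 raised to the depth, while an active ReLU passes the gradient through unchanged. Degradation is a different problem: it shows up even when gradients are healthy, which is why it needed a different fix.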