ConvNeXt - A ConvNet for the 2020s (2022)
In recent years, Transformers have been "stealing the show" in NLP, but also notably in computer vision. It turns out that so-called Vision Transformers (ViTs) can be used for computer vision, and they perform quite well, at least for image classification. So well, in fact, that for some time they've held most of the top spots on ImageNet leaderboards, and still do. This is a new and still-young paradigm in computer vision, with much promise for the future. Plain ViTs haven't yet been applied as successfully to other computer vision tasks, such as segmentation or object detection, where CNN-based models are clearly in the lead. However, one family of Transformers, known as Swin Transformers, which are hierarchical, has been successfully applied as a generic vision backbone.

While some practitioners consider themselves as "belonging" to a camp - CNN or Transformer - many are exploring the idea of combining them, and as of writing, many top-performing architectures are based on a combination of CNNs and Transformers. A great example of a combination network is CoAtNet (a portmanteau of Convolution and Attention), which stacks convolutional layers and attention layers, unifying the two paradigms.
ConvNeXt came out in January of 2022 - just as the year started. It is a pure-convolutional network inspired by some of the recent advances with ViTs, appropriating a few concepts that helped it make a leap in CNN-based accuracy. Since then, within only 4 months, another 35 models have outperformed it, according to PapersWithCode. These Top-1 Accuracy reports are worth putting in perspective, though. The difference between ConvNeXt (87.8%) and CoCa (91%) isn't a huge one. ConvNeXt was trained on 14M ImageNet images, while CoCa was trained on the 3B images of JFT-3B (an internal Google dataset). And CoCa has a staggering 2.1B parameters (some other Transformer-based architectures have 7.2B), while ConvNeXt has a "mere" 350M - at least ConvNeXt-XL does; one of the smaller variants has 22M.