DenseNet - Using Collective Knowledge (2016)
By 2016, a clear pattern had emerged in architectures - many of them had some sort of shortcut connection between earlier and later layers. Highway Networks, ResNets, FractalNets - they all, in one way or another, used shortcut connections. In "Densely Connected Convolutional Networks", Gao Huang, Zhuang Liu et al. proposed another connectivity pattern - connecting every layer to every layer it could be connected to (i.e. all layers with the same feature map size). In a ResNet, shortcuts only exist within a block: Block_1's shortcut only skips over Block_1 itself. In a DenseNet, Block_1 also has a direct connection to Block_2, Block_3 and Block_4; Block_2, in turn, connects to Block_3 and Block_4, and so on. Naturally, Block_1 through Block_n also keep the standard feedforward connections on top of these shortcuts.
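Written out in the paper's notation, with H_l denoting the l-th layer's transformation (a composite of batch normalization, ReLU and convolution) and x_l its output, the two connectivity patterns look roughly like this:

```latex
% ResNet: the l-th layer combines its transformation with a shortcut from the previous layer
x_{\ell} = H_{\ell}(x_{\ell-1}) + x_{\ell-1}

% DenseNet: the l-th layer receives the feature maps of all preceding layers,
% where [x_0, x_1, \ldots, x_{\ell-1}] denotes their concatenation
x_{\ell} = H_{\ell}\left([x_0, x_1, \ldots, x_{\ell-1}]\right)
```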
The key difference is in how these connections are combined: ResNets add the outputs of layers. The output of a residual block is the sum of its input and the output of the transformations inside the block. In DenseNets, the connections are concatenated, not added - layer N's input is the stack of outputs of all preceding layers, not their sum. This density of connections is what gives DenseNets their name. It's interesting to see some teams exploring density in connections while others focus on sparsity, with both kinds of networks performing great! And, perhaps surprisingly, this density doesn't make DenseNets explode with parameters - it doesn't go against the common wisdom of avoiding relearning features, and they're actually quite parameter-efficient. They're more efficient than ResNets in this respect: experiments on ResNets showed that many layers add little to the predictive power of the network, whereas DenseNets suffer from this much less. It turns out that the dense connectivity lets the network avoid learning redundant feature maps, so there's no need to make the layers wide. Each layer only adds a narrow slice of new "collective knowledge", since it carries over all the "collective knowledge" that came before it. During the implementation, you'll notice how few filters we're actually using.
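To make that concrete, here's a minimal sketch of a dense block in Keras (not the paper's reference implementation): each layer produces only a handful of new feature maps and is fed the concatenation of everything that came before it. The growth rate of 12, the four layers and the input shape are illustrative choices, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    """Each layer sees the concatenation of the block input and all earlier layer outputs."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        # Only `growth_rate` filters per layer - the narrow slice of new "collective knowledge".
        y = layers.Conv2D(growth_rate, kernel_size=3, padding="same")(y)
        # Concatenate along the channel axis instead of adding, as a ResNet shortcut would.
        x = layers.Concatenate()([x, y])
    return x

inputs = layers.Input(shape=(32, 32, 16))
outputs = dense_block(inputs)  # 16 input channels + 4 layers * 12 new maps = 64 channels
model = tf.keras.Model(inputs, outputs)
model.summary()
```

Compare this with a residual block, where the last line inside the loop would be an addition (layers.Add()([x, y])), forcing the convolution to produce as many feature maps as the block receives as input.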