Xception - Extreme Inception (2016)
François Chollet, the author of Keras, developed a version of Inception - Extreme Inception, shortened to Xception. Inception was one of the first networks to move away from the conventional approach of naively stacking convolutions on top of each other. François notes something that's often forgotten by newer practitioners: when we perform a convolution on an input, each sample that gets fed into a Conv2D layer is, in fact, 3D. Conv2D accepts a 4D tensor (batch_size, height, width, channels) and performs convolutions over (height, width, channels). It's called a Conv2D, not a Conv3D, because we only slide the kernel along the X and Y axes - but each window spans all of the channels.
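To make the shapes concrete, here's a small, hypothetical example (the input size and filter count are arbitrary), showing that a Keras Conv2D consumes a 4D batch while its kernel, which only slides spatially, still spans every input channel:

```python
import tensorflow as tf

# A batch of 8 RGB "images" - a 4D tensor of (batch_size, height, width, channels)
inputs = tf.random.normal((8, 224, 224, 3))

# A "2D" convolution - the kernel slides only along height and width,
# but every 3x3 window covers all 3 input channels at once
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding="same")
outputs = conv(inputs)

print(outputs.shape)      # (8, 224, 224, 32)
print(conv.kernel.shape)  # (3, 3, 3, 32) - (kernel_h, kernel_w, in_channels, filters)
```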
In other words, we force a convolutional layer to learn spatial correlations (across width and height) and cross-channel correlations at the same time. The Inception architecture already offloaded some of this work by introducing 1x1 convolutions, which map only cross-channel correlations and perform dimensionality reduction.

Building on the idea that cross-channel correlations and spatial correlations are decoupled enough to be worth processing separately, François entertained the idea that they're decoupled enough to be mapped completely separately, and called Inception modules built on this principle the "extreme" version of Inception modules. If you run a 1x1 convolution to map cross-channel correlations, and then, completely separately, map the spatial correlations on every channel, you get something very similar to depthwise separable convolutions. Let's take a look at them first!
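Before we do, here's a rough Keras sketch of the factorization just described - purely illustrative, with an arbitrary input shape and filter count: a 1x1 Conv2D maps cross-channel correlations, then a DepthwiseConv2D maps spatial correlations on every channel independently.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Arbitrary, illustrative input - a batch of 224x224 RGB images
inputs = tf.keras.Input(shape=(224, 224, 3))

# Step 1: 1x1 convolution - mixes information across channels only,
# with no spatial extent (maps cross-channel correlations)
x = layers.Conv2D(filters=64, kernel_size=1, padding="same")(inputs)

# Step 2: depthwise convolution - one 3x3 spatial filter per channel,
# with no mixing across channels (maps spatial correlations)
x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)

model = tf.keras.Model(inputs, x)
model.summary()
```

Note that this mirrors the order of operations in the "extreme" Inception module: 1x1 first, then per-channel spatial filtering. A depthwise separable convolution, such as Keras' SeparableConv2D layer, applies the per-channel spatial convolution first and the 1x1 convolution second.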