Semantic and instance segmentation are the natural next step after object detection, and use much the same architectures with new heads that predict masks rather than bounding boxes. Many object detection architectures can be converted into segmentation architectures, and some projects, such as Detectron2, ship both capabilities.
We've worked with YOLOv5 by Ultralytics in a previous project. It doesn't currently support segmentation, but support is in the works. When it gets released in a later version, I'll update the course.
<h3 id="segmentationarchitectures">Segmentation Architectures</h3>
As with every task - there are various architectures that can be employed to perform segmentation. Some of the defining ones are:
<ul>
<li>Mask R-CNN - A Faster R-CNN variant which predicts masks for detected objects. Mask R-CNNs produce pretty good results.</li>
<li>U-Net - An Encoder-Decoder architecture that downsamples (encodes) and upsamples (decodes) input with skip connections between these steps. The architecture is typically visualized in a way that makes it look like a &quot;U&quot; (with skip connections between the left and right hand side of the &quot;U&quot; letter). While simple, it doesn't provide the best results.</li>
<li>DeepLabV3+ - An Encoder-Decoder architecture that liberally uses Atrous Convolutions and a module named the Atrous Spatial Pyramid Pooling (ASPP) module. More on both in a moment. One of the most accurate models to date, and it's surprisingly easy to implement as an end-to-end model for semantic segmentation.</li>
</ul>

David Landup

Intro

<h3 id="implementingatrousconvolutionalblocksandatrousspatialpyramidpooling">Implementing Atrous Convolutional Blocks and Atrous Spatial Pyramid Pooling</h3>
With the dataset ready, it's time to create our DeepLabV3+ model. Let's refer back to the diagram and translate it into code:
<img src="https://s3.stackabuse.com/media/guided+projects/deeplabv3-semantic-segmentation-with-keras-4.png" alt="">
The network uses several convolutional blocks, with differing dilation rates, both for the Atrous Spatial Pyramid Pooling (ASPP) module, and otherwise. Let's define a <code>conv_block()</code> for that first:
<pre><code class="hljs">from tensorflow import keras

# Conv-BN-ReLU block; turns into an atrous block with dilation_rate &gt; 1
def conv_block(block_input, num_filters=256, kernel_size=(3, 3), dilation_rate=1, padding=&quot;same&quot;):
    # Pass the padding argument through instead of hard-coding &quot;same&quot;
    x = keras.layers.Conv2D(num_filters, kernel_size=kernel_size, dilation_rate=dilation_rate, padding=padding)(block_input)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation(&#x27;relu&#x27;)(x)
    return x
</code></pre>
By default, it's a regular convolutional block. By setting the <code>dilation_rate</code> to anything above <code>1</code> - it becomes an &quot;atrous&quot; convolutional block. This is a pretty standard Conv-BN-ReLU block, with an adjustable <code>dilation_rate</code> parameter.
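To make the effect of <code>dilation_rate</code> more concrete, here is a minimal pure-Python sketch of a 1D dilated convolution. It is not how Keras implements the layer, and the helper names <code>effective_kernel_size()</code> and <code>dilated_conv1d()</code> are hypothetical, introduced only for illustration. The point it tries to show is that a kernel with <code>k</code> weights and dilation rate <code>d</code> spans <code>d * (k - 1) + 1</code> input positions, so the receptive field grows without adding parameters:

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Number of input positions spanned by a dilated kernel."""
    return dilation_rate * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation_rate=1):
    """'Valid' 1D cross-correlation with gaps of `dilation_rate` between taps."""
    span = effective_kernel_size(len(kernel), dilation_rate)
    output = []
    for start in range(len(signal) - span + 1):
        # Sample every `dilation_rate`-th input element under the kernel
        total = sum(
            weight * signal[start + i * dilation_rate]
            for i, weight in enumerate(kernel)
        )
        output.append(total)
    return output


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # same three weights in both cases

regular = dilated_conv1d(signal, kernel, dilation_rate=1)  # spans 3 inputs
atrous = dilated_conv1d(signal, kernel, dilation_rate=2)   # spans 5 inputs
```

Both calls use the same three weights; only the spacing between the sampled inputs changes. With <code>padding=&quot;same&quot;</code>, as in <code>conv_block()</code>, Keras pads the input so the output keeps the spatial size of the input regardless of the dilation rate.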
Now, let's define the ASPP module, one of the most important parts of DeepLabV3+. There's a small detail omitted from the diagram above - information on how &quot;Image Pooling&quot; is done.
<img src="https://s3.stackabuse.com/media/guided+projects/deeplabv3-semantic-segmentation-with-keras-9.png" alt="">
Regular Spatial Pyramid Pooling (on the left) downsamples the input and recovers the output from it by upsampling (it encodes the image into a denser vector and decodes that into a prediction). A U-Net-like encoder-decoder (b) also does this, but passes spatial information from the different scales of the downsampling path into the corresponding layers of the upsampling path. DeepLab tries to use the best of both approaches and performs Spatial Pyramid Pooling with intermediate shortcut injections of spatial context while upsampling.

DeepLabV3+ Implementation with Keras

Segmentation datasets, like object detection datasets, require a large upfront time investment. With segmentation datasets though, you'll typically be annotating everything in an image, instead of just objects of interest, and you'll be doing so more accurately along the borders of the object, instead of a box around it.
Because of this, you're likely going to be working with in-house segmentation datasets, labelled by a team trying to solve a particular problem. Segmentation models are more commonly applied to specific use-cases, and trying to train a general segmentation model, without a niche use in mind is rarer. Semantic segmentation models are more sensitive to domain shift than image classification models, in large part because image classification models are blind to many small differences by abstracting them away, while segmentation models pay much more attention to small details. The good thing is - since you're likely going to be making it for a niche problem, you'll have the same type of input during and after training!
Just like with classification and detection - datasets and models are typically single-class or multi-class. Single-class would be detecting &quot;traffic sign&quot; in an image or segmenting it. Multi-class would be detecting &quot;stop sign&quot;, &quot;crosswalk&quot; and &quot;wrong way&quot; signs, or segmenting them.
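At the label level, the difference can be sketched with tiny masks stored as plain Python lists. The class IDs and the helper name <code>to_binary_mask()</code> are hypothetical and chosen only for this example. A multi-class mask stores one class ID per pixel, while a single-class mask only records whether a pixel belongs to any traffic sign:

```python
# Hypothetical class IDs, for illustration only
BACKGROUND, STOP_SIGN, CROSSWALK, WRONG_WAY = 0, 1, 2, 3

# Multi-class mask: each pixel holds the ID of the class it belongs to
multi_class_mask = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [3, 2, 0, 0],
]


def to_binary_mask(mask, background=BACKGROUND):
    """Collapse every non-background class into a single 'traffic sign' class."""
    return [[0 if pixel == background else 1 for pixel in row] for row in mask]


single_class_mask = to_binary_mask(multi_class_mask)

# Distinct class IDs that occur in the multi-class mask
classes_present = sorted({pixel for row in multi_class_mask for pixel in row})
```

A real dataset would typically store such masks as image files or arrays, but the encoding idea is the same.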

Data Preprocessing and the Albumentations Library

Semantic segmentation is the process of segmenting an image into classes - effectively, performing pixel-level classification. Color edges don't necessarily have to be the boundaries of an object, and pixel-level classification only works when you take the surrounding pixels and their context into consideration.
<img src="https://s3.stackabuse.com/media/guided+projects/deeplabv3-semantic-segmentation-with-keras-13.png" alt="">
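"Pixel-level classification" can be illustrated with a small sketch, assuming a model has already produced one score per class for every pixel. The scores below are made up, and <code>scores_to_mask()</code> is a hypothetical helper, not part of any library. Each pixel is simply assigned the index of its highest-scoring class:

```python
# Made-up per-pixel class scores for a 2x3 image and 3 classes,
# laid out as scores[row][column][class_index]
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3]],
    [[0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.05, 0.15, 0.8]],
]


def scores_to_mask(pixel_scores):
    """Assign each pixel the index of its highest-scoring class (argmax)."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in pixel_scores
    ]


mask = scores_to_mask(scores)
```

The resulting <code>mask</code> has the same height and width as the image, with one class index per pixel. The context from surrounding pixels is taken into account earlier, when the model computes the scores, not in this final step.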
<blockquote>
In this Guided Project, you'll learn how to build an end-to-end image segmentation model, based on the DeepLabV3+ architecture, using Python and Keras/TensorFlow.
</blockquote>
Besides Mask R-CNNs, which perform well, and U-Net-like models, which don't perform as well, DeepLabV3+ is one of the state-of-the-art architectures for image segmentation. Besides being implemented within cutting-edge platforms like Detectron2, DeepLabV3+ powers many modern segmentation applications, particularly in medical and aerial imagery.
With high-level libraries like Keras, transferring ideas into code is easier than ever before, and we'll be converting high-level concepts into a functional model, implementing both U-Net and DeepLabV3+. This is also the perfect opportunity to introduce the Albumentations library for performant, effective data augmentation. By the end of the project, you'll know how to create highly performant image segmentation models with as few as 300 training images:
<img src="https://s3.stackabuse.com/media/guided+projects/deeplabv3-semantic-segmentation-with-keras-3.png" alt="">

 <div class="alert alert-note">
 <div class="flex">
 
 <div class="flex-shrink-0 mr-3">
 <img src="/assets/images/icon-information-circle-solid.svg" class="icon" aria-hidden="true" />
 </div>
 
 Note: This Guided Project is part of our in-depth course on <a target="_blank" href="https://stackabuse.com/courses/practical-deep-learning-for-computer-vision-with-python/">Practical Deep Learning for Computer Vision</a> and assumes that you've read the previous lessons or have that prerequisite knowledge from before.

 </div>
 </div>