Semantic and instance segmentation are the natural next step after object detection, and they use much the same architectures, with new heads that predict masks rather than bounding boxes. Many object detection architectures can be converted into segmentation architectures, and some projects, such as Detectron2, ship both capabilities.
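To make the difference between masks and bounding boxes concrete, here is a minimal sketch in plain Python. The tiny 4x6 "image", its values, and the helper `mask_to_box` are all made up for illustration and don't come from any real model or library:

```python
# Hypothetical 4x6 label map with two objects of the same class.
# 0 = background, 1 and 2 = separate object instances (illustrative values).
instance_ids = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 0, 0],
]

# Semantic mask: one class id per pixel, so both instances collapse into class 1.
semantic_mask = [[1 if v > 0 else 0 for v in row] for row in instance_ids]


def mask_to_box(mask, instance_id):
    """Return (x_min, y_min, x_max, y_max) enclosing the pixels of one instance."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, v in enumerate(row)
        if v == instance_id
    ]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


# A box can always be derived from a mask, but not the other way around.
boxes = {i: mask_to_box(instance_ids, i) for i in (1, 2)}
print(boxes)
```

The semantic mask only says which class each pixel belongs to, while the instance map also keeps the two objects apart. A bounding box is a coarser summary of either one.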
In a previous project, we worked with YOLOv5 by Ultralytics, which doesn't currently support segmentation, though support is in the works. When it's released in a later version, I'll update the course.
As with every task, there are various architectures that can be employed to perform segmentation. Some of the defining ones are:
- Mask R-CNN - A Faster R-CNN variant that adds a branch predicting a mask for each detected object, alongside the existing box and class predictions. Mask R-CNNs produce pretty good results.
- U-Net - An Encoder-Decoder architecture that downsamples (encodes) and then upsamples (decodes) the input, with skip connections between the corresponding encoder and decoder stages. The architecture is typically drawn so that it looks like a "U", with the skip connections running between the left-hand and right-hand sides of the letter. While simple, it doesn't provide the best results.
- DeepLabV3+ - An Encoder-Decoder architecture that makes liberal use of Atrous (dilated) Convolutions and a module named Atrous Spatial Pyramid Pooling (ASPP). More on both in a moment. It's one of the most accurate models to date, and it's surprisingly easy to implement as an end-to-end model for semantic segmentation.
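As a quick preview of the atrous convolutions mentioned above, here is a minimal 1D sketch in plain Python. It is not how DeepLabV3+ is implemented (real models use 2D framework layers with padding). The function `atrous_conv1d` and the toy signal and kernel are assumptions made up for illustration:

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1D convolution (cross-correlation, as in deep learning layers)
    where kernel taps are spaced `rate` samples apart."""
    # Effective kernel size grows with the rate; the number of weights does not.
    span = (len(kernel) - 1) * rate + 1
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # three weights in both cases

dense = atrous_conv1d(signal, kernel, rate=1)    # taps cover 3 neighbouring samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # same 3 weights, taps cover 5 samples

print(dense)
print(dilated)
```

With `rate=1` this is an ordinary convolution. With `rate=2`, the same three weights span five input samples, so the receptive field grows without adding parameters. ASPP builds on this idea by applying several atrous convolutions with different rates in parallel and combining their outputs.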