Instance Segmentation with YOLOv7 in Python

Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". Instance segmentation grew out of it: models are tasked with predicting not only the label and bounding box of an object, but also the "area" it covers - classifying each pixel that belongs to that object.

Semantic Segmentation classifies all pixels in an image to their semantic label (car, pavement, building). Instance Segmentation classifies all pixels of each detected object individually, and Car1 is differentiated from Car2.
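To make the distinction concrete, here is a small NumPy sketch on a toy 4x6 "image" with two cars. The array values and names (semantic_map, car1, car2) are made up for illustration and aren't tied to any model output:

```python
import numpy as np

# Toy 4x6 "image" with two cars side by side; 0 = background, 1 = car.
semantic_map = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: one boolean mask per detected object.
car1 = np.zeros_like(semantic_map, dtype=bool)
car2 = np.zeros_like(semantic_map, dtype=bool)
car1[1:3, 1:3] = True
car2[1:3, 4:6] = True
instance_masks = np.stack([car1, car2])  # shape: (num_instances, H, W)

# Collapsing the instances recovers the semantic "car" map,
# but the semantic map alone can't tell Car1 from Car2.
recovered = instance_masks.any(axis=0).astype(semantic_map.dtype)
```

The semantic map can always be derived from the instance masks, but not the other way around, which is why instance segmentation is the stricter task.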

Conceptually - they're similar, but instance segmentation combines semantic segmentation and object detection. Thankfully, object detection, semantic segmentation and, by extension, instance segmentation can be done with a common back-end and different network heads, since the tasks are conceptually similar and can share the learned representations.
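The "shared back-end, different heads" idea can be sketched with a toy NumPy model. Everything here (the layer sizes, the random weights, the function names) is hypothetical and only illustrates the structure, not YOLOv7's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 16 input features, 8 shared features, 5x5 masks.
W_backbone = rng.normal(size=(16, 8))
W_box_head = rng.normal(size=(8, 4))     # predicts (x1, y1, x2, y2)
W_mask_head = rng.normal(size=(8, 25))   # predicts a flattened 5x5 mask

def backbone(x):
    # Shared feature extractor (a single ReLU layer in this sketch).
    return np.maximum(x @ W_backbone, 0.0)

def box_head(features):
    return features @ W_box_head

def mask_head(features):
    logits = features @ W_mask_head
    return logits.reshape(-1, 5, 5) > 0

x = rng.normal(size=(3, 16))   # batch of 3 toy inputs
features = backbone(x)         # computed once...
boxes = box_head(features)     # ...and reused by every head
masks = mask_head(features)
```

The expensive part (the backbone) runs once per image, and each additional task only adds a comparatively small head on top of the same features.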

Object detection, semantic segmentation, instance segmentation and keypoint detection aren't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.

Fortunately for the masses - Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5 which has been extended by other research and development teams into newer versions, such as YOLOv7.

In this short guide, we'll be performing Instance Segmentation in Python, with state-of-the-art YOLOv7.

YOLO and Instance Segmentation

YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community has continued with open-sourced advancements in the years since.

Ultralytics' YOLOv5 is a massive repository, and the first production-level implementation of YOLO in PyTorch, which has seen major usage in the industry. The PyTorch implementation made it more accessible than earlier versions, which were usually written in C++, but the main reason it became so popular is the beautifully simple and powerful API built around it, which lets anyone who can run a few lines of Python code build object detectors.

YOLOv5 has become such a staple that most repositories that aim to advance the YOLO method use it as a basis and offer a similar API inherited from Ultralytics. YOLOR (You Only Learn One Representation) did exactly this, and YOLOv7 was built on top of YOLOR by the same authors.

YOLOv7 is the first YOLO model that ships with new model heads, allowing for keypoints, instance segmentation and object detection, which was a very sensible addition. Hopefully, going forward, we'll see an increasing number of YOLO-based models that offer similar capabilities out of the box.

This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.

YOLOv7 was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.

The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.

Instance Segmentation with YOLOv7

A standard library used for instance segmentation, object detection and keypoint estimation in Python is Detectron2, built by Meta AI.

The library offers various convenience methods and classes to help visualize results beautifully, but the underlying implementation for detection is a Mask R-CNN. YOLO models have generally been reported to outperform R-CNN-based models in inference speed, often at comparable or better accuracy. The YOLOv7 repository is Detectron2-compatible and is compliant with its API and visualization tools, making it easier to run fast, accurate instance segmentation without having to learn a new API. You can, in effect, swap out the Mask R-CNN model and replace it with YOLOv7.

Advice: If you'd like to read more about Detectron2 - read our "Object Detection and Instance Segmentation in Python with Detectron2"!

Installing Dependencies - YOLOv7 and Detectron2

Let's first go ahead and install the dependencies. We'll clone the GitHub repo for the YOLOv7 project, and install the latest Detectron2 version via pip:

! git clone -b mask
! pip install pyyaml==5.1
! pip install 'git+'

Detectron2 requires pyyaml as well. To ensure compatibility, you'll also want to specify the running torch version:

! pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f

The main branch of YOLOv7 doesn't support instance segmentation, as it has a dependency on a third-party project. However, the mask branch was made exactly for this support, so we're installing the mask branch of the project. Finally, you'll want to download the pre-trained weights for the instance segmentation model either manually or with:

%cd yolov7
! curl -L -o

We've first moved into the yolov7 directory (the downloaded directory containing the project) and then downloaded the weights file there. With that - our dependencies are set up! Let's import the packages and classes we'll be using:

import matplotlib.pyplot as plt
import torch
import cv2
import yaml
from torchvision import transforms
import numpy as np

from utils.datasets import letterbox
from utils.general import non_max_suppression_mask_conf

from detectron2.modeling.poolers import ROIPooler
from detectron2.structures import Boxes
from detectron2.utils.memory import retry_if_cuda_oom
from detectron2.layers import paste_masks_in_image

Instance Segmentation Inference with YOLOv7

Let's first take a look at the image we'll be segmenting:

street_img = cv2.imread('../street.png')
street_img = cv2.cvtColor(street_img, cv2.COLOR_BGR2RGB)

fig = plt.figure(figsize=(12, 6))
plt.imshow(street_img)

It's a screenshot from the live view of Google Maps! Since the model isn't pre-trained on many classes, we'll likely only see segmentation masks for classes like 'person', 'car', etc. without "fine-grained" classes like 'traffic light'.

We can now get to loading the model and preparing it for inference. The hyp.scratch.mask.yaml file contains configurations for hyperparameters, so we'll initially load it in, check for the active device (GPU or CPU), and load the model from the weights file we just downloaded:

with open('data/hyp.scratch.mask.yaml') as f:
    hyp = yaml.load(f, Loader=yaml.FullLoader)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def load_model():
    model = torch.load('', map_location=device)['model']
    # Put in inference mode
    model.eval()

    if torch.cuda.is_available():
        # half() casts the model's weights to float16,
        # which significantly lowers inference time
        model.half().to(device)
    return model

model = load_model()
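As a rough illustration of what half precision trades away, here is a NumPy analogue (not the PyTorch call itself): casting float32 values to float16 halves the memory footprint at the cost of a small rounding error. The array below is a made-up stand-in for model weights:

```python
import numpy as np

# Stand-in for a block of model weights (hypothetical values).
weights32 = np.full(1000, 0.1, dtype=np.float32)
weights16 = weights32.astype(np.float16)

bytes32 = weights32.nbytes   # 4 bytes per value
bytes16 = weights16.nbytes   # 2 bytes per value

# float16 keeps roughly three significant decimal digits,
# so the cast introduces a small rounding error.
max_error = np.abs(weights16.astype(np.float32) - weights32).max()
```

Whether half precision actually speeds up inference depends on the hardware; it tends to pay off on GPUs, which is presumably why the model is only cast when CUDA is available.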

Next, let's create a helper method to run inference! We'll want it to read an image, resize it and pad it to the expected input size, apply transforms, batch it and pass it into the model:

def run_inference(url):
    image = cv2.imread(url) # shape: (height, width, 3), BGR channel order
    # Resize and pad image so both sides are multiples of the stride
    image = letterbox(image, 640, stride=64, auto=True)[0]
    # Apply transforms
    image = transforms.ToTensor()(image) # torch.Size([3, height, width]), values in [0, 1]
    # Match tensor type and device with the model (float16 on GPU, float32 on CPU)
    image = image.to(device=device, dtype=next(model.parameters()).dtype)
    # Turn image into batch
    image = image.unsqueeze(0) # torch.Size([1, 3, height, width])
    # Gradients aren't needed for inference
    with torch.no_grad():
        output = model(image)
    return output, image

output, image = run_inference('../street.png')

The function returns the output of the model, as well as the image itself (loaded, padded and otherwise processed). The output is a dictionary:

output.keys()
# dict_keys(['mask_iou', 'test', 'attn', 'bbox_and_cls', 'bases', 'sem'])
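Going back to the resize-and-pad step inside run_inference(): the arithmetic behind letterboxing can be sketched in plain Python. This is a simplified reading of what such a helper does (scale to fit, then pad up to a multiple of the stride), not the repository's implementation, and letterbox_shape is a hypothetical name:

```python
def letterbox_shape(height, width, new_size=640, stride=64):
    """Return (out_height, out_width, (top, bottom, left, right) padding)."""
    # Scale so the longer side fits new_size, keeping the aspect ratio.
    ratio = min(new_size / height, new_size / width)
    resized_h = int(round(height * ratio))
    resized_w = int(round(width * ratio))
    # Pad only as far as needed to reach a multiple of the stride
    # (this assumes new_size itself is a multiple of the stride).
    pad_h = (new_size - resized_h) % stride
    pad_w = (new_size - resized_w) % stride
    top, left = pad_h // 2, pad_w // 2
    bottom, right = pad_h - top, pad_w - left
    return resized_h + pad_h, resized_w + pad_w, (top, bottom, left, right)

# A 720x1280 frame is halved to 360x640, then padded to 384x640.
out_h, out_w, pads = letterbox_shape(720, 1280)
```

Padding to a stride multiple rather than to a full square keeps the input as small as possible while still being divisible by the network's downsampling factor.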

The predictions that the model made are raw - we'll need to pass them through non_max_suppression_mask_conf(), and utilize the ROIPooler from Detectron2.

Note: "ROI Pooling" is short for "Region of Interest Pooling" and is used to extract small feature maps for object detection and segmentation tasks, in regions that may contain objects.
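For intuition on what non-max suppression does with raw predictions, here is a minimal greedy NMS in NumPy. It's a generic textbook sketch, not the mask-aware function from the YOLOv7 repository, and the boxes, scores and 0.5 threshold are illustrative assumptions:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thres=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thres]
    return keep

boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],      # near-duplicate of the first box
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

The second box overlaps the first almost entirely, so only the higher-scoring one survives, while the distant third box is kept.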

inf_out = output['test']
attn = output['attn']
bases = output['bases']
sem_output = output['sem']

bases = torch.cat([bases, sem_output], dim=1)
nb, _, height, width = image.shape
names = model.names
pooler_scale = model.pooler_scale

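# Arguments after output_size follow the example notebook in the YOLOv7 mask branch
pooler = ROIPooler(output_size=hyp['mask_resolution'], scales=(pooler_scale,), sampling_ratio=1, pooler_type='ROIAlignV2', canonical_level=2)
# output, output_mask, output_mask_score, output_ac, output_ab
output, output_mask, _, _, _ = non_max_suppression_mask_conf(inf_out, attn, bases, pooler, hyp, conf_thres=0.25, iou_thres=0.65, merge=False, mask_iou=None)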

Here - we've obtained the predictions for objects and their labels in output and the masks that should cover those objects in output_mask:

output[0].shape # torch.Size([30, 6])
output_mask[0].shape # torch.Size([30, 3136])
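Each row of the detection output holds six values. Judging by how the columns are sliced later in this guide, the layout is (x1, y1, x2, y2, confidence, class). A NumPy sketch with made-up detections shows how such rows can be split and filtered:

```python
import numpy as np

# Hypothetical detections, one row per instance: (x1, y1, x2, y2, conf, cls)
detections = np.array([
    [ 34.0, 50.0, 120.0, 200.0, 0.91, 0.0],
    [300.0, 80.0, 420.0, 160.0, 0.62, 2.0],
    [ 10.0, 10.0,  20.0,  20.0, 0.12, 2.0],
])

boxes = detections[:, :4]                 # corner coordinates
confidences = detections[:, 4]            # detection confidence
class_ids = detections[:, 5].astype(int)  # index into the class names

# Keep only reasonably confident detections
confident = confidences >= 0.25
```

The 3136 values per mask in output_mask are a flattened 56x56 grid (56 * 56 = 3136), which gets reshaped in the next step.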

The model found 30 instances in the image, each with an associated label and confidence score. Let's create boxes for our instances with the help of Detectron2's Boxes class and turn the flat mask predictions into boolean masks at image resolution that we can apply over the original image:

pred, pred_masks = output[0], output_mask[0]
base = bases[0]
bboxes = Boxes(pred[:, :4])

original_pred_masks = pred_masks.view(-1, hyp['mask_resolution'], hyp['mask_resolution'])

# Paste each low-resolution mask into its bounding box at full image size
pred_masks = retry_if_cuda_oom(paste_masks_in_image)(original_pred_masks, 
                                                     bboxes, 
                                                     (height, width), 
                                                     threshold=0.5)
# Detach Tensors from the device, send to the CPU and turn into NumPy arrays
pred_masks_np = pred_masks.detach().cpu().numpy()
pred_cls = pred[:, 5].detach().cpu().numpy()
pred_conf = pred[:, 4].detach().cpu().numpy()
nimg = image[0].permute(1, 2, 0) * 255
nimg = nimg.cpu().numpy().astype(np.uint8)
nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
# np.int was removed from recent NumPy releases, so use the built-in int
nbboxes = bboxes.tensor.detach().cpu().numpy().astype(int)

The original_pred_masks variable holds the predicted masks at the model's native mask resolution, before they're pasted onto the full-size image:

original_pred_masks.shape # torch.Size([30, 56, 56])
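To see how a flat 3136-value prediction becomes a full-size boolean mask, here is a simplified NumPy sketch. The paste_mask helper is hypothetical and uses nearest-neighbour lookup; it is not how Detectron2 implements paste_masks_in_image, and it assumes the box lies fully inside the image:

```python
import numpy as np

def paste_mask(mask, box, image_shape, threshold=0.5):
    """Nearest-neighbour paste of a low-res mask into a full-size canvas."""
    height, width = image_shape
    x1, y1, x2, y2 = [int(v) for v in box]
    canvas = np.zeros((height, width), dtype=bool)
    box_h, box_w = y2 - y1, x2 - x1
    # For every pixel inside the box, pick the mask cell it falls into
    rows = np.arange(box_h) * mask.shape[0] // box_h
    cols = np.arange(box_w) * mask.shape[1] // box_w
    canvas[y1:y2, x1:x2] = mask[np.ix_(rows, cols)] >= threshold
    return canvas

flat_masks = np.zeros((2, 3136), dtype=np.float32)  # 3136 == 56 * 56
flat_masks[0, :] = 1.0                              # first mask fully "on"
masks = flat_masks.reshape(-1, 56, 56)              # (num_instances, 56, 56)

# Stretch the first 56x56 mask over a 100x70 box in a 120x160 image
full = paste_mask(masks[0], box=(10, 20, 110, 90), image_shape=(120, 160))
```

Predicting masks on a small fixed grid and stretching them over each box keeps the mask output the same size regardless of how large the object is.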

And finally, we can plot the results with:

def plot_results(original_image, pred_img, pred_masks_np, nbboxes, pred_cls, pred_conf, plot_labels=True):
  for one_mask, bbox, cls, conf in zip(pred_masks_np, nbboxes, pred_cls, pred_conf):
    # Skip low-confidence predictions
    if conf < 0.25:
      continue
    color = [np.random.randint(255), np.random.randint(255), np.random.randint(255)]

    pred_img = pred_img.copy()
    # Apply mask over image in color
    pred_img[one_mask] = pred_img[one_mask] * 0.5 + np.array(color, dtype=np.uint8) * 0.5
    # Draw rectangles around all found objects
    pred_img = cv2.rectangle(pred_img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), color, 2)

    if plot_labels:
      label = '%s %.3f' % (names[int(cls)], conf)
      t_size = cv2.getTextSize(label, 0, fontScale=0.1, thickness=1)[0]
      c2 = bbox[0] + t_size[0], bbox[1] - t_size[1] - 3
      pred_img = cv2.rectangle(pred_img, (bbox[0], bbox[1]), c2, color, -1, cv2.LINE_AA)
      pred_img = cv2.putText(pred_img, label, (bbox[0], bbox[1] - 2), 0, 0.5, [255, 255, 255], thickness=1, lineType=cv2.LINE_AA)

  fig, ax = plt.subplots(1, 2, figsize=(pred_img.shape[0]/10, pred_img.shape[1]/10), dpi=150)

  # Turn the batched input tensor (1, C, H, W) into an (H, W, C) array for plotting
  original_image = np.moveaxis(original_image.cpu().numpy().squeeze(), 0, 2).astype('float32')
  original_image = cv2.cvtColor(original_image, cv2.COLOR_RGB2BGR)

  # Original image on the left, segmented image on the right
  ax[0].imshow(original_image)
  ax[0].axis("off")
  ax[1].imshow(pred_img)
  ax[1].axis("off")

The image is copied so we don't apply transformations to the image in-place, but on a copy. For each pixel covered by a predicted mask, we blend in a color with an opacity of 0.5, and for each object, we draw a cv2.rectangle() that encompasses it from the bounding boxes (bbox). If you wish to plot labels, for which there might be significant overlap, there's a plot_labels flag in the plot_results() method signature. Let's try plotting the image we started working with earlier, with and without labels:

%matplotlib inline
plot_results(image, nimg, pred_masks_np, nbboxes, pred_cls, pred_conf, plot_labels=False)
%matplotlib inline
plot_results(image, nimg, pred_masks_np, nbboxes, pred_cls, pred_conf, plot_labels=True)
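The mask overlay inside plot_results() is a plain 50/50 alpha blend applied only where the mask is True. A self-contained NumPy sketch on a made-up 4x4 image shows the same operation in isolation:

```python
import numpy as np

image = np.full((4, 4, 3), 200, dtype=np.uint8)   # light grey toy image
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                             # 2x2 "object"
color = np.array([0, 100, 255], dtype=np.uint8)

blended = image.copy()                            # leave the input untouched
# 50/50 mix of the original pixel and the mask color, only where mask is True
blended[mask] = (image[mask] * 0.5 + color * 0.5).astype(np.uint8)
```

Pixels outside the mask keep their original values, and raising or lowering the two 0.5 weights (keeping their sum at 1) makes the overlay more or less opaque.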

We've plotted both images - the original and the segmented image in one plot. For higher resolution, adjust the dpi (dots per inch) argument in the subplots() call, and plot just the image with the predicted segmentation map/labels to occupy the figure in its entirety.

Author: David Landup

© 2013-2022 Stack Abuse. All rights reserved.