Pose Estimation/Keypoint Detection with YOLOv7 in Python

Pose Estimation/Keypoint Detection with YOLOv7 in Python


Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". From it, keypoint detection (oftentimes used for pose estimation) was extracted.

Object detection and keypoint detection aren't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.

This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification.

Fortunately for the masses - Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5 which has been extended by other research and development teams into newer versions, such as YOLOv7.

In this short guide, we'll be performing Pose Estimation (Keypoint Detection) in Python, with state-of-the-art YOLOv7.

Keypoints can be various points - parts of a face, limbs of a body, etc. Pose estimation is a special case of keypoint detection - in which the points are parts of a human body, and can be used to replace expensive position tracking hardware, enable over-the-air robotics control, and power a new age of human self expression through AR and VR.

YOLO and Pose Estimation

YOLO (You Only Look Once) is a methodology, as well as family of models built for object detection. Since the inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community continued with open-sourced advancements in the continuing years.

Ultralytics' YOLOv5 is the first large-scale implementation of YOLO in PyTorch, which made it more accessible than ever before, but the main reason YOLOv5 has gained such a foothold is also the beautifully simple and powerful API built around it. The project abstracts away the unnecessary details, while allowing customizability, practically all usable export formats, and employs amazing practices that make the entire project both efficient and as optimal as it can be.

YOLOv5 is still the staple project to build Object Detection models with, and many repositories that aim to advance the YOLO method start with YOLOv5 as a baseline and offer a similar API (or simply fork the project and build on top of it). Such is the case of YOLOR (You Only Learn One Representation) and YOLOv7 which built on top of YOLOR (same author) which is the latest advancement in the YOLO methodology.

YOLOv7 isn't just an object detection architecture - it provides new model heads, that can output keypoints (skeletons) and perform instance segmentation besides only bounding box regression, which wasn't standard with previous YOLO models. This isn't surprising, since many object detection architectures were repurposed for instance segmentation and keypoint detection tasks earlier as well, due to the shared general architecture, with different outputs depending on the task. Even though it isn't surprising - supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which have begun outperforming practically all other two-stage detectors a couple of years ago.

This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.

YOLOv7 was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.

The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.

Installing YOLOv7

Let's go ahead and install the project from GitHub:

! git clone https://github.com/WongKinYiu/yolov7.git

This creates a yolov7 directory under your current working directory, in which you'll be able to find the basic project files:

%cd yolov7

LICENSE.md       detect.py        models           tools
README.md        export.py        paper            train.py
cfg              figure           requirements.txt train_aux.py
data             hubconf.py       scripts          utils
deploy           inference        test.py

Note: When calling !cd dirname, it's only applied to that cell. When calling %cd dirname, it's remembered for all subsequent cells as well.

Whenever you run code with a given set of weights - they'll be downloaded and stored in this directory. To perform pose estimation, we'll want to download the weights for the pre-trained YOLOv7 model for that task, which can be found under the /releases/download/ tab on GitHub:

! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o yolov7-w6-pose.pt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  153M  100  153M    0     0  3742k      0  0:00:42  0:00:42 --:--:-- 4573k

Great, we've downloaded the yolov7-w6-pose.pt weights file, which can be used to load and reconstruct a trained model for pose estimation.

Loading the YOLOv7 Pose Estimation Model

Let's import the libraries we'll need to perform pose estimation:

import torch
from torchvision import transforms

from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts

import matplotlib.pyplot as plt
import cv2
import numpy as np

torch and torchvision are straightforward enough - YOLOv7 is implemented with PyTorch. The utils.datasets, utils.general and utils.plots modules come from the YOLOv7 project, and provide us with methods that help with preprocessing and preparing input for the model to run inference on. Amongst those are letterbox() to pad the image, non_max_supression_keypoint() to run the Non-Max Supression algorithm on the initial output of the model and to produce a clean output for our interpretation, as well as the output_to_keypoint() and plot_skeleton_kpts() methods to actually add keypoints to a given image, once they're predicted.

We can load the model from the weight file with torch.load(). Let's create a function to check if a GPU is available, load the model, put it in inference mode and move it to the GPU if available:

def load_model():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = torch.load('yolov7/yolov7-w6-pose.pt', map_location=device)['model']
    # Put in inference mode

    if torch.cuda.is_available():
        # half() turns predictions into float16 tensors
        # which significantly lowers inference time
    return model

model = load_model()

With the model loaded, let's create a run_inference() method that accepts a string pointing to a file on our system. The method will read the image using OpenCV (cv2), pad it with letterbox(), apply transforms to it, and turn it into a batch (the model is trained on and expects batches, as usual):

def run_inference(url):
    image = cv2.imread(url) # shape: (480, 640, 3)
    # Resize and pad image
    image = letterbox(image, 960, stride=64, auto=True)[0] # shape: (768, 960, 3)
    # Apply transforms
    image = transforms.ToTensor()(image) # torch.Size([3, 768, 960])
    # Turn image into batch
    image = image.unsqueeze(0) # torch.Size([1, 3, 768, 960])
    output, _ = model(image) # torch.Size([1, 45900, 57])
    return output, image

Free eBook: Git Essentials

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

Here, we've returned the transformed image (because we'll want to extract the original and plot on it) and the outputs of the model. These outputs contain 45900 keypoint predictions, most of which overlap. We'll want to apply Non-Max Supression to these raw predictions, just as with Object Detection predictions (where many bounding boxes are predicted and then they're "collapsed" given some confidence and IoU threshold). After supression, we can plot each keypoint on the original image and display it:

def visualize_output(output, image):
    output = non_max_suppression_kpt(output, 
                                     0.25, # Confidence Threshold
                                     0.65, # IoU Threshold
                                     nc=model.yaml['nc'], # Number of Classes
                                     nkpt=model.yaml['nkpt'], # Number of Keypoints
    with torch.no_grad():
        output = output_to_keypoint(output)
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)
    plt.figure(figsize=(12, 12))

Now, for some input image, such as karate.jpg in the main working directory, we can run inference, perform Non-Max Supression and plot the results with:

output, image = run_inference('../basketball.jpg') # Bryan Reyes on Unsplash
visualize_output(output, image)

This results in:

Or another one:

output, image = run_inference('../karate.jpg') # Mondo Generator on Unsplash
visualize_output(output, image)

This is a fairly difficult image to infer! Most of the right arm of the practitioner on the right is hidden, and we can see that the model inferred that it is hidden and to the right of the body, missing that the elbow is bent and that a portion of the arm is in front. The practicioner on the left, which is much more clearly seen, is inferred correctly, even with a hidden leg.

As a matter of fact - a person sitting in the back, almost fully invisible to the camera has had their pose seemingly correctly estimated, just based on the position of the hips while sitting down. Great work on behalf of the network!


In thi guide - we've taken a brief look at YOLOv7, the latest advancement in the YOLO family, which builds on top of YOLOR, and further provides instance segmentation and keypoint detection capabilities beyond the standard object detection capabilities of most YOLO-based models.

We've then taken a look at how we can download released weight files, load them in to construct a model and perform pose estimation inference for humans, yielding impressive results.

Going Further - Practical Deep Learning for Computer Vision

Your inquisitive nature makes you want to go further? We recommend checking out our Course: "Practical Deep Learning for Computer Vision with Python".

Another Computer Vision Course?

We won't be doing classification of MNIST digits or MNIST fashion. They served their part a long time ago. Too many learning resources are focusing on basic datasets and basic architectures before letting advanced black-box architectures shoulder the burden of performance.

We want to focus on demystification, practicality, understanding, intuition and real projects. Want to learn how you can make a difference? We'll take you on a ride from the way our brains process images to writing a research-grade deep learning classifier for breast cancer to deep learning networks that "hallucinate", teaching you the principles and theory through practical work, equipping you with the know-how and tools to become an expert at applying deep learning to solve computer vision.

What's inside?

  • The first principles of vision and how computers can be taught to "see"
  • Different tasks and applications of computer vision
  • The tools of the trade that will make your work easier
  • Finding, creating and utilizing datasets for computer vision
  • The theory and application of Convolutional Neural Networks
  • Handling domain shift, co-occurrence, and other biases in datasets
  • Transfer Learning and utilizing others' training time and computational resources for your benefit
  • Building and training a state-of-the-art breast cancer classifier
  • How to apply a healthy dose of skepticism to mainstream ideas and understand the implications of widely adopted techniques
  • Visualizing a ConvNet's "concept space" using t-SNE and PCA
  • Case studies of how companies use computer vision techniques to achieve better results
  • Proper model evaluation, latent space visualization and identifying the model's attention
  • Performing domain research, processing your own datasets and establishing model tests
  • Cutting-edge architectures, the progression of ideas, what makes them unique and how to implement them
  • KerasCV - a WIP library for creating state of the art pipelines and models
  • How to parse and read papers and implement them yourself
  • Selecting models depending on your application
  • Creating an end-to-end machine learning pipeline
  • Landscape and intuition on object detection with Faster R-CNNs, RetinaNets, SSDs and YOLO
  • Instance and semantic segmentation
  • Real-Time Object Recognition with YOLOv5
  • Training YOLOv5 Object Detectors
  • Working with Transformers using KerasNLP (industry-strength WIP library)
  • Integrating Transformers with ConvNets to generate captions of images
  • DeepDream
  • Deep Learning model optimization for computer vision
Was this article helpful?

Improve your dev skills!

Get tutorials, guides, and dev jobs in your inbox.

No spam ever. Unsubscribe at any time. Read our Privacy Policy.

David LandupAuthor

Entrepreneur, Software and Machine Learning Engineer, with a deep fascination towards the application of Computation and Deep Learning in Life Sciences (Bioinformatics, Drug Discovery, Genomics), Neuroscience (Computational Neuroscience), robotics and BCIs.

Great passion for accessible education and promotion of reason, science, humanism, and progress.


Real-Time Road Sign Detection with YOLOv5

# python# machine learning# computer vision# pytorch

If you drive - there's a chance you enjoy cruising down the road. A responsible driver pays attention to the road signs, and adjusts their...

David Landup
David Landup

Practical Deep Learning for Computer Vision with Python

# python# machine learning# tensorflow# computer vision

DeepDream with TensorFlow/Keras Keypoint Detection with Detectron2 Image Captioning with KerasNLP Transformers and ConvNets Semantic Segmentation with DeepLabV3+ in Keras Real-Time Object Detection from...

David Landup
Jovana Ninkovic

© 2013-2022 Stack Abuse. All rights reserved.