Real-Time Pose Estimation from Video in Python with YOLOv7

Introduction

Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". Keypoint detection (oftentimes used for pose estimation) grew out of it as a task in its own right.

Keypoints can be various points - parts of a face, limbs of a body, etc. Pose estimation is a special case of keypoint detection - in which the points are parts of a human body.

Pose estimation is an amazing, extremely fun and practical usage of computer vision. With it, we can do away with hardware used to estimate poses (motion capture suits), which are costly and unwieldy. Additionally, we can map the movement of humans to the movement of robots in Euclidean space, enabling fine precision motor movement without using controllers, which usually don't allow for higher levels of precision. Keypoint estimation can be used to translate our movements to 3D models in AR and VR, and increasingly is being used to do so with just a web cam. Finally - pose estimation can help us in sports and security.

In this guide, we'll be performing real-time pose estimation from a video in Python, using the state-of-the-art YOLOv7 model.

Specifically, we'll be working with a video from the 2018 Winter Olympics, held in PyeongChang, South Korea:

Aljona Savchenko and Bruno Massot delivered an amazing performance, including bodies overlapping in front of the camera, fast fluid movement and spins in the air. It'll be a great opportunity to see how the model handles difficult-to-infer situations!

YOLO and Pose Estimation

YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community has continued with open-sourced advancements in the following years.

Ultralytics' YOLOv5 is an industry-grade object detection repository, built on top of the YOLO method. It's implemented in PyTorch, as opposed to the C-based Darknet framework used by previous YOLO models, is fully open source, and has a beautifully simple and powerful API that lets you infer, train and customize the project flexibly. It's such a staple that most new attempts at improving the YOLO method build on top of it.

This is how YOLOR (You Only Learn One Representation) was created, as well as YOLOv7, which builds on top of YOLOR (by the same authors)!

YOLOv7 isn't just an object detection architecture - it provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression, which wasn't standard with previous YOLO models. This isn't surprising, since many object detection architectures had already been re-purposed for instance segmentation and keypoint detection tasks, due to the shared general architecture, with different outputs depending on the task.

Advice: If you're interested in reading more about instance segmentation, read our "Instance Segmentation with YOLOv7 in Python"!

Even though it isn't surprising - supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which began outperforming practically all two-stage detectors a couple of years ago in terms of both accuracy and speed.

This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.

YOLOv7 was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.

The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.

Installing YOLOv7

Let's start by cloning the repository to get hold of the source code:

! git clone https://github.com/WongKinYiu/yolov7.git

Now, let's move into the yolov7 directory, which contains the project, and take a look at the contents:

%cd yolov7
!ls
/content/yolov7
cfg	   figure      output.mp4	 test.py       
data	   hubconf.py  paper		 tools
deploy	   inference   README.md	 train_aux.py
detect.py  LICENSE.md  requirements.txt  train.py
export.py  models      scripts		 utils

Note: Calling !cd dirname moves you into a directory in that cell. Calling %cd dirname moves you into a directory across the upcoming cells as well and keeps you there.
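For instance, in a notebook these two lines behave differently (a quick illustration):

!cd yolov7 && ls   # the cd only applies within this single shell command
%cd yolov7         # changes the working directory for this and all subsequent cells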

Now, YOLO is meant to be an object detector, and doesn't ship with pose estimation weights by default. We'll want to download the weights and load a concrete model instance from them. The weights are available on the same GitHub repository, and can easily be downloaded through the CLI as well:

! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o yolov7-w6-pose.pt

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  153M  100  153M    0     0  23.4M      0  0:00:06  0:00:06 --:--:-- 32.3M

Once downloaded, we can import the libraries and helper methods we'll be using:

import torch
from torchvision import transforms

from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts

import matplotlib.pyplot as plt
import cv2
import numpy as np

Great! Let's get on with loading the model and creating a script that lets you infer poses from videos with YOLOv7 and OpenCV.

Real-Time Pose Estimation with YOLOv7

Let's first create a method to load the model from the downloaded weights. We'll check what device we have available (CPU or GPU):

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def load_model():
    model = torch.load('yolov7-w6-pose.pt', map_location=device)['model']
    # Put in inference mode
    model.float().eval()

    if torch.cuda.is_available():
        # half() converts the model's weights (and thus its outputs) to float16,
        # which significantly lowers inference time on supported GPUs
        model.half().to(device)
    return model

model = load_model()

Depending on whether we have a GPU or not, we'll turn half-precision on (using float16 instead of float32 in operations), which makes inference significantly faster. Note that it's highly encouraged to perform this on a GPU for real-time speeds, as CPUs will likely lack the power to do so unless running on small videos.
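As a quick, optional sanity check, you can inspect the precision and device that the loaded weights ended up on:

# On a GPU this should print torch.float16 and cuda:0; on a CPU, torch.float32 and cpu
param = next(model.parameters())
print(param.dtype, param.device)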

Let's write a convenient method for running inference. We'll accept images as NumPy arrays (as that's the format we'll get when reading frames from the video later). First, using the letterbox() function - we'll resize and pad the frame to a shape that the model can work with. This doesn't need to be, and won't be, the shape (resolution) of the resulting video!

Then, we'll apply the transforms, convert the image to half precision (if a GPU is available), batch it and run it through the model:

def run_inference(image):
    # Resize and pad image
    image = letterbox(image, 960, stride=64, auto=True)[0] # shape: (567, 960, 3)
    # Apply transforms
    image = transforms.ToTensor()(image) # torch.Size([3, 567, 960])
    if torch.cuda.is_available():
        image = image.half().to(device)
    # Turn image into batch
    image = image.unsqueeze(0) # torch.Size([1, 3, 567, 960])
    with torch.no_grad():
        output, _ = model(image)
    return output, image

We'll return the predictions of the model, as well as the image as a tensor. These are "rough" predictions - they contain many activations that overlap, and we'll want to "clean them up" using Non-Max Suppression, and plot the predicted skeletons over the image itself:

def draw_keypoints(output, image):
    output = non_max_suppression_kpt(output, 
                                     0.25, # Confidence Threshold
                                     0.65, # IoU Threshold
                                     nc=model.yaml['nc'], # Number of Classes
                                     nkpt=model.yaml['nkpt'], # Number of Keypoints
                                     kpt_label=True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)

    return nimg

With these in place, our general flow will look like:

img = read_img()
output, img = run_inference(img)
keypoint_img = draw_keypoints(output, img)
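
Before moving on to video, it's worth trying the pipeline on a single frame. Here's a minimal sketch of that flow - the skater.jpg filename is just a hypothetical example image:

# Minimal single-image sketch - 'skater.jpg' is a hypothetical example file
img = cv2.imread('skater.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

output, img_tensor = run_inference(img)
keypoint_img = draw_keypoints(output, img_tensor)

# draw_keypoints() returns a BGR image (OpenCV's convention), so flip back for Matplotlib
plt.figure(figsize=(12, 8))
plt.imshow(cv2.cvtColor(keypoint_img, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()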

To translate that to a real-time video setting - we'll use OpenCV to read a video, and run this process for every frame. While processing each frame, we'll also write it to an output file, encoded as a video. This necessarily slows things down, since we're running inference, displaying the results and writing to disk - so you can speed up inference and display by skipping the file creation and writing in the loop:

def pose_estimation_video(filename):
    cap = cv2.VideoCapture(filename)
    # VideoWriter for saving the video
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter('ice_skating_output.mp4', fourcc, 30.0, (int(cap.get(3)), int(cap.get(4))))
    while cap.isOpened():
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            output, frame = run_inference(frame)
            frame = draw_keypoints(output, frame)
            frame = cv2.resize(frame, (int(cap.get(3)), int(cap.get(4))))
            out.write(frame)
            cv2.imshow('Pose estimation', frame)
        else:
            break

        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()

The VideoWriter accepts several parameters - the output filename, the FourCC (a four-character code denoting the codec used to encode the video), the frame rate and the resolution as a tuple. To avoid guessing or resizing the video - we've used the width and height of the original video, obtained through the VideoCapture instance, which contains data about the video itself, such as the width, height, total number of frames, etc.
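
If you prefer more readable code, OpenCV also exposes named constants for these properties, and you can read the source video's frame rate instead of hardcoding 30 FPS - a small sketch of how those lines inside pose_estimation_video() could look:

# Equivalent, more readable property access on the VideoCapture instance
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))   # same as cap.get(3)
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) # same as cap.get(4)
fps = cap.get(cv2.CAP_PROP_FPS)                  # source FPS instead of a hardcoded 30.0

out = cv2.VideoWriter('ice_skating_output.mp4', fourcc, fps, (width, height))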

Now, we can call the method on any input video:

pose_estimation_video('../ice_skating.mp4')

This will open up an OpenCV window, displaying the inference in real-time. It'll also write a video file in the yolov7 directory (since we've cd'd into it):

Note: If your GPU is struggling, or if you want to embed the results of a model like this into an application that has latency as a crucial aspect of the workflow - make the video smaller and work on smaller frames. This is a full HD 1920x1080 video, and should be able to run fast on most home systems, but if it doesn't work as well on your system, make the image(s) smaller.
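
One simple way to do that, sketched below, is to lower the letterbox() target size inside run_inference() - a smaller input resolution means fewer operations per forward pass, at some cost to accuracy on small or distant people:

# Inside run_inference() - a smaller letterbox target (e.g. 640 instead of 960)
# shrinks the input the model sees, and with it the inference time per frame
image = letterbox(image, 640, stride=64, auto=True)[0]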

Conclusion

In this guide, we've taken a look at the YOLO method, YOLOv7 and the relationship between YOLO and object detection, pose estimation and instance segmentation. We've then taken a look at how you can easily install and work with YOLOv7 using the programmatic API, and created several convenience methods to make inference and displaying results easier.

Finally, we've opened a video using OpenCV, run inference with YOLOv7, and made a function for performing pose estimation in real-time, saving the resulting video in full resolution at 30 FPS to your local disk.
