Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". Keypoint detection (oftentimes used for pose estimation) grew out of it.
Keypoints can be various points - parts of a face, limbs of a body, etc. Pose estimation is a special case of keypoint detection - in which the points are parts of a human body.
Pose estimation is an amazing, extremely fun and practical usage of computer vision. With it, we can do away with the hardware traditionally used to estimate poses (motion capture suits), which is costly and unwieldy. Additionally, we can map the movement of humans to the movement of robots in Euclidean space, enabling fine-precision motor movement without controllers, which usually don't allow for higher levels of precision. Keypoint estimation can be used to translate our movements to 3D models in AR and VR, and is increasingly being used to do so with just a webcam. Finally, pose estimation can help us in sports and security.
In this guide, we'll be performing real-time pose estimation from a video in Python, using the state-of-the-art YOLOv7 model.
Specifically, we'll be working with a video from the 2018 Winter Olympics, held in South Korea's PyeongChang:
Aljona Savchenko and Bruno Massot delivered an amazing performance, including overlapping bodies against the camera, fast fluid movement and spinning in the air. It'll be an amazing opportunity to see how the model handles difficult-to-infer situations!
YOLO and Pose Estimation
YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community continued with open-sourced advancements in the following years.
Ultralytics' YOLOv5 is an industry-grade object detection repository, built on top of the YOLO method. It's implemented in PyTorch, as opposed to the C-based Darknet framework used by previous YOLO models, is fully open source, and has a beautifully simple and powerful API that lets you infer, train and customize the project flexibly. It's such a staple that most new attempts at improving the YOLO method build on top of it.
This is how YOLOR (You Only Learn One Representation), and YOLOv7, which was built on top of YOLOR by the same author, were created as well!
YOLOv7 isn't just an object detection architecture - it provides new model heads that can output keypoints (skeletons) and perform instance segmentation besides only bounding box regression, which wasn't standard with previous YOLO models. This isn't surprising, since many object detection architectures were re-purposed for instance segmentation and keypoint detection tasks earlier as well, due to the shared general architecture, with different outputs depending on the task.
Advice: If you're interested in reading more about instance segmentation, read our "Instance Segmentation with YOLOv7 in Python"!
Even though it isn't surprising - supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which began outperforming practically all two-stage detectors a couple of years ago in terms of both accuracy and speed.
This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.
YOLOv7 was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.
The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.
Let's start by cloning the repository to get hold of the source code:
! git clone https://github.com/WongKinYiu/yolov7.git
Now, let's move into the yolov7 directory, which contains the project, and take a look at the contents:
%cd yolov7
!ls
/content/yolov7
cfg        figure      output.mp4        test.py
data       hubconf.py  paper             tools
deploy     inference   README.md         train_aux.py
detect.py  LICENSE.md  requirements.txt  train.py
export.py  models      scripts           utils
Note: Calling !cd dirname only changes the directory inside the temporary subshell that runs that command, so the change doesn't carry over. Calling %cd dirname moves you into the directory across the upcoming cells as well and keeps you there.
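The difference between the two can be reproduced outside of a notebook. Here's a minimal sketch using only the standard library, on the assumption that `!cd` behaves like a command run in a child shell and `%cd` behaves like `os.chdir()` in the current process. The temporary directory is just a stand-in for `dirname`:

```python
import os
import subprocess
import tempfile

start = os.getcwd()

with tempfile.TemporaryDirectory() as target:
    # Comparable to `!cd`: the change happens in a child shell and is lost
    subprocess.run(f'cd "{target}"', shell=True, check=True)
    after_subshell = os.getcwd()

    # Comparable to `%cd`: the change applies to this process and persists
    os.chdir(target)
    after_chdir = os.getcwd()

    # Move back before the temporary directory is cleaned up
    os.chdir(start)

print("After subshell cd:", after_subshell)
print("After os.chdir():", after_chdir)
```

The first printed path is still the starting directory, while the second one is the temporary directory.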
Now, YOLO is meant to be an object detector, and doesn't ship with pose estimation weights by default. We'll want to download the weights and load a concrete model instance from them. The weights are available on the same GitHub repository, and can easily be downloaded through the CLI as well:
! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o yolov7-w6-pose.pt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  153M  100  153M    0     0  23.4M      0  0:00:06  0:00:06 --:--:-- 32.3M
Once downloaded, we can import the libraries and helper methods we'll be using:
import torch
from torchvision import transforms

from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts

import matplotlib.pyplot as plt
import cv2
import numpy as np
Great! Let's get on with loading the model and creating a script that lets you infer poses from videos with YOLOv7 and OpenCV.
Real-Time Pose Estimation with YOLOv7
Let's first create a method to load the model from the downloaded weights. We'll check what device we have available (CPU or GPU):
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def load_model():
    model = torch.load('yolov7-w6-pose.pt', map_location=device)['model']
    # Put in inference mode
    model.float().eval()

    if torch.cuda.is_available():
        # half() turns predictions into float16 tensors
        # which significantly lowers inference time
        model.half().to(device)
    return model

model = load_model()
Depending on whether we have a GPU or not, we'll turn half-precision on (using float16 instead of float32 in operations), which makes inference significantly faster. Note that it's highly encouraged to perform this on a GPU for real-time speeds, as CPUs will likely lack the power to do so unless running on small videos.
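To get a feel for what half precision trades away, here's a minimal standard-library sketch (not part of the YOLOv7 code) that stores the same value as float16 and float32 and compares size and rounding. The `round_trip()` helper is a hypothetical name introduced just for this illustration:

```python
import struct

def round_trip(value, fmt):
    """Pack a Python float into the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# 'e' is IEEE 754 half precision (float16), 'f' is single precision (float32)
half_bytes = struct.calcsize('e')
single_bytes = struct.calcsize('f')

value = 0.1
as_half = round_trip(value, 'e')
as_single = round_trip(value, 'f')

print(f"float16: {half_bytes} bytes, 0.1 stored as {as_half!r}")
print(f"float32: {single_bytes} bytes, 0.1 stored as {as_single!r}")
```

Each value takes half the bytes, which means less memory traffic on the GPU, at the cost of a coarser representation. That loss is usually acceptable for inference.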
Let's write a convenient method for running inference. We'll accept images as NumPy arrays (as that's how we'll be passing them later while reading the video). First, using the letterbox() function - we'll resize and pad each frame to a shape that the model can work with. This doesn't need to be and won't be the shape (resolution) of the resulting video!
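To make the resize-and-pad step more concrete, here's a simplified sketch of the shape arithmetic. It isn't the repository's code: `letterbox_shape()` is a hypothetical helper, written on the assumption that letterboxing scales the frame to fit a square target while keeping its aspect ratio, and then adds only as much padding as is needed to make each side divisible by the stride:

```python
def letterbox_shape(height, width, target=960, stride=64):
    """Return the (height, width) of a frame after an aspect-preserving
    resize and minimal padding to a multiple of `stride`."""
    # Scale so that the longer side fits inside the target square
    ratio = min(target / height, target / width)
    new_h, new_w = round(height * ratio), round(width * ratio)

    # Padding needed to fill the square, reduced to the minimum
    # that makes each side divisible by the stride
    pad_h = (target - new_h) % stride
    pad_w = (target - new_w) % stride
    return new_h + pad_h, new_w + pad_w

print(letterbox_shape(1080, 1920))  # (576, 960)
```

A full HD frame is scaled down to 540x960, and 36 rows of padding bring the height up to 576, the next multiple of 64.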
Then, we'll apply the transforms, convert the image to half precision (if a GPU is available), batch it and run it through the model:
def run_inference(image):
    # Resize and pad image; letterbox() returns a tuple, the image comes first
    image = letterbox(image, 960, stride=64, auto=True)[0] # shape: (576, 960, 3)
    # Apply transforms
    image = transforms.ToTensor()(image) # torch.Size([3, 576, 960])
    if torch.cuda.is_available():
        image = image.half().to(device)
    # Turn image into batch
    image = image.unsqueeze(0) # torch.Size([1, 3, 576, 960])
    with torch.no_grad():
        output, _ = model(image)
    return output, image
We'll return the predictions of the model, as well as the image as a tensor. These are "rough" predictions - they contain many activations that overlap, and we'll want to "clean them up" using Non-Max Suppression, and plot the predicted skeletons over the image itself:
def draw_keypoints(output, image):
    output = non_max_suppression_kpt(output,
                                     0.25, # Confidence Threshold
                                     0.65, # IoU Threshold
                                     nc=model.yaml['nc'], # Number of Classes
                                     nkpt=model.yaml['nkpt'], # Number of Keypoints
                                     kpt_label=True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    # Take the first (and only) image in the batch and scale it back to 0-255
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    # Draw one skeleton per detected person
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)

    return nimg
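For intuition on what the two thresholds do, here's a generic sketch of greedy Non-Max Suppression in plain Python. It is not the implementation behind non_max_suppression_kpt(), which also handles keypoints and batching. The `iou()` and `greedy_nms()` helpers and the sample boxes are made up for illustration:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, conf_thres=0.25, iou_thres=0.65):
    """Return the indices of the boxes that survive, highest score first."""
    # Drop low-confidence boxes, then visit the rest from best to worst
    candidates = [i for i, s in enumerate(scores) if s >= conf_thres]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    for i in candidates:
        # Keep a box only if it doesn't overlap too much with a better one
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.1]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of roughly 0.68, so it's suppressed, and the last box falls below the confidence threshold.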
With these in place, our general flow will look like:
img = read_img()
output, img = run_inference(img)
keypoint_img = draw_keypoints(output, img)
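The slice `output[idx, 7:]` and the step of 3 passed to plot_skeleton_kpts() suggest that each detection row holds seven leading values, followed by the keypoints as flat (x, y, confidence) triplets. Assuming that layout (it's worth verifying against the repository), a row could be unpacked like this. `split_keypoints()` and the sample row are hypothetical:

```python
def split_keypoints(row, offset=7, step=3):
    """Group the flat values after `offset` into (x, y, confidence) triplets."""
    flat = list(row[offset:])
    return [tuple(flat[i:i + step]) for i in range(0, len(flat) - step + 1, step)]

# Made-up row: seven leading values, then two keypoints
row = [0, 0, 480.0, 270.0, 100.0, 200.0, 0.92,
       470.0, 200.0, 0.99,
       490.0, 205.0, 0.45]

print(split_keypoints(row))  # [(470.0, 200.0, 0.99), (490.0, 205.0, 0.45)]
```

Having the keypoints as triplets makes it easy to, for example, skip points whose confidence is too low before using them.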
To translate that to a real-time video setting - we'll use OpenCV to read a video, and run this process for every frame. On each frame, we'll also write the frame into a new file, encoded as a video. This will necessarily slow down the process as we're running the inference, displaying it and writing - so you can speed up the inference and display by avoiding the creation of a new file and writing to it in the loop:
def pose_estimation_video(filename):
    cap = cv2.VideoCapture(filename)
    # Width and height of the source video (property IDs 3 and 4)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # VideoWriter for saving the video
    fourcc = cv2.VideoWriter_fourcc(*'MP4V')
    out = cv2.VideoWriter('ice_skating_output.mp4', fourcc, 30.0, (width, height))
    while cap.isOpened():
        (ret, frame) = cap.read()
        if ret == True:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            output, frame = run_inference(frame)
            frame = draw_keypoints(output, frame)
            # Scale the letterboxed frame back to the original resolution
            frame = cv2.resize(frame, (width, height))
            out.write(frame)
            cv2.imshow('Pose estimation', frame)
        else:
            break

        # Press 'q' to stop early
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()
VideoWriter accepts several parameters - the output filename, the FourCC (a four-character code, denoting the codec used to encode the video), the frame rate and the resolution as a tuple. To avoid guessing or resizing the video - we've used the width and height of the original video, obtained through the VideoCapture instance that contains data about the video itself, such as the width, height, total number of frames, etc.
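A FourCC is simply four ASCII characters packed into one 32-bit integer, with the first character in the lowest byte. Here's a small pure-Python sketch of that packing, which to our understanding matches what cv2.VideoWriter_fourcc() returns. The `fourcc()` and `fourcc_to_str()` helpers are illustrative names, not OpenCV functions:

```python
def fourcc(c1, c2, c3, c4):
    """Pack four characters into a 32-bit integer, lowest byte first."""
    return ord(c1) | (ord(c2) << 8) | (ord(c3) << 16) | (ord(c4) << 24)

def fourcc_to_str(code):
    """Recover the four characters from the packed integer."""
    return ''.join(chr((code >> (8 * i)) & 0xFF) for i in range(4))

code = fourcc(*'MP4V')
print(hex(code), fourcc_to_str(code))
```

The reverse direction is handy for checking which codec an existing video uses, since the codec property of a capture is reported as the packed integer.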
Now, we can call the method on any input video:
This will open up an OpenCV window, displaying the inference in real-time. It'll also write a video file in the yolov7 directory (since we've cd'd into it).
Note: If your GPU is struggling, or if you want to embed the results of a model like this into an application that has latency as a crucial aspect of the workflow - make the video smaller and work on smaller frames. This is a full HD 1920x1080 video, and should be able to run fast on most home systems, but if it doesn't work as well on your system, make the image(s) smaller.
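If you do want to shrink the frames, one simple option is to cap the longer side and keep the aspect ratio. Here's a minimal sketch of that calculation. `scaled_size()` is a hypothetical helper, and rounding down to even numbers is a precaution on our part, since some codecs are picky about odd dimensions:

```python
def scaled_size(width, height, max_side=960):
    """Shrink (width, height) so the longer side is at most `max_side`,
    keeping the aspect ratio and rounding down to even numbers."""
    # Never upscale: the scale factor is capped at 1.0
    scale = min(1.0, max_side / max(width, height))
    new_w = int(width * scale) // 2 * 2
    new_h = int(height * scale) // 2 * 2
    return new_w, new_h

print(scaled_size(1920, 1080))  # (960, 540)
```

The resulting tuple is in (width, height) order, which is the order cv2.resize() expects for its target size, so it could be applied to each frame before running inference.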
In this guide, we've taken a look at the YOLO method, YOLOv7 and the relationship between YOLO and object detection, pose estimation and instance segmentation. We've then taken a look at how you can easily install and work with YOLOv7 using the programmatic API, and created several convenience methods to make inference and displaying results easier.
Finally, we've opened a video using OpenCV, run inference with YOLOv7, and made a function for performing pose estimation in real-time, saving the resulting video in full resolution and at 30 FPS on your local disk.