Introduction
Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". Keypoint detection (oftentimes used for pose estimation) branched out of it as a related task.
Object detection and keypoint detection aren't as standardized as image classification, mainly because most new developments are done by individual researchers, maintainers and developers, rather than by large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch while maintaining the API guidelines that have guided the framework's development so far.
This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification.
Fortunately for the masses - Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5, which has been extended by other research and development teams into newer versions, such as YOLOv7.
In this short guide, we'll be performing Pose Estimation (Keypoint Detection) in Python, with state-of-the-art YOLOv7.
Keypoints can be various points - parts of a face, limbs of a body, etc. Pose estimation is a special case of keypoint detection - in which the points are parts of a human body, and can be used to replace expensive position tracking hardware, enable over-the-air robotics control, and power a new age of human self-expression through AR and VR.
YOLO and Pose Estimation
YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community has continued with open-sourced advancements in the years since.
Ultralytics' YOLOv5 is the first large-scale implementation of YOLO in PyTorch, which made it more accessible than ever before, but the main reason YOLOv5 has gained such a foothold is also the beautifully simple and powerful API built around it. The project abstracts away the unnecessary details, while allowing customizability, practically all usable export formats, and employs amazing practices that make the entire project both efficient and as optimal as it can be.
YOLOv5 is still the staple project to build Object Detection models with, and many repositories that aim to advance the YOLO method start with YOLOv5 as a baseline and offer a similar API (or simply fork the project and build on top of it). Such is the case with YOLOR (You Only Learn One Representation) and YOLOv7, which was built on top of YOLOR by the same author and is the latest advancement in the YOLO methodology.
YOLOv7 isn't just an object detection architecture - it provides new model heads that can output keypoints (skeletons) and perform instance segmentation, rather than only bounding box regression, which wasn't standard with previous YOLO models. This isn't surprising, since many object detection architectures had already been re-purposed for instance segmentation and keypoint detection tasks, due to the shared general architecture, with different outputs depending on the task. Even so - supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which began outperforming practically all two-stage detectors a couple of years ago.
This makes instance segmentation and keypoint detection faster to perform than ever before, with a simpler architecture than two-stage detectors.
YOLOv7 was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.
The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.
Installing YOLOv7
Let's go ahead and install the project from GitHub:
! git clone https://github.com/WongKinYiu/yolov7.git
This creates a yolov7 directory under your current working directory, in which you'll be able to find the basic project files:
%cd yolov7
!ls
/Users/macbookpro/jup/yolov7
LICENSE.md detect.py models tools
README.md export.py paper train.py
cfg figure requirements.txt train_aux.py
data hubconf.py scripts utils
deploy inference test.py
Note: When calling !cd dirname, it's only applied to that cell. When calling %cd dirname, it's remembered for all subsequent cells as well.
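The listing above also shows a requirements.txt file. If you don't already have the project's dependencies installed, you'll likely want to install them from it before moving on:

! pip install -r requirements.txt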
Whenever you run code with a given set of weights - they'll be downloaded and stored in this directory. To perform pose estimation, we'll want to download the weights for the pre-trained YOLOv7 model for that task, which can be found on the project's Releases page on GitHub (under the /releases/download/ path):
! curl -L https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt -o yolov7-w6-pose.pt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 153M 100 153M 0 0 3742k 0 0:00:42 0:00:42 --:--:-- 4573k
Great, we've downloaded the yolov7-w6-pose.pt weights file, which can be used to load and reconstruct a trained model for pose estimation.
Loading the YOLOv7 Pose Estimation Model
Let's import the libraries we'll need to perform pose estimation:
import torch
from torchvision import transforms
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts
import matplotlib.pyplot as plt
import cv2
import numpy as np
torch and torchvision are straightforward enough - YOLOv7 is implemented with PyTorch. The utils.datasets, utils.general and utils.plots modules come from the YOLOv7 project, and provide us with methods that help with preprocessing and preparing input for the model to run inference on. Amongst those are letterbox() to pad the image, non_max_suppression_kpt() to run the Non-Max Suppression algorithm on the initial output of the model and produce a clean output for our interpretation, as well as the output_to_keypoint() and plot_skeleton_kpts() methods to actually add keypoints to a given image, once they're predicted.
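As a quick illustration of what letterbox() does before we use it for real - this is just a sketch, and assumes the karate.jpg image used later in the guide sits one level above the yolov7 directory:

img = cv2.imread('../karate.jpg')                      # (height, width, channels), BGR order
padded = letterbox(img, 960, stride=64, auto=True)[0]  # resize + pad while keeping aspect ratio
print(img.shape, '->', padded.shape)                   # padded dimensions are multiples of the stride (64)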
We can load the model from the weight file with torch.load(). Let's create a function to check if a GPU is available, load the model, put it in inference mode and move it to the GPU if available:
def load_model():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = torch.load('yolov7-w6-pose.pt', map_location=device)['model']
    # Put in inference mode
    model.float().eval()

    if torch.cuda.is_available():
        # half() turns predictions into float16 tensors
        # which significantly lowers inference time
        model.half().to(device)
    return model
model = load_model()
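As a quick sanity check - the loaded model carries its configuration in model.yaml, and we'll pass two of those fields to Non-Max Suppression later on. For the pose-estimation weights, these should correspond to a single class (person) and the 17 COCO keypoints:

print(model.yaml['nc'])    # number of classes
print(model.yaml['nkpt'])  # number of keypoints per detection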
With the model loaded, let's create a run_inference() method that accepts a string pointing to a file on our system. The method will read the image using OpenCV (cv2), pad it with letterbox(), apply transforms to it, and turn it into a batch (the model is trained on and expects batches, as usual):
def run_inference(url):
    image = cv2.imread(url) # shape: (480, 640, 3)
    # Resize and pad image
    image = letterbox(image, 960, stride=64, auto=True)[0] # shape: (768, 960, 3)
    # Apply transforms
    image = transforms.ToTensor()(image) # torch.Size([3, 768, 960])
    # Turn image into batch
    image = image.unsqueeze(0) # torch.Size([1, 3, 768, 960])
    output, _ = model(image) # torch.Size([1, 45900, 57])
    return output, image
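One caveat worth noting: load_model() moves the model to the GPU in half precision when CUDA is available, while run_inference() builds a float32 tensor on the CPU, so on a GPU machine the input's device and dtype need to match the model's. A minimal sketch of a GPU-aware variant (the run_inference_gpu name and the module-level device variable are our own additions, mirroring the one inside load_model()):

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def run_inference_gpu(url):
    image = cv2.imread(url)
    # Resize, pad and batch the image exactly as before
    image = letterbox(image, 960, stride=64, auto=True)[0]
    image = transforms.ToTensor()(image).unsqueeze(0)
    if torch.cuda.is_available():
        # Match the model, which was moved with model.half().to(device)
        image = image.half().to(device)
    with torch.no_grad():
        # no_grad avoids keeping gradient buffers during inference
        output, _ = model(image)
    return output, image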
Here, we've returned the transformed image (because we'll want to extract the original and plot on it) and the outputs of the model. These outputs contain 45900 keypoint predictions, most of which overlap. We'll want to apply Non-Max Suppression to these raw predictions, just as with Object Detection predictions (where many bounding boxes are predicted and then they're "collapsed" given some confidence and IoU threshold). After suppression, we can plot each keypoint on the original image and display it:
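To make sense of those shapes: each of the 45900 candidate predictions is a 57-value vector - assuming the standard COCO keypoint format this model was trained on, that's 6 values for the bounding box (center, size, objectness and class score) plus 17 keypoints with an (x, y, confidence) triplet each. After output_to_keypoint(), each remaining detection row starts with 7 bookkeeping/box values, which is why the plotting loop below slices from index 7 onward:

# Rough breakdown of the raw output's last dimension
# (an assumption based on the COCO format of 17 keypoints per person)
box_values = 6            # cx, cy, w, h, objectness, class score
keypoint_values = 17 * 3  # (x, y, confidence) per keypoint
print(box_values + keypoint_values)  # 57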
def visualize_output(output, image):
    output = non_max_suppression_kpt(output, 
                                     0.25, # Confidence Threshold
                                     0.65, # IoU Threshold
                                     nc=model.yaml['nc'], # Number of Classes
                                     nkpt=model.yaml['nkpt'], # Number of Keypoints
                                     kpt_label=True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    nimg = image[0].permute(1, 2, 0) * 255
    nimg = nimg.cpu().numpy().astype(np.uint8)
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)
    plt.figure(figsize=(12, 12))
    plt.axis('off')
    plt.imshow(nimg)
    plt.show()
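If you'd rather save the annotated image to disk than (or in addition to) displaying it, you can write nimg back out with OpenCV by adding a line inside visualize_output(), after the plotting loop - note the conversion back to BGR, since OpenCV expects that channel order (the pose_result.jpg filename here is just an example):

# Inside visualize_output(), after the keypoints have been drawn onto nimg:
cv2.imwrite('pose_result.jpg', cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR))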
Now, for some input image, such as basketball.jpg in the main working directory (one level above yolov7, hence the ../ prefix), we can run inference, perform Non-Max Suppression and plot the results with:
output, image = run_inference('../basketball.jpg') # Bryan Reyes on Unsplash
visualize_output(output, image)
This results in:
Or another one:
output, image = run_inference('../karate.jpg') # Mondo Generator on Unsplash
visualize_output(output, image)
This is a fairly difficult image to infer! Most of the right arm of the practitioner on the right is hidden, and we can see that the model inferred that it is hidden and to the right of the body, missing that the elbow is bent and that a portion of the arm is in front. The practitioner on the left, who is much more clearly visible, is inferred correctly, even with a hidden leg.
As a matter of fact - a person sitting in the back, almost fully invisible to the camera, has had their pose seemingly correctly estimated, just based on the position of the hips while sitting down. Great work on behalf of the network!
Conclusion
In this guide - we've taken a brief look at YOLOv7, the latest advancement in the YOLO family, which builds on top of YOLOR, and further provides instance segmentation and keypoint detection capabilities beyond the standard object detection capabilities of most YOLO-based models.
We've then taken a look at how to download released weight files, load them to reconstruct a trained model, and perform pose estimation inference on humans, yielding impressive results.