Real-Time Object Detection Inference in Python with YOLOv7


Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild".

Object detection isn't as standardized as image classification, mainly because most new developments are typically made by individual researchers, maintainers and developers, rather than by large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch while maintaining the API guidelines that have guided the framework's development so far.

This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification.

Fortunately for the masses - Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5, which has been extended by other research and development teams into newer versions, such as YOLOv7.

In this short guide, we'll be performing Object Detection in Python, with state-of-the-art YOLOv7.

YOLO Landscape and YOLOv7

YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community has continued with open-sourced advancements in the following years.

Ultralytics' YOLOv5 is the first large-scale implementation of YOLO in PyTorch, which made it more accessible than ever before, but the main reason YOLOv5 has gained such a foothold is also the beautifully simple and powerful API built around it. The project abstracts away the unnecessary details, while allowing customizability, practically all usable export formats, and employs amazing practices that make the entire project both efficient and as optimal as it can be.

YOLOv5 is still the staple project to build object detection models with, and many repositories that aim to advance the YOLO method start with YOLOv5 as a baseline and offer a similar API (or simply fork the project and build on top of it). Such is the case with YOLOR (You Only Learn One Representation) and YOLOv7, which was built on top of YOLOR by the same authors. YOLOv7 is the latest advancement in the YOLO methodology, and most notably, it provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression, which wasn't standard with previous YOLO models.

This makes instance segmentation and keypoint detection faster than ever before!

It was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.

In addition, YOLOv7 is both faster and more accurate than previous models, due to a reduced parameter count and higher computational efficiency:

The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.

Installing YOLOv7

Installing and using YOLOv7 boils down to downloading the GitHub repository to your local machine and running the scripts that come packaged with it.

Note: Unfortunately, as of writing, YOLOv7 doesn't offer a clean programmatic API such as YOLOv5's, which is typically loaded via torch.hub.load(), passing the GitHub repository in. This appears to be a feature that should work but is currently failing. As it gets fixed, I'll update the guide or publish a new one on the programmatic API. For now - we'll focus on the inference scripts provided in the repository.

Even so, you can perform detection in real-time on videos, images, etc. and save the results easily. The project follows the same conventions as YOLOv5, which has extensive documentation, so you're likely to find answers to more niche questions in the YOLOv5 repository if you have any.

Let's download the repository and perform some inference:

! git clone https://github.com/WongKinYiu/yolov7.git

This creates a yolov7 directory in your current working directory, which houses the project. Let's move into that directory and take a look at the files:

%cd yolov7
/Users/macbookpro/jup/yolov7
cfg     data                deploy    figure    inference    models
paper   requirements.txt    runs      scripts   tools        utils

Note: On a Google Colab notebook, you'll have to run the magic %cd command in each cell in which you want to be inside yolov7, as each subsequent cell returns you to your original working directory. On local Jupyter Notebooks, changing the directory once keeps you in it, so there's no need to re-issue the command multiple times.

detect.py is the inference script that runs detections and saves the results under runs/detect/video_name, where you can specify the video_name while calling the script. export.py exports the model to various formats, such as ONNX, TFLite, etc. train.py can be used to train a custom YOLOv7 detector (the topic of another guide), and test.py can be used to test a detector (loaded from a weights file).

Several additional directories hold the configurations (cfg), example data (inference), data on constructing models and COCO configurations (data), etc.

YOLOv7 Sizes

YOLO-based models scale well, and are typically exported as smaller, less-accurate models, and larger, more-accurate models. These are then deployed to weaker or stronger devices respectively.

YOLOv7 offers several sizes, and benchmarked them against MS COCO:

| Model      | Test Size | AP (test) | AP50 (test) | AP75 (test) | Batch 1 FPS | Batch 32 Average Time |
|------------|-----------|-----------|-------------|-------------|-------------|-----------------------|
| YOLOv7     | 640       | 51.4%     | 69.7%       | 55.9%       | 161 fps     | 2.8 ms                |
| YOLOv7-X   | 640       | 53.1%     | 71.2%       | 57.8%       | 114 fps     | 4.3 ms                |
| YOLOv7-W6  | 1280      | 54.9%     | 72.6%       | 60.1%       | 84 fps      | 7.6 ms                |
| YOLOv7-E6  | 1280      | 56.0%     | 73.5%       | 61.2%       | 56 fps      | 12.3 ms               |
| YOLOv7-D6  | 1280      | 56.6%     | 74.0%       | 61.8%       | 44 fps      | 15.0 ms               |
| YOLOv7-E6E | 1280      | 56.8%     | 74.4%       | 62.1%       | 36 fps      | 18.7 ms               |

Depending on the underlying hardware you're expecting the model to run on, and the required accuracy - you can choose between them. The smallest model hits over 160 FPS on images of size 640, on a V100! You can expect satisfactory real-time performance on more common consumer GPUs as well.
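When comparing sizes, it can help to translate the batch-1 FPS figures into a per-frame latency budget. A minimal sketch (model names and FPS values are taken from the table above):

```python
def fps_to_latency_ms(fps):
    """Convert a frames-per-second figure into milliseconds per frame."""
    return 1000.0 / fps

# Batch-1 FPS figures from the benchmark table above (measured on a V100)
benchmarks = {
    "YOLOv7": 161,
    "YOLOv7-X": 114,
    "YOLOv7-W6": 84,
    "YOLOv7-E6": 56,
    "YOLOv7-D6": 44,
    "YOLOv7-E6E": 36,
}

for name, fps in benchmarks.items():
    print(f"{name}: {fps_to_latency_ms(fps):.1f} ms/frame")
```

At 161 FPS, the smallest model leaves you roughly a 6 ms budget per frame - plenty of headroom for real-time video, which only needs ~33 ms/frame at 30 FPS.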

Video Inference with YOLOv7

Create an inference-data folder to store the images and/or videos you'd like to detect from. Assuming it's in the same directory, we can run a detection script with:

! python3 detect.py --source inference-data/busy_street.mp4 --weights yolov7.pt --name video_1 --view-img

This will open a Qt-based video window on your desktop in which you can see the live progress and inference, frame by frame, as well as stream the status to the standard output pipe:

Namespace(weights=[''], source='inference-data/busy_street.mp4', img_size=640, conf_thres=0.25, iou_thres=0.45, device='', view_img=True, save_txt=False, save_conf=False, nosave=False, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect', name='video_1', exist_ok=False, no_trace=False)
YOLOR πŸš€ v0.1-112-g55b90e1 torch 1.12.1 CPU

Downloading to
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 72.1M/72.1M [00:18<00:00, 4.02MB/s]

Fusing layers... 
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
 Convert model to Traced-model... 
 traced_script_module saved! 
 model is traced! 
video 1/1 (1/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: 24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, Done. (1071.6ms) Inference, (2.4ms) NMS
video 1/1 (2/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: 24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, Done. (1070.8ms) Inference, (1.3ms) NMS

Note that the project will run slowly on CPU-based machines (around 1000ms per inference step in the output above, run on an Intel-based 2017 MacBook Pro), and significantly faster on GPU-based machines (closer to ~5ms/frame on a V100). Even on a CPU-based system such as this one, the much smaller YOLOv7-tiny model runs at 172ms/frame, which, while far from real-time, is still very decent for handling these operations on a CPU.
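If you'd like a throughput estimate on your own hardware, the per-frame timings can be scraped straight from detect.py's output. A small sketch, assuming the log format shown above:

```python
import re

# Two lines copied from the inference output above (paths shortened)
log = """\
video 1/1 (1/402) busy_street.mp4: 24 persons, 1 bicycle, 8 cars, Done. (1071.6ms) Inference, (2.4ms) NMS
video 1/1 (2/402) busy_street.mp4: 24 persons, 1 bicycle, 8 cars, Done. (1070.8ms) Inference, (1.3ms) NMS"""

def average_inference_ms(text):
    """Extract the '(X.Yms) Inference' timings from detect.py logs and average them."""
    times = [float(t) for t in re.findall(r"\(([\d.]+)ms\) Inference", text)]
    return sum(times) / len(times)

avg = average_inference_ms(log)
print(f"{avg:.1f} ms/frame ≈ {1000 / avg:.2f} FPS")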

Once the run is done, you can find the resulting video under runs/detect/video_1 (the name we supplied in the call), saved as an .mp4:
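Since results land under runs/detect/<name>, a tiny helper can locate them programmatically - for instance, to feed the output video into a later processing step. A sketch (the helper name is my own, not part of the repository):

```python
from pathlib import Path

def find_results(name, run_root="runs/detect", ext=".mp4"):
    """Return the saved result files for a named detect.py run, sorted by filename."""
    return sorted(Path(run_root, name).glob(f"*{ext}"))

# Usage (after the run above): find_results("video_1")
```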

Inference on Images

Inference on images boils down to the same process - supplying the path to an image in the file system, and calling detect.py:

! python3 detect.py --source inference-data/desk.jpg --weights yolov7.pt

Note: As of writing, the output doesn't scale the labels to the image size, even if you set --img SIZE. This means that large images will have really thin bounding box lines and small labels.


Conclusion

In this short guide, we've taken a brief look at YOLOv7, the latest advancement in the YOLO family, which builds on top of YOLOR. We've seen how to install the repository on a local machine and run object detection inference scripts with a pre-trained network on videos and images.

In further guides, we'll be covering keypoint detection and instance segmentation.


David Landup, Author


Β© 2013-2024 Stack Abuse. All rights reserved.