Introduction
Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild".
Object detection isn't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than by large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch while maintaining the API guidelines that have guided the framework's development so far.
This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification.
Fortunately for the masses - Ultralytics has developed a simple, very powerful and beautiful object detection API around their YOLOv5, and the approach has since been extended by other research and development teams into newer versions, such as YOLOv7.
In this short guide, we'll be performing Object Detection in Python, with state-of-the-art YOLOv7.
YOLO Landscape and YOLOv7
YOLO (You Only Look Once) is a methodology, as well as a family of models built for object detection. Since its inception in 2015, YOLOv1, YOLOv2 (YOLO9000) and YOLOv3 have been proposed by the same author(s) - and the deep learning community has continued with open-sourced advancements in the years since.
Ultralytics' YOLOv5 is the first large-scale implementation of YOLO in PyTorch, which made it more accessible than ever before, but the main reason YOLOv5 has gained such a foothold is also the beautifully simple and powerful API built around it. The project abstracts away the unnecessary details, while allowing customizability, practically all usable export formats, and employs amazing practices that make the entire project both efficient and as optimal as it can be.
YOLOv5 is still the staple project to build Object Detection models with, and many repositories that aim to advance the YOLO method start with YOLOv5 as a baseline and offer a similar API (or simply fork the project and build on top of it). Such is the case for YOLOR (You Only Learn One Representation) and YOLOv7, which builds on top of YOLOR (by the same authors). YOLOv7 is the latest advancement in the YOLO methodology, and most notably, it provides new model heads that can output keypoints (skeletons) and perform instance segmentation in addition to bounding box regression, which wasn't standard with previous YOLO models.
This makes instance segmentation and keypoint detection faster than ever before!
It was released alongside a paper named "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", and the source code is available on GitHub.
In addition, YOLOv7 runs faster and more accurately than previous models, due to a reduced parameter count and higher computational efficiency.
The model itself was created through architectural changes, as well as optimizing aspects of training, dubbed "bag-of-freebies", which increased accuracy without increasing inference cost.
Installing YOLOv7
Installing and using YOLOv7 boils down to downloading the GitHub repository to your local machine and running the scripts that come packaged with it.
Note: Unfortunately, as of writing, YOLOv7 doesn't offer a clean programmatic API such as YOLOv5's, which is typically loaded through torch.hub.load(), passing the GitHub repository in. This appears to be a feature that should work (the repository ships a hubconf.py), but it's currently failing. As it gets fixed, I'll update the guide or publish a new one on the programmatic API. For now - we'll focus on the inference scripts provided in the repository.
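For reference, this is the kind of programmatic usage the note refers to - YOLOv5's documented torch.hub workflow, which YOLOv7 aims to mirror eventually. This is a minimal sketch of YOLOv5's API, not YOLOv7's, and the example image URL is simply Ultralytics' sample:

import torch

# Load a pretrained YOLOv5 model straight from the GitHub repository (YOLOv5's API)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on an image (local path or URL) and inspect the detections
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()  # class counts and confidences
results.save()   # saves annotated images to disk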
Even so, you can perform detection in real-time on videos, images, etc. and save the results easily. The project follows the same conventions as YOLOv5, which has extensive documentation, so you're likely to find answers to more niche questions in the YOLOv5 repository if you have any.
Let's download the repository and perform some inference:
! git clone https://github.com/WongKinYiu/yolov7.git
This creates a yolov7 directory in your current working directory, which houses the project. Let's move into that directory and take a look at the files:
%cd yolov7
!ls
/Users/macbookpro/jup/yolov7
LICENSE.md detect.py models tools
README.md export.py paper train.py
cfg figure requirements.txt train_aux.py
data hubconf.py scripts utils
deploy inference test.py runs
Note: On a Google Colab Notebook, you'll have to run the magic %cd command in each cell in which you want your directory to be yolov7, since each new cell returns you to your original working directory. On local Jupyter Notebooks, changing the directory once keeps you in it, so there's no need to re-issue the command multiple times.
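Before running any of the scripts, you'll also want to install the project's dependencies, listed in the requirements.txt we saw above (a standard pip step, run from inside the yolov7 directory):

! pip install -r requirements.txt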
The detect.py script runs inference and saves the results under runs/detect/video_name, where video_name is the run name you specify when calling the detect.py script (via the --name flag). export.py exports the model to various formats, such as ONNX, TFLite, etc. train.py can be used to train a custom YOLOv7 detector (the topic of another guide), and test.py can be used to test a detector (loaded from a weights file).
Several additional directories hold the configurations (cfg), example data (inference), data on constructing models and COCO configurations (data), etc.
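To get a feel for how the other scripts are typically called, here are sketches based on the repository's README at the time of writing - the exact flags are assumptions, so double-check them with python3 test.py --help and python3 export.py --help:

! python3 test.py --data data/coco.yaml --img 640 --batch 32 --conf 0.001 --iou 0.65 --weights yolov7.pt --name yolov7_640_val
! python3 export.py --weights yolov7.pt --grid --simplify --img-size 640 640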
YOLOv7 Sizes
YOLO-based models scale well, and are typically exported as smaller, less-accurate models, and larger, more-accurate models. These are then deployed to weaker or stronger devices respectively.
YOLOv7 comes in several sizes, which have been benchmarked against MS COCO:
Model | Test Size | AP (test) | AP50 (test) | AP75 (test) | Batch 1 FPS | Batch 32 Average Time
---|---|---|---|---|---|---
YOLOv7 | 640 | 51.4% | 69.7% | 55.9% | 161 FPS | 2.8 ms
YOLOv7-X | 640 | 53.1% | 71.2% | 57.8% | 114 FPS | 4.3 ms
YOLOv7-W6 | 1280 | 54.9% | 72.6% | 60.1% | 84 FPS | 7.6 ms
YOLOv7-E6 | 1280 | 56.0% | 73.5% | 61.2% | 56 FPS | 12.3 ms
YOLOv7-D6 | 1280 | 56.6% | 74.0% | 61.8% | 44 FPS | 15.0 ms
YOLOv7-E6E | 1280 | 56.8% | 74.4% | 62.1% | 36 FPS | 18.7 ms
Depending on the underlying hardware you're expecting the model to run on, and the required accuracy - you can choose between them. The smallest model hits over 160 FPS on images of size 640, on a V100! You can expect satisfactory real-time performance on more common consumer GPUs as well.
Video Inference with YOLOv7
Create an inference-data folder to store the images and/or videos you'd like to detect from. Assuming it's in the same directory, we can run a detection script with:
! python3 detect.py --source inference-data/busy_street.mp4 --weights yolov7.pt --name video_1 --view-img
This will open a Qt-based video window on your desktop, in which you can see the live progress and inference frame by frame, as well as output the status to the standard output pipe:
Namespace(weights=['yolov7.pt'], source='inference-data/busy_street.mp4', img_size=640, conf_thres=0.25, iou_thres=0.45, device='', view_img=True, save_txt=False, save_conf=False, nosave=False, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect', name='video_1', exist_ok=False, no_trace=False)
YOLOR 🚀 v0.1-112-g55b90e1 torch 1.12.1 CPU
Downloading https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt to yolov7.pt...
100%|██████████████████████████████████████| 72.1M/72.1M [00:18<00:00, 4.02MB/s]
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
video 1/1 (1/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: 24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, Done. (1071.6ms) Inference, (2.4ms) NMS
video 1/1 (2/402) /Users/macbookpro/jup/yolov7/inference-data/busy_street.mp4: 24 persons, 1 bicycle, 8 cars, 3 traffic lights, 2 backpacks, 2 handbags, Done. (1070.8ms) Inference, (1.3ms) NMS
Note that the project will run slowly on CPU-based machines (around 1000ms per inference step in the output above, run on an Intel-based 2017 MacBook Pro), and significantly faster on GPU-based machines (closer to ~5ms/frame on a V100). Even on CPU-based systems such as this one, yolov7-tiny.pt runs at around 172ms/frame, which, while far from real-time, is still very decent for handling these operations on a CPU.
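For instance, to trade a bit of accuracy for speed, you can point the same script at the tiny weights - they're downloaded automatically from the repository's releases, just like yolov7.pt was above (video_2 is simply an arbitrary run name):

! python3 detect.py --source inference-data/busy_street.mp4 --weights yolov7-tiny.pt --name video_2 --view-img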
Once a run is done, you can find the resulting video under runs/detect/video_1 (the run name we supplied in the detect.py call), saved as an .mp4 file.
Inference on Images
Inference on images boils down to the same process - supplying the path to an image on the file system, and calling detect.py:
! python3 detect.py --source inference-data/desk.jpg --weights yolov7.pt
Note: As of writing, the output doesn't scale the labels to the image size, even if you set --img SIZE. This means that large images will have really thin bounding box lines and small labels.
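If that's a problem for your use case, one possible workaround is to re-draw the boxes yourself: run detect.py with --save-txt and --save-conf (both flags show up in the Namespace output earlier), then read the saved label files and draw them with OpenCV at whatever thickness and font scale you like. Here's a minimal sketch, assuming a run named desk_run that produced runs/detect/desk_run/labels/desk.txt, and that the labels follow the usual normalized "class x_center y_center width height confidence" format - both of which are assumptions worth verifying against your own run directory:

import cv2

# Hypothetical label path from a run started with:
# python3 detect.py --source inference-data/desk.jpg --weights yolov7.pt --name desk_run --save-txt --save-conf
image = cv2.imread('inference-data/desk.jpg')
h, w = image.shape[:2]

with open('runs/detect/desk_run/labels/desk.txt') as f:
    for line in f:
        # Assumed format: class x_center y_center width height confidence, coordinates normalized to [0, 1]
        cls, xc, yc, bw, bh, conf = map(float, line.split())
        # Convert normalized center-based coordinates to pixel corner coordinates
        x1, y1 = int((xc - bw / 2) * w), int((yc - bh / 2) * h)
        x2, y2 = int((xc + bw / 2) * w), int((yc + bh / 2) * h)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), thickness=3)
        cv2.putText(image, f'{int(cls)} {conf:.2f}', (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)

cv2.imwrite('desk_redrawn.jpg', image)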
Conclusion
In this short guide - we've taken a brief look at YOLOv7, the latest advancement in the YOLO family, which builds on top of YOLOR. We've taken a look at how to install the repository on your local machine and run object detection inference scripts with a pre-trained network on videos and images.
In further guides, we'll be covering keypoint detection and instance segmentation.