Object detection has been gaining steam, and improvements are being made to several approaches to solving it. In the past couple of years, YOLO-based methods have been outperforming others in terms of accuracy and speed, with recent advancements such as YOLOv7 and YOLOv6 (which was released independently, after YOLOv7).
However, all of these concern 2D object detection, which is a difficult task in and of itself. Recently, we've been able to successfully perform 3D object detection, and while these detectors are still less mature than 2D object detectors, their accuracy is rising.
In this guide, we'll be performing 3D object detection in Python with MediaPipe's Objectron.
Note: MediaPipe is Google's open source framework for building machine learning pipelines to process images, videos and audio streams, primarily for mobile devices. It's being used both internally and externally, and provides pre-trained models for various tasks, such as face detection, face meshing, hand and pose estimation, hair segmentation, object detection, box tracking, etc.
MediaPipe and 3D Object Detection
The Objectron solution was trained on the Objectron Dataset, which contains short object-centric videos. The dataset contains only 9 object categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops and shoes, so it's not a very general dataset. Even so, procuring and processing these videos is fairly expensive (camera poses, sparse point-clouds, characterization of the planar surfaces, etc. for each frame of each video), and the dataset is nearly 2 terabytes in size.
The Objectron model (known as a solution in MediaPipe projects) is trained on four categories - shoes, chairs, mugs and cameras.
2D object detection uses the term "bounding boxes", while they're actually rectangles. 3D object detection actually predicts boxes around objects, from which you can infer their orientation, size, rough volume, etc. This is a fairly difficult task to take on, especially given the lack of appropriate datasets and the cost of creating them. While difficult, the problem holds promise for various Augmented Reality (AR) applications!
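To make the "orientation, size, rough volume" point concrete, here is a minimal NumPy sketch. It is not MediaPipe's own code: it assumes a box is described by a 3x3 rotation matrix, a translation vector and a per-axis scale, and the helper names `box_corners` and `box_volume`, as well as the corner ordering, are illustrative assumptions.

```python
import itertools
import numpy as np

def box_corners(rotation, translation, scale):
    """Return the 8 corners (8x3) of an oriented 3D box.

    Assumption: the box is a unit cube centred at the origin, stretched by
    `scale`, rotated by `rotation` and moved by `translation`. The corner
    ordering is arbitrary and not guaranteed to match Objectron's.
    """
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    scale = np.asarray(scale, dtype=float)
    # All sign combinations of (+-0.5, +-0.5, +-0.5): corners of a unit cube
    unit = np.array(list(itertools.product([-0.5, 0.5], repeat=3)))
    # Scale each axis, rotate every corner, then shift to the box centre
    return (unit * scale) @ rotation.T + translation

def box_volume(scale):
    """Rough volume of the box: the product of its three side lengths."""
    return float(np.prod(scale))

# A 0.1 x 0.2 x 0.3 box, rotated 90 degrees around the z-axis
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
corners = box_corners(rot_z_90, translation=[1.0, 2.0, 3.0], scale=[0.1, 0.2, 0.3])
extents = corners.max(axis=0) - corners.min(axis=0)

print(corners.shape)         # (8, 3)
print(corners.mean(axis=0))  # the box centre, i.e. the translation
print(extents)               # x and y extents are swapped by the rotation
print(box_volume([0.1, 0.2, 0.3]))
```

A flat 2D rectangle carries none of this: the rotation tells you which way the object faces, and the scale gives its size along each of its own axes.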
The Objectron solution can run in a single-stage or two-stage mode. The single-stage mode is better at detecting multiple objects, while the two-stage mode is better at detecting a single main object in the scene and runs significantly faster. The single-stage pipeline uses a MobileNetV2 backbone, while the two-stage pipeline uses the TensorFlow Object Detection API.
When an object is detected in a video, further predictions aren't made for it on each frame for two reasons:
- Continuous predictions introduce high jitteriness (due to the inherent stochasticity in the predictions)
- It's expensive to run large models on every frame
The team limits the heavy predictions to first encounters only and then tracks that box as long as the object in question is still in the scene. Once the line of sight is broken and the object is re-introduced, a prediction is made again. This makes it possible to use larger models with higher accuracy while keeping the computational cost low, and lowers the hardware requirements for real-time inference!
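The detect-once-then-track idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not MediaPipe's implementation: `DetectThenTrack`, `toy_detect` and `toy_track` are hypothetical names, and a "frame" here is just a number standing in for the object's position (or `None` when the object is out of view).

```python
class DetectThenTrack:
    """Run an expensive detector only when nothing is being tracked.

    `detect_fn(frame)` and `track_fn(frame, box)` are hypothetical callables:
    each returns a box, or None when the object is not found or is lost.
    """

    def __init__(self, detect_fn, track_fn):
        self.detect_fn = detect_fn
        self.track_fn = track_fn
        self.box = None          # box currently being tracked, if any
        self.detector_calls = 0  # how often the heavy model was run

    def update(self, frame):
        if self.box is not None:
            # Cheap path: follow the existing box into the new frame
            self.box = self.track_fn(frame, self.box)
        if self.box is None:
            # Expensive path: first encounter, or tracking was lost
            self.detector_calls += 1
            self.box = self.detect_fn(frame)
        return self.box


# Toy stand-ins for a real detector and tracker
def toy_detect(frame):
    return None if frame is None else (frame, frame + 10)

def toy_track(frame, box):
    return None if frame is None else (frame, frame + 10)

pipeline = DetectThenTrack(toy_detect, toy_track)
# The object is visible, leaves the scene for one frame, then returns
frames = [0, 1, 2, 3, 4, None, 6, 7, 8, 9]
boxes = [pipeline.update(f) for f in frames]

print(pipeline.detector_calls)  # 3 detector runs for 10 frames
```

The detector runs on the first frame, on the frame where tracking is lost, and on the frame where the object re-enters; every other frame takes the cheap tracking path.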
Let's go ahead and install MediaPipe, import the Objectron solution and apply it to static images and a video feed coming straight from a camera.
Let's first install MediaPipe:
! pip install mediapipe
With the framework installed, let's import it alongside common libraries:
import mediapipe as mp
import cv2
import numpy as np
import matplotlib.pyplot as plt
Let's define a helper method that fetches an image from a given URL and returns an RGB array representing it:
import urllib.request

def url_to_array(url):
    # Download the raw image bytes
    req = urllib.request.urlopen(url)
    # Image bytes are unsigned, so use uint8 rather than int8
    arr = np.array(bytearray(req.read()), dtype=np.uint8)
    # Decode the bytes into a 3-channel BGR image
    arr = cv2.imdecode(arr, cv2.IMREAD_COLOR)
    # OpenCV decodes to BGR, so convert to RGB
    arr = cv2.cvtColor(arr, cv2.COLOR_BGR2RGB)
    return arr

mug_url = 'https://goodstock.photos/wp-content/uploads/2018/01/Laptop-Coffee-Mug-on-Table.jpg'
mug = url_to_array(mug_url)
Finally, we'll want to import both the Objectron solution and the drawing utilities to visualize predictions:
mp_objectron = mp.solutions.objectron
mp_drawing = mp.solutions.drawing_utils
3D Object Detection on Static Images with MediaPipe
The Objectron class accepts several arguments, including:
- static_image_mode: Whether you're feeding in a static image or a stream of images (video)
- max_num_objects: The maximum number of objects to detect
- min_detection_confidence: The detection confidence threshold (how sure the network has to be to classify an object as the given class)
- model_name: Which model you'd like to load: 'Cup', 'Shoe', 'Camera' or 'Chair'
With those in mind, let's instantiate an Objectron instance and process() the input image:
# Instantiation
objectron = mp_objectron.Objectron(
    static_image_mode=True,
    max_num_objects=5,
    min_detection_confidence=0.2,
    model_name='Cup')

# Inference
results = objectron.process(mug)
results contain the 2D and 3D landmarks of the detected object(s) as well as the rotation, translation and scale for each. We can process the results and draw the bounding boxes fairly easily using the provided drawing utils:
# Copy image so as not to draw on the original one
annotated_image = mug.copy()

if not results.detected_objects:
    print('No box landmarks detected.')
else:
    for detected_object in results.detected_objects:
        # Draw landmarks
        mp_drawing.draw_landmarks(annotated_image,
                                  detected_object.landmarks_2d,
                                  mp_objectron.BOX_CONNECTIONS)
        # Draw axis based on rotation and translation
        mp_drawing.draw_axis(annotated_image,
                             detected_object.rotation,
                             detected_object.translation)

# Plot result
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(annotated_image)
ax.axis('off')
plt.show()
This results in:
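Beyond drawing, you may want the 2D landmarks as pixel coordinates, for example to crop or label the object yourself. The sketch below assumes each landmark exposes normalized `.x` and `.y` values relative to the image width and height, as MediaPipe's normalized landmarks do; `landmarks_to_pixels` is a hypothetical helper, and the stand-in data replaces a real detection so the snippet runs on its own.

```python
from types import SimpleNamespace

def landmarks_to_pixels(landmarks, image_width, image_height):
    """Convert normalized (x, y) landmarks to integer pixel coordinates.

    Assumption: `.x` and `.y` are fractions of the image width and height.
    Values are clamped because a box corner can fall outside the image.
    """
    pixels = []
    for lm in landmarks:
        px = min(max(int(lm.x * image_width), 0), image_width - 1)
        py = min(max(int(lm.y * image_height), 0), image_height - 1)
        pixels.append((px, py))
    return pixels

# Stand-in landmarks instead of a real detection
fake_landmarks = [
    SimpleNamespace(x=0.5, y=0.5),   # image centre
    SimpleNamespace(x=1.0, y=0.0),   # top-right edge
    SimpleNamespace(x=-0.1, y=0.25), # slightly outside the left edge
]
pixels = landmarks_to_pixels(fake_landmarks, image_width=640, image_height=480)
print(pixels)  # [(320, 240), (639, 0), (0, 120)]
```

With a real result, you would likely pass `detected_object.landmarks_2d.landmark` together with `annotated_image.shape[1]` (width) and `annotated_image.shape[0]` (height).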
3D Object Detection from Video or Webcam with MediaPipe
A more exciting application is on videos! You don't have to change the code much to accommodate videos, whether you're providing one from the webcam or an existing video file. OpenCV is a natural fit for reading, manipulating and feeding video frames into the Objectron model:
# VideoCapture(0) tries to capture from the webcam
cap = cv2.VideoCapture(0)
# Or
# cap = cv2.VideoCapture('filename.mp4')

objectron = mp_objectron.Objectron(static_image_mode=False,
                                   max_num_objects=5,
                                   min_detection_confidence=0.4,
                                   min_tracking_confidence=0.70,
                                   model_name='Cup')

# Read video stream and feed into the model
while cap.isOpened():
    success, image = cap.read()
    # Stop when no frame could be read (end of file or camera error)
    if not success:
        break

    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = objectron.process(image)

    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    if results.detected_objects:
        for detected_object in results.detected_objects:
            mp_drawing.draw_landmarks(image,
                                      detected_object.landmarks_2d,
                                      mp_objectron.BOX_CONNECTIONS)
            mp_drawing.draw_axis(image,
                                 detected_object.rotation,
                                 detected_object.translation)

    cv2.imshow('MediaPipe Objectron', cv2.flip(image, 1))
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
Making the image non-writeable with image.flags.writeable = False makes the process run somewhat faster, and is an optional change. The final cv2.flip() on the resulting image is also optional - and simply makes the output mirrored to make it a bit more intuitive.
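Both optional steps are easy to try in isolation with plain NumPy. The snippet below is a small sketch on a tiny stand-in array rather than a real camera frame: it shows that a read-only array rejects in-place writes, and that horizontal mirroring (what cv2.flip(image, 1) does) amounts to reversing the width axis.

```python
import numpy as np

# Tiny stand-in "image": 2 rows, 2 columns, 3 channels
frame = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# A read-only array rejects in-place writes
frame.flags.writeable = False
try:
    frame[0, 0, 0] = 255
    write_blocked = False
except ValueError:
    write_blocked = True
print(write_blocked)  # True

# Restore write access before drawing on the frame
frame.flags.writeable = True
frame[0, 0, 0] = 255

# Horizontal mirroring reverses the column (width) axis
mirrored = frame[:, ::-1]
print(np.array_equal(mirrored[:, 0], frame[:, -1]))  # True
```

This is also why the loop sets the flag back to True before calling the drawing utilities, which modify the image in place.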
When run on a camera and a globally common Ikea mug, these are the results:
The output is slightly jittery, but handles rotation and translation well, even with a shaky hand holding the low-resolution camera. What happens when an object is taken out of the frame?
Predictions for the object stop after the first detection, and box tracking takes over. The tracker clearly picks up that the object has left the frame, and prediction and tracking are performed again as soon as the object re-enters it. It appears that the tracking works somewhat better when the model can see the mug handle, as the outputs are more jittery when the handle is not visible (presumably because it's harder to accurately ascertain the true orientation of the mug).
Some angles also seem to produce significantly more stable outputs than others under challenging lighting conditions. For mugs specifically, it helps to be able to see the lip of the mug, as it helps with perspective, rather than seeing an orthogonal projection of the object.
Finally, when tested on a transparent mug, the model had difficulty recognizing it as a mug. This is likely an example of an out-of-distribution object, as most mugs are opaque and come in various colors.
3D object detection is still somewhat young, and MediaPipe's Objectron is a capable demonstration! While sensitive to lighting conditions, object types (transparent vs opaque mugs, etc.) and slightly jittery - Objectron is a good glimpse into what will soon be possible to do with higher accuracy and accessibility than ever before.