3D Object Detection (3D Bounding Boxes) in Python with MediaPipe Objectron


Object detection has been gaining steam, and improvements are being made to several approaches to solving it. In the past couple of years, YOLO-based methods have been outperforming others in terms of accuracy and speed, with recent advancements such as YOLOv7 and YOLOv6 (which was released independently, after YOLOv7).

However, all of these address 2D object detection, which is a difficult task in and of itself. More recently, 3D object detection has become practical as well, and while these detectors are still less mature than 2D object detectors, their accuracy is rising.

In this guide, we'll be performing 3D object detection in Python with MediaPipe's Objectron.

Note: MediaPipe is Google's open source framework for building machine learning pipelines to process images, videos and audio streams, primarily for mobile devices. It's being used both internally and externally, and provides pre-trained models for various tasks, such as face detection, face meshing, hand and pose estimation, hair segmentation, object detection, box tracking, etc.

All of these can be, and are, used for downstream tasks - such as applying filters to faces, automated camera focusing, biometric verification, hand-controlled robotics, etc. Most projects are available with APIs for Android, iOS, C++, Python and JavaScript, while some are only available for certain languages.

In this guide, we'll be working with MediaPipe's Objectron, available for Android, C++, Python and JavaScript.

MediaPipe and 3D Object Detection

The Objectron solution was trained on the Objectron Dataset, which contains short object-centric videos. The dataset only covers nine object categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops and shoes, so it's not a very general dataset. Procuring and processing these videos is fairly expensive, though, since each frame of each video comes with camera poses, sparse point-clouds, a characterization of the planar surfaces, etc., which makes the dataset nearly 2 terabytes in size.

The trained Objectron model (known as a solution for MediaPipe projects) is trained on four categories - shoes, chairs, mugs and cameras.

2D object detection uses the term "bounding boxes", while they're actually rectangles. 3D object detection actually predicts boxes around objects, from which you can infer their orientation, size, rough volume, etc. This is a fairly difficult task to take on, especially given the lack of appropriate datasets and the cost of creating them. While difficult, the problem holds promise for various Augmented Reality (AR) applications!
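To make the difference from a 2D rectangle concrete, here is a minimal sketch of how an oriented 3D box can be expanded into its eight corners and a rough volume. It assumes the box is described by a 3x3 rotation matrix, a center translation and a per-axis scale. The helper names `box_corners` and `box_volume` are hypothetical, and the corner ordering and axis conventions are not meant to match Objectron's own.

```python
from itertools import product


def box_corners(rotation, translation, scale):
    """Return the 8 corners of an oriented 3D box.

    rotation:    3x3 rotation matrix as a list of rows
    translation: box center (x, y, z)
    scale:       box extents along its local axes (sx, sy, sz)
    """
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner in the box's local frame: half the extent along each axis
        local = [sign * extent for sign, extent in zip(signs, scale)]
        # Rotate into the outer frame, then shift to the box center
        corner = tuple(
            sum(rotation[row][k] * local[k] for k in range(3)) + translation[row]
            for row in range(3)
        )
        corners.append(corner)
    return corners


def box_volume(scale):
    """Rough volume of the box: the product of its three extents."""
    sx, sy, sz = scale
    return sx * sy * sz


# Illustrative values only: an axis-aligned box, one unit in front of the camera
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
corners = box_corners(identity, (0.0, 0.0, -1.0), (0.2, 0.1, 0.1))
print(len(corners), box_volume((0.2, 0.1, 0.1)))
```

A 2D rectangle only gives you position and extent in the image plane, whereas the rotation matrix here carries the object's orientation as well.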

The Objectron solution can run as a single-stage or a two-stage pipeline. The single-stage pipeline is better at detecting multiple objects, while the two-stage pipeline is better at detecting a single main object in the scene, and runs significantly faster. The single-stage pipeline uses a MobileNetV2 backbone, while the two-stage pipeline uses the TensorFlow Object Detection API.

When an object is detected in a video, further predictions aren't made for it on each frame for two reasons:

  • Continuous predictions introduce high jitteriness (due to the inherent stochasticity in the predictions)
  • It's expensive to run large models on every frame

The team offloads the heavy predictions to first encounters only and then tracks that box as long as the object in question is still in the scene. Once the line of sight is broken and the object is re-introduced, a prediction is made again. This makes it possible to use larger models with higher accuracy, while keeping the computational requirements low, and lowers the hardware requirements for real-time inference!
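The detect-once-then-track idea can be sketched in a few lines. This is not MediaPipe's implementation, only an illustration of the control flow: `detect` and `track` are hypothetical callables standing in for the expensive model and the cheap box tracker, and each "frame" is reduced to a boolean saying whether the object is visible.

```python
def run_pipeline(frames, detect, track):
    """Run the expensive detector only when no box is currently tracked.

    detect(frame)     -> box or None (expensive, hypothetical)
    track(frame, box) -> box or None (cheap, hypothetical; None means the object was lost)
    """
    box = None
    outputs = []
    detector_calls = 0
    for frame in frames:
        if box is not None:
            # Cheap path: follow the box we already have
            box = track(frame, box)
        if box is None:
            # Expensive path: first encounter, or the object was lost
            box = detect(frame)
            detector_calls += 1
        outputs.append(box)
    return outputs, detector_calls


def detect(frame):
    # Stand-in detector: finds a box whenever the object is visible
    return 'box' if frame else None


def track(frame, box):
    # Stand-in tracker: keeps the box while the object stays visible
    return box if frame else None


# True = object visible in the frame, False = object out of sight
frames = [True, True, True, False, False, True, True]
outputs, detector_calls = run_pipeline(frames, detect, track)
print(outputs, detector_calls)
```

In this toy run the detector is skipped on every frame where tracking succeeds, and only runs again once the object has been lost.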

Let's go ahead and install MediaPipe, import the Objectron solution and apply it to static images and a video feed coming straight from a camera.

Installing MediaPipe

Let's first install MediaPipe:

$ pip install mediapipe

If you're working in a Jupyter notebook, run !pip install mediapipe in a cell instead.

With the framework installed, let's import it alongside common libraries:

import mediapipe as mp

import cv2
import numpy as np
import matplotlib.pyplot as plt

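Let's define a helper method that fetches an image from a given URL and returns it as an RGB array:

import urllib.request

def url_to_array(url):
    # Download the raw image bytes
    req = urllib.request.urlopen(url)
    # Image bytes are unsigned, so the buffer has to be uint8 for cv2.imdecode()
    arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
    # Decode to a 3-channel BGR image, then convert it to RGB
    arr = cv2.imdecode(arr, cv2.IMREAD_COLOR)
    arr = cv2.cvtColor(arr, cv2.COLOR_BGR2RGB)
    return arr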

mug = 'https://goodstock.photos/wp-content/uploads/2018/01/Laptop-Coffee-Mug-on-Table.jpg'
mug = url_to_array(mug)

Finally, we'll want to import both the Objectron solution and the drawing utilities to visualize predictions:

mp_objectron = mp.solutions.objectron
mp_drawing = mp.solutions.drawing_utils

3D Object Detection on Static Images with MediaPipe

The Objectron class accepts several arguments, including:

  • static_image_mode: Whether you're feeding in an image or a stream of images (video)
  • max_num_objects: The maximum number of objects to detect
  • min_detection_confidence: The detection confidence threshold (how sure the network has to be to classify an object for the given class)
  • model_name: Which model you'd like to load - one of 'Cup', 'Shoe', 'Camera' and 'Chair'.
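As a small illustration of how these arguments fit together, the sketch below collects them into a dictionary and rejects unsupported values before the class is constructed. `build_objectron_kwargs` and `SUPPORTED_MODELS` are hypothetical names introduced here, not part of MediaPipe, and the default values are placeholders rather than recommendations.

```python
# The four model names listed above
SUPPORTED_MODELS = ('Cup', 'Shoe', 'Camera', 'Chair')


def build_objectron_kwargs(model_name, static_image_mode=True,
                           max_num_objects=5, min_detection_confidence=0.5):
    """Validate the arguments and return them as keyword arguments."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(
            f'Unsupported model_name {model_name!r}, expected one of {SUPPORTED_MODELS}')
    if max_num_objects < 1:
        raise ValueError('max_num_objects must be at least 1')
    if not 0.0 <= min_detection_confidence <= 1.0:
        raise ValueError('min_detection_confidence must be between 0 and 1')
    return {
        'static_image_mode': static_image_mode,
        'max_num_objects': max_num_objects,
        'min_detection_confidence': min_detection_confidence,
        'model_name': model_name,
    }


kwargs = build_objectron_kwargs('Cup')
print(kwargs)
```

The returned dictionary could then be unpacked into the constructor, for example mp_objectron.Objectron(**kwargs).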

With those in mind - let's instantiate an Objectron instance and process() the input image:

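# Instantiation (max_num_objects and min_detection_confidence are illustrative values)
objectron = mp_objectron.Objectron(
    static_image_mode=True,
    max_num_objects=5,
    min_detection_confidence=0.5,
    model_name='Cup')

# Inference
results = objectron.process(mug)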

The results contain the 2D and 3D landmarks of the detected object(s) as well as the rotation, translation and scale for each. We can process the results and draw the bounding boxes fairly easily using the provided drawing utils:

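if not results.detected_objects:
    print('No box landmarks detected.')

# Copy image so as not to draw on the original one
annotated_image = mug.copy()
# detected_objects is None when nothing is found, so fall back to an empty list
for detected_object in results.detected_objects or []:
    # Draw landmarks
    mp_drawing.draw_landmarks(annotated_image, detected_object.landmarks_2d, mp_objectron.BOX_CONNECTIONS)

    # Draw axis based on rotation and translation
    mp_drawing.draw_axis(annotated_image, detected_object.rotation, detected_object.translation)

# Plot result
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(annotated_image)
ax.axis('off')
plt.show()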

This plots the input image with the predicted 3D bounding box and axes drawn over each detected object.
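If you need the projected box points as pixel coordinates rather than a drawn overlay, the 2D landmarks can be converted by hand. MediaPipe reports 2D landmarks in normalized image coordinates, so the sketch below scales them by the image size and clamps them to the image bounds. The `Landmark` tuple is a hypothetical stand-in for MediaPipe's landmark objects so that the snippet runs on its own, and `landmarks_to_pixels` is a name introduced here for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a MediaPipe landmark with normalized x and y
Landmark = namedtuple('Landmark', ['x', 'y'])


def landmarks_to_pixels(landmarks, image_width, image_height):
    """Convert normalized (x, y) landmarks to integer pixel coordinates."""
    pixels = []
    for landmark in landmarks:
        # Scale to pixels, then clamp: projected points can fall outside the image
        px = min(max(int(landmark.x * image_width), 0), image_width - 1)
        py = min(max(int(landmark.y * image_height), 0), image_height - 1)
        pixels.append((px, py))
    return pixels


demo = [Landmark(0.5, 0.5), Landmark(0.0, 1.0), Landmark(1.2, -0.1)]
pixels = landmarks_to_pixels(demo, 640, 480)
print(pixels)
```

With real results, you would pass detected_object.landmarks_2d.landmark together with the width and height taken from annotated_image.shape.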

3D Object Detection from Video or Web-cam with MediaPipe

A more exciting application is on videos! You don't have to change the code much to accommodate videos, whether you're providing one from the web-cam or an existing video file. OpenCV is a natural fit for reading, manipulating and feeding video frames into the objectron model:

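# VideoCapture(0) tries to capture from the webcam
cap = cv2.VideoCapture(0)
# Or
# cap = cv2.VideoCapture('filename.mp4')

# The object count and confidence thresholds below are illustrative values
objectron = mp_objectron.Objectron(static_image_mode=False,
                                   max_num_objects=5,
                                   min_detection_confidence=0.5,
                                   min_tracking_confidence=0.7,
                                   model_name='Cup')

# Read video stream and feed into the model
while cap.isOpened():
    success, image = cap.read()
    if not success:
        break

    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = objectron.process(image)

    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    if results.detected_objects:
        for detected_object in results.detected_objects:
            mp_drawing.draw_landmarks(image, detected_object.landmarks_2d, mp_objectron.BOX_CONNECTIONS)
            mp_drawing.draw_axis(image, detected_object.rotation, detected_object.translation)

    cv2.imshow('MediaPipe Objectron', cv2.flip(image, 1))
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()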

Making the image non-writable with image.flags.writeable = False makes the process run somewhat faster, and is an optional change. The final cv2.flip() on the resulting image is also optional - and simply makes the output mirrored to make it a bit more intuitive.

When run on a camera and a globally common IKEA mug, the output is slightly jittery, but handles rotation and translation well, even with a shaky hand holding the low-resolution camera.

What happens when an object is taken out of the frame? A prediction is only made at the first detection, after which the box is tracked. Box tracking clearly picks up that the object has left the frame, and prediction and tracking are performed once again as soon as the object re-enters the frame. It appears that the tracking works somewhat better when the model can see the mug handle, as the outputs are more jittery when the handle is not visible (presumably because it's harder to accurately ascertain the true orientation of the mug).

Some angles also seem to produce significantly more stable outputs than others, particularly in challenging light conditions. For mugs specifically, it helps to be able to see the lip of the mug as it helps with perspective, rather than seeing an orthogonal projection of the object.

Finally, when tested on a transparent mug, the model had difficulty recognizing it as a mug. This is likely an example of an out-of-distribution object, as most mugs are opaque and have various colors.


3D object detection is still somewhat young, and MediaPipe's Objectron is a capable demonstration! While sensitive to lighting conditions, object types (transparent vs opaque mugs, etc.) and slightly jittery - Objectron is a good glimpse into what will soon be possible to do with higher accuracy and accessibility than ever before.

David Landup, Author

© 2013-2024 Stack Abuse. All rights reserved.