WALDO 3.0 (Whereabouts Ascertainment for Low-lying Detectable Objects) is an open-source artificial intelligence model designed for object detection in overhead imagery. Built upon the YOLOv8 (You Only Look Once, version 8) architecture, WALDO 3.0 leverages synthetic data and “augmented” / semi-synthetic data pipelines to enhance its object recognition capabilities. It is a powerful tool for analyzing aerial imagery from drones, satellites, and other overhead sources, and can detect objects in imagery captured at altitudes ranging from roughly 30 feet up to satellite level.
The development of WALDO 3.0 was driven by the need for an efficient, high-precision model capable of identifying various objects in aerial images. Traditional object detection methods struggle with variations in scale, lighting, and occlusion, which are common challenges in overhead imagery. WALDO 3.0 addresses these issues through its synthetic and semi-synthetic training data pipelines.
WALDO 3.0 supports the detection of 12 distinct object classes.
Because it can accurately detect a wide variety of objects in overhead imagery, WALDO 3.0 has applications in multiple fields. Some practical limitations apply:
Higher resolutions yield better accuracy.
Small or partially hidden objects may be more challenging to detect.
Extreme weather or poor lighting can degrade detection performance.
This implementation focuses on connecting to a live camera feed, either from a drone or a CCTV system, to detect objects in a specific area.
Initially, we considered using WALDO 3.0 for object detection. However, since WALDO 3.0 is itself built on YOLO (You Only Look Once), we opted to use YOLOv11 directly, given its robustness and extensive community support.
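As a rough sketch of how this could look, the snippet below runs a pretrained YOLOv11 model against a live feed using the Ultralytics Python API. The checkpoint name yolo11n.pt (the nano variant) and the camera source are assumptions; a real deployment would point source at the drone’s RTSP stream or the CCTV device index.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv11 checkpoint (nano variant; an assumption here,
# any yolo11* checkpoint works the same way).
model = YOLO("yolo11n.pt")

# source=0 opens the default webcam; an RTSP URL for a drone or CCTV
# stream works the same way. stream=True yields results frame by frame
# instead of buffering the whole feed.
for result in model.predict(source=0, stream=True, conf=0.25):
    for box in result.boxes:
        cls_name = model.names[int(box.cls)]
        print(f"{cls_name}: conf={float(box.conf):.2f}, xyxy={box.xyxy[0].tolist()}")
```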
For the initial testing phase, instead of using live feeds, we used MP4 video files to inspect and evaluate the results of the object detection tools. This approach allowed us to validate the performance of YOLO before integrating it with real-time camera streams.
A variety of MP4 videos with different resolutions and pixel dimensions were used to evaluate YOLO’s performance across various video qualities. Additionally, some of the videos were converted to a uniform resolution of 640×640 pixels to assess how YOLO performs when the input size is standardized.
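As one way to standardize the clips, the sketch below re-encodes an MP4 to 640×640 with OpenCV. The file names are placeholders, and note that a naive square resize like this distorts the aspect ratio; a letterbox variant appears later in this section.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")        # placeholder file name
fps = cap.get(cv2.CAP_PROP_FPS) or 30      # fall back if FPS is unreadable
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter("input_640.mp4", fourcc, fps, (640, 640))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Naive square resize; see the letterbox variant later in this section.
    out.write(cv2.resize(frame, (640, 640)))

cap.release()
out.release()
```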
The model successfully detects multiple objects and tracks them across frames. However, due to video format inconsistencies, YOLO occasionally hallucinates objects, detecting items that are not actually present. Pre-processing the video, such as normalizing resolution, adjusting brightness/contrast, and stabilizing frames, can significantly enhance YOLO’s performance, reducing false detections and improving tracking accuracy.
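A minimal sketch of that kind of frame-level pre-processing is shown below, assuming OpenCV. The alpha/beta brightness/contrast values are illustrative defaults that would need tuning per camera, and frame stabilization is omitted since it typically requires motion estimation across frames.

```python
import cv2

def preprocess(frame, size=(640, 640), alpha=1.2, beta=15):
    """Resize a frame and apply a linear brightness/contrast correction.

    alpha scales contrast and beta shifts brightness; both values here
    are illustrative defaults, not tuned settings.
    """
    frame = cv2.resize(frame, size)
    # new_pixel = clip(alpha * pixel + beta, 0, 255)
    return cv2.convertScaleAbs(frame, alpha=alpha, beta=beta)
```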
YOLO is a state-of-the-art, real-time object detection system that offers high accuracy and speed. Unlike traditional object detection techniques that rely on region proposals (e.g., R-CNN, Fast R-CNN), YOLO frames the detection problem as a single regression problem. It predicts bounding boxes and class probabilities directly from an image in a single evaluation.
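To make the single-evaluation idea concrete, the snippet below runs one forward pass with the Ultralytics API and reads boxes, confidences, and class ids from the same result object; the image path and checkpoint name are placeholders.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")        # placeholder checkpoint
result = model("frame.jpg")[0]    # one forward pass, one Results object

# Boxes, confidences, and class ids all come out of that single pass;
# there is no separate region-proposal stage to run first.
for box in result.boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(model.names[int(box.cls)], round(float(box.conf), 2), (x1, y1, x2, y2))
```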
For YOLO to detect objects accurately, preprocessing will be performed following the Ultralytics guidance linked below:
https://docs.ultralytics.com/guides/preprocessing_annotated_data/#data-preprocessing-techniques
https://docs.ultralytics.com/models/yolov7/
Resizing the video to 640×640 can improve YOLO’s object detection performance by ensuring the model receives inputs in the expected format. However, resizing must be done correctly to avoid loss of detail or distortion.
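One common way to resize without distortion is letterboxing: scale the frame to fit inside 640×640 while preserving the aspect ratio, then pad the remainder. The sketch below is a hand-rolled version of this step; the gray pad color (114, 114, 114) follows the usual YOLO convention.

```python
import cv2

def letterbox(frame, size=640, pad_color=(114, 114, 114)):
    """Scale a frame to fit inside size x size, padding the rest."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(frame, (new_w, new_h))
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    # Pad with gray borders so the output is exactly size x size.
    return cv2.copyMakeBorder(
        resized,
        top, size - new_h - top,
        left, size - new_w - left,
        cv2.BORDER_CONSTANT, value=pad_color,
    )
```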
Below is an example of YOLO detecting objects in a traffic scene. The left side shows the processed frame with detected objects labeled, while the right side displays the original frame.
This work integrates YOLOv11 for real-time object detection using live camera feeds from drones or CCTV systems. Initially, MP4 videos were used for testing, revealing that while YOLO accurately detects and tracks multiple objects, video format inconsistencies sometimes cause hallucinated detections. Pre-processing techniques such as resolution normalization, FPS adjustment, and frame stabilization can significantly enhance accuracy. By fine-tuning YOLO, optimizing video processing, and integrating tracking algorithms like DeepSORT, the system can improve detection reliability. Future steps include refining model performance, incorporating additional AI models, and integrating reporting for real-time analytics.
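As a sketch of the tracking step: Ultralytics ships ByteTrack and BoT-SORT out of the box via model.track(), whereas a DeepSORT integration would instead feed per-frame detections into a third-party DeepSORT package. The snippet below uses the built-in tracker as a stand-in; the video path is a placeholder.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
# tracker="bytetrack.yaml" selects Ultralytics' built-in ByteTrack;
# persist=True keeps track ids stable across frames.
for result in model.track(source="traffic.mp4", tracker="bytetrack.yaml",
                          persist=True, stream=True):
    if result.boxes.id is None:   # the tracker may emit frames with no ids
        continue
    for box, track_id in zip(result.boxes, result.boxes.id.int().tolist()):
        print(f"id={track_id} cls={model.names[int(box.cls)]}")
```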