Real-time object detection is central to the development of intelligent transportation
systems, autonomous vehicles, smart city monitoring, and pedestrian safety functionalities.
Among several deep learning-based approaches, the You Only Look Once (YOLO) series
of object detectors has been among the top choices for a long time due to the tradeoff it has attained between accuracy and inference speed. This thesis gives an in-depth
review and experimental analysis of YOLO versions 7–12, utilized for the use of real-time
vehicle and pedestrian object detection. While earlier versions of YOLO were performance
competitive, they were poor at occlusion, low-light, and detection of small objects. In order
to overcome these weaknesses, this paper proposes a novel YOLOv12-hybrid feature fusion
model that integrates transformer-based attention mechanisms, bidirectional multi-scale
feature aggregation, and RGB, depth, and semantic segmentation cross-modal input fusion.
Large-scale experiments were conducted on the COCO 2017 dataset, comparing each iteration
of YOLO based on mean Average Precision (mAP), training loss convergence, and inference
speed (FPS). The results establish that YOLOv12 surpasses previous models, with an mAP
of 88.2% and over 47 FPS inference rates, and yet offers consistent detection in challenging
urban settings. Contrast against the traditional and state-of-the-art models also indicates
the dominance of YOLOv12 for real-world deployment. This work not only establishes the
benchmark for YOLO detector development but also offers a scalable, accurate, and real-time
enabled model structure tailored for safety-critical use cases in traffic monitoring, autonomous
vehicles, and smart infrastructure systems.
Keywords: YOLO, object detection, hybrid feature fusion, FPS, CNN, FPN, deep learning,
traffic systems |