Real-time robotic perception on resource-constrained hardware demands reliable object
detection where RGB-only systems routinely fail — specular surfaces, low texture, motion blur,
and occlusion. This thesis presents a Mixture-of-Experts (MoE) perception architecture fusing a
TensorRT-accelerated YOLOv8 segmentation expert with a GPU-based depth expert, coordinated
by a deterministic gating mechanism that routes fusion decisions per instance at inference time.
Unlike existing RGB-D fusion approaches requiring joint training on annotated datasets, the
proposed system is training-free at fusion time. A GPU RANSAC support-plane fit serves as a live
depth reliability signal, enabling per-instance routing (RGB-led, depth-assisted, or depth-proposed)
without modifying the underlying RGB detector.
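The routing idea can be summarized in a minimal sketch. The field names, thresholds, and the use of the plane-fit inlier ratio as the depth reliability score below are illustrative assumptions, not the exact thesis implementation.

```python
# Minimal sketch of deterministic, training-free per-instance gating.
# Thresholds and names are illustrative assumptions, not the thesis code.
from dataclasses import dataclass
from enum import Enum, auto

class Route(Enum):
    RGB_LED = auto()         # trust the RGB instance as-is
    DEPTH_ASSISTED = auto()  # refine the RGB instance with depth support
    DEPTH_PROPOSED = auto()  # propose the instance from depth alone

@dataclass
class GateInputs:
    rgb_confidence: float      # YOLOv8 instance score in [0, 1]
    plane_inlier_ratio: float  # RANSAC support-plane inlier ratio in [0, 1]

def gate(x: GateInputs,
         rgb_thresh: float = 0.5,
         depth_thresh: float = 0.6) -> Route:
    """Route one detected instance using only live, unlearned signals."""
    depth_reliable = x.plane_inlier_ratio >= depth_thresh
    if x.rgb_confidence >= rgb_thresh:
        # RGB expert is confident: depth is used only to refine, if trustworthy.
        return Route.DEPTH_ASSISTED if depth_reliable else Route.RGB_LED
    # RGB expert is weak (blur, darkness, occlusion): fall back to depth if reliable.
    return Route.DEPTH_PROPOSED if depth_reliable else Route.RGB_LED

if __name__ == "__main__":
    print(gate(GateInputs(rgb_confidence=0.12, plane_inlier_ratio=0.81)))  # Route.DEPTH_PROPOSED
```

Because the gate is a fixed decision rule over measured signals, its behavior is deterministic and auditable, which is what permits the per-instance routing statistics reported below.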
The system is evaluated across four areas: detection robustness, contextual reasoning,
embedded real-time performance, and interpretability. Under RGB failure conditions including
missing backgrounds, darkness, motion blur, and occlusion, RGB-only recall ranges from 2.2% to
79.8%, while the proposed MoE achieves 63.6% to 96.4% recall and recovers 60.0% to 78.9% of
objects missed by RGB-only detection. Gate statistics indicate 97.9–100% RGB-led operation under
stable conditions, rising to 40.1–87.7% depth-led under failure, consistent with modality-adaptive
routing in the absence of any learned signal. Performance benchmarks on the Jetson Orin NX and AGX Xavier
demonstrate 10.8–36.0 FPS and 8.5–30.2 FPS, respectively, meeting soft real-time requirements
under worst-case clutter. Observed failure modes are diagnosable from intermediate gate signals,
supporting interpretability claims.
Keywords: RGB-D perception, mixture of experts, embedded robotics, instance segmentation,
sensor fusion, interpretability