Object detection and instance segmentation play an important role in autonomous driving, where vehicles must perceive their surroundings reliably. In practice, these tasks are commonly addressed using separate models, which increases both training complexity and deployment cost. To overcome this issue, we propose UniPercepNet-S, a lightweight dual-task framework inspired by YOLOF that brings detection and segmentation into a single unified network, aiming to support real-time perception in resource-constrained environments.
UniPercepNet-S follows a YOLOF-style one-level detection design and strengthens the backbone with a channel attention module to improve feature quality. To enable instance segmentation, we add a simple yet efficient mask prediction branch that operates directly on detected objects while keeping computation low. We evaluate the proposed framework on MS COCO and BDD100K, covering both general object segmentation and autonomous-driving-oriented scenarios.
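The abstract does not state which channel attention design is used, so the following is only a minimal sketch of the general idea, assuming a squeeze-and-excitation style module. The function name `channel_attention`, the reduction ratio `r`, and the random weights `w1` and `w2` are hypothetical and chosen purely for illustration; they are not taken from UniPercepNet-S.

```python
import numpy as np


def channel_attention(features, w1, w2):
    """Assumed SE-style channel attention on a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); both are
    illustrative weights, not parameters of the actual model.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))                 # shape (C,)
    # Excitation: bottleneck projection with ReLU, then a sigmoid gate.
    hidden = np.maximum(w1 @ squeezed, 0.0)               # shape (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))        # shape (C,)
    # Rescale: each channel is multiplied by its gate value in (0, 1).
    return features * weights[:, None, None], weights


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                                   # hypothetical sizes
features = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

attended, weights = channel_attention(features, w1, w2)
print(attended.shape, weights.round(3))
```

The gate adds only two small matrix products per feature map, which is why this family of modules is often used where the compute budget is tight.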
The proposed UniPercepNet-S achieves a mask AP of 38.0 on MS COCO, placing it among the top-performing entries in the COCO Detection Challenge for segmentation tasks. On BDD100K, which reflects real-world driving conditions, the model reaches an AP of 20.3, showing that it generalizes well across different datasets. These results suggest that UniPercepNet-S can deliver accurate detection and segmentation while remaining suitable for real-time use.
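For readers less familiar with the metric, mask AP is built on the intersection-over-union (IoU) between predicted and ground-truth masks, and the standard COCO protocol averages AP over ten IoU thresholds from 0.50 to 0.95. The sketch below shows only the IoU step on toy binary masks; the helper `mask_iou` and the toy arrays are illustrative assumptions, not part of the evaluation code behind the reported numbers.

```python
import numpy as np


def mask_iou(pred, gt):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treat the overlap as zero by convention here.
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


# Toy 4x4 masks: each covers two full rows, overlapping in one row.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, :] = True

iou = mask_iou(pred, gt)                   # 4 shared pixels / 12 in the union
thresholds = np.linspace(0.50, 0.95, 10)   # COCO-style IoU thresholds
matched = iou >= thresholds                # whether the prediction counts at each threshold
print(iou, matched)
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth mask meets that threshold, so the averaged metric rewards precise mask boundaries rather than loose overlaps.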
This work contributes a unified and lightweight one-level framework that performs object detection and instance segmentation simultaneously, avoiding the need for heavy multi-scale architectures or separate task-specific models. By combining attention-enhanced representations with an efficient segmentation branch, UniPercepNet-S provides a practical solution for real-time perception. Its balance between simplicity, accuracy, and speed makes it especially valuable for autonomous driving and other embedded vision applications.
