This study aims to realize long-horizon robotic manipulation guided by implicit instructions that convey real intent through metaphors, emotional expressions and other indirect means.
First, this study proposed shared attributes to enhance the ability of vision-language models (VLMs) to infer explicit intentions from implicit instructions. Specifically, the VLM was fine-tuned with added shared attributes. These are the attributes with the highest similarity among those extracted from images and from explicit instructions, and they bridge intrinsic cross-modal semantic mappings between implicit expressions and explicit intentions. Owing to the lack of relevant data, an implicit-instruction-based data set was constructed for fine-tuning the VLMs. Then, a hierarchical learning strategy was introduced to map explicit instructions to robotic controller parameters through a planning module, a sequencing module and an interaction learning module.
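The selection of shared attributes by highest similarity can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names `cosine` and `shared_attributes`, the use of cosine similarity as the similarity measure, and the toy three-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def shared_attributes(image_attrs, text_attrs, top_k=1):
    """Return the top_k (image_attr, text_attr, score) pairs by similarity.

    image_attrs / text_attrs map an attribute name to its embedding.
    Hypothetical helper: the similarity measure is an assumption.
    """
    pairs = [
        (cosine(u, v), img_name, txt_name)
        for img_name, u in image_attrs.items()
        for txt_name, v in text_attrs.items()
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return [(img, txt, score) for score, img, txt in pairs[:top_k]]


# Toy embeddings (made up): attributes extracted from an image and
# from an explicit instruction.
image_attrs = {"cold": [1.0, 0.0, 0.1], "red": [0.0, 1.0, 0.0]}
text_attrs = {"chilled": [0.9, 0.1, 0.0], "sweet": [0.0, 0.2, 1.0]}

best = shared_attributes(image_attrs, text_attrs, top_k=1)
print(best)
```

In this toy case the pair ("cold", "chilled") scores highest, so it would serve as the shared attribute linking the visual scene to the explicit intention.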
In the experiments, the fine-tuned VLM achieved state-of-the-art performance on both the data set constructed in this study and the public VAGUE benchmark, and the proposed method successfully executed ten implicit-instruction-guided robotic manipulation tasks in simulation and eight in the real world.
This work integrates implicit instructions into robot manipulation. To the best of the authors’ knowledge, this is the first study to introduce long-horizon robotic manipulation guided by implicit instructions. The authors propose fine-tuning the VLM with added shared attributes, bridging intrinsic cross-modal semantic mappings between implicit expressions and explicit intentions. This study further introduces a hierarchical learning strategy to enable efficient transformation from implicit instructions to executable operations. This approach provides new perspectives on language-conditioned robotic manipulation and has the potential to be extended to a wide range of human-centered manipulation tasks. The video can be found here.
