We investigate the use of Multimodal Large Language Models (MLLMs) with in-context learning for closed-loop task planning in instruction-following manipulation. We identify four essential requirements for successful task planning: quantity estimation, reachability analysis, relative positioning, and collision avoidance. However, existing benchmarks fail to support holistic evaluation across all these aspects. To address this gap, we introduce \textbf{QuARC} (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario that integrates all four challenges. Using QuARC, we reveal two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility. To tackle these, we adapt Chain-of-Thought with Self-Consistency to mitigate reasoning loss from cross-modal distractions and incorporate an affordance predictor to guide planning based on geometric feasibility. Our comprehensive evaluation analyzes performance across multiple baselines and explains sources of improvement. Our method achieves a 76.7\% success rate on the benchmark, significantly outperforming the ViLa baseline (36.7\%), without requiring additional finetuning.
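To illustrate the idea sketched in the abstract, the snippet below is a minimal, hedged sketch of self-consistent plan selection combined with an affordance-based feasibility filter: sample several chain-of-thought plans from an MLLM, majority-vote over the proposed next actions, and keep the top-voted action that the affordance check accepts. The function names (`sample_plan`, `is_feasible`) and the toy actions are hypothetical stand-ins, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_plan(
    sample_plan: Callable[[], str],
    is_feasible: Callable[[str], bool],
    num_samples: int = 5,
) -> Optional[str]:
    """Sample several chain-of-thought rollouts and return the most frequent
    proposed next action that passes the affordance (feasibility) check."""
    votes = Counter(sample_plan() for _ in range(num_samples))
    # Walk candidates from most to least voted; keep the first feasible one.
    for action, _count in votes.most_common():
        if is_feasible(action):
            return action
    return None  # no geometrically feasible candidate among the samples


if __name__ == "__main__":
    # Toy stand-ins: a real system would query an MLLM for each sample and a
    # learned affordance predictor for the feasibility check.
    import random

    candidate_actions = ["pick(tomato)", "pick(knife)", "pick(tomato)"]
    sample = lambda: random.choice(candidate_actions)
    feasible = lambda a: "knife" not in a  # pretend the knife is unreachable
    print(self_consistent_plan(sample, feasible, num_samples=7))
```

The majority vote addresses variance introduced by cross-modal distraction (an aberrant rollout is outvoted), while the feasibility filter rejects actions that are geometrically infeasible before they reach execution.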
@misc{shen2025mitigatingcrossmodaldistractionensuring,
      title={Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation},
      author={Yu-Hong Shen and Chuan-Yu Wu and Yi-Ru Yang and Yen-Ling Tai and Yi-Ting Chen},
      year={2025},
      eprint={2503.13055},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2503.13055},
}