Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation

1Department of Computer Science, National Yang Ming Chiao Tung University
2Department of Computer Science and Information Engineering, National Taiwan University
*Indicates Equal Contribution

CoRL 2025 Workshop iCare

We identify four categories of failure when using Multimodal Large Language Models (MLLMs) with in-context learning for food preparation task planning: MLLMs fail to compare quantities between bowls, to identify which bowls need repositioning before scooping, to recognize spatial relationships between objects, and to consider moving objects out of the way to avoid collisions. As a result, the robot may not follow the instructions properly and may even spill a bowl's contents.

Abstract

We investigate the use of Multimodal Large Language Models (MLLMs) with in-context learning for closed-loop task planning in instruction-following manipulation. We identify four essential requirements for successful task planning: quantity estimation, reachability analysis, relative positioning, and collision avoidance. However, existing benchmarks fail to support holistic evaluation across all these aspects. To address this gap, we introduce QuARC (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario that integrates all four challenges. Using QuARC, we reveal two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility. To tackle these, we adapt Chain-of-Thought with Self-Consistency to mitigate reasoning loss from cross-modal distractions and incorporate an affordance predictor to guide planning based on geometric feasibility. Our comprehensive evaluation analyzes performance across multiple baselines and explains the sources of improvement. Our method achieves a 76.7% success rate on the benchmark, significantly outperforming the ViLa baseline (36.7%), without requiring additional finetuning.
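As a rough illustration of the idea described above (not the paper's actual implementation), self-consistent planning with an affordance gate can be sketched as: sample several chain-of-thought plans from the MLLM, discard candidates the affordance predictor deems geometrically infeasible, and majority-vote over the survivors. The names `sample_plan` and `affordance_ok` below are hypothetical stand-ins for the MLLM sampler and the affordance predictor.

```python
from collections import Counter

def self_consistent_plan(sample_plan, affordance_ok, n_samples=5):
    """Sketch of affordance-gated self-consistency (illustrative only).

    sample_plan()     -- hypothetical: draws one CoT-sampled next action
    affordance_ok(a)  -- hypothetical: geometric-feasibility check
    Returns the majority-vote feasible action, or None if none pass.
    """
    votes = Counter()
    for _ in range(n_samples):
        action = sample_plan()       # one sampled reasoning chain's action
        if affordance_ok(action):    # keep only geometrically feasible plans
            votes[action] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy usage with deterministic stubbed samples:
samples = iter(["scoop(bowl_A)", "scoop(bowl_A)", "move(bowl_B)",
                "scoop(bowl_A)", "move(bowl_B)"])
print(self_consistent_plan(lambda: next(samples), lambda a: True))
# scoop(bowl_A) wins the vote 3-2
```

Filtering before voting means an action that reads well in language but is unreachable or collision-prone never accumulates votes, which is the intuition behind combining the two components.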

Method

Video Presentation

BibTeX

@misc{shen2025mitigatingcrossmodaldistractionensuring,
    title={Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation}, 
    author={Yu-Hong Shen and Chuan-Yu Wu and Yi-Ru Yang and Yen-Ling Tai and Yi-Ting Chen},
    year={2025},
    eprint={2503.13055},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2503.13055}, 
}