We investigate the use of Multimodal Large Language Models (MLLMs) with in-context learning for closed-loop task planning in instruction-following manipulation. We identify four essential requirements for successful task planning: quantity estimation, reachability analysis, relative positioning, and collision avoidance. However, existing benchmarks fail to support holistic evaluation across all these aspects. To address this gap, we introduce \textbf{QuARC} (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario that integrates all four challenges. Using QuARC, we reveal two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility. To tackle these, we adapt Chain-of-Thought with Self-Consistency to mitigate reasoning loss from cross-modal distractions and incorporate an affordance predictor to guide planning based on geometric feasibility. Our comprehensive evaluation analyzes performance across multiple baselines and explains sources of improvement. Our method achieves a 76.7\% success rate on the benchmark, significantly outperforming the ViLa baseline (36.7\%), without requiring additional finetuning.
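To illustrate the idea sketched in the abstract, the snippet below is a minimal, hedged sketch of self-consistent plan selection combined with an affordance-based feasibility filter: sample several chain-of-thought plans from an MLLM, majority-vote over the proposed next actions, and keep the top-voted action that the affordance check accepts. The function names (`sample_plan`, `is_feasible`) and the toy actions are hypothetical stand-ins, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_plan(
    sample_plan: Callable[[], str],
    is_feasible: Callable[[str], bool],
    num_samples: int = 5,
) -> Optional[str]:
    """Sample several chain-of-thought rollouts and return the most frequent
    proposed next action that passes the affordance (feasibility) check."""
    votes = Counter(sample_plan() for _ in range(num_samples))
    # Walk candidates from most to least voted; keep the first feasible one.
    for action, _count in votes.most_common():
        if is_feasible(action):
            return action
    return None  # no geometrically feasible candidate among the samples


if __name__ == "__main__":
    # Toy stand-ins: a real system would query an MLLM for each sample and a
    # learned affordance predictor for the feasibility check.
    import random

    candidate_actions = ["pick(tomato)", "pick(knife)", "pick(tomato)"]
    sample = lambda: random.choice(candidate_actions)
    feasible = lambda a: "knife" not in a  # pretend the knife is unreachable
    print(self_consistent_plan(sample, feasible, num_samples=7))
```

The majority vote addresses variance introduced by cross-modal distraction (an aberrant rollout is outvoted), while the feasibility filter rejects actions that are geometrically infeasible before they reach execution.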
@misc{shen2025mitigatingcrossmodaldistractionensuring,
      title={Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation},
      author={Yu-Hong Shen and Chuan-Yu Wu and Yi-Ru Yang and Yen-Ling Tai and Yi-Ting Chen},
      year={2025},
      eprint={2503.13055},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2503.13055},
}