符毓
(2025-05-31 22:59):
#paper doi: 10.48550/arXiv.2505.21906, 2025, Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge.
Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, although existing end-to-end VLA systems build on powerful pre-trained vision-language models (VLMs), they often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. The authors argue that a generalizable VLA model should retain and extend the VLM's core competencies: 1) open-world embodied reasoning: the VLA should inherit the VLM's knowledge, i.e., recognize anything the VLM can recognize, solve math problems, and possess visual-spatial intelligence; 2) reasoning following: effectively translating open-world reasoning into steps the robot can execute.
This paper introduces ChatVLA-2, which equips vision-language-action (VLA) models to perform diverse tasks by exploiting, end to end, the innate reasoning and understanding abilities acquired by the pretrained vision-language model. The core contribution is a dynamic Mixture-of-Experts (MoE) module integrated on top of the pretrained vision-language backbone. This module handles differing task demands effectively: some experts share universal multimodal features, while others specialize in task-specific representations. In addition, the authors propose a two-stage training strategy: first, the VLA model is guided to establish the link between pretrained multimodal knowledge and robot actions; then a reasoning-following stage is introduced so the model can understand reasoning outputs and effectively translate them into corresponding actions.
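A minimal sketch of what such a dynamic MoE block could look like, assuming a PyTorch implementation. The class names, expert counts, and top-k routing scheme below are illustrative assumptions, not the authors' code: shared experts are always applied to capture general multimodal features, while a per-token router selects task-specific experts over the VLM backbone's hidden states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A simple feed-forward expert operating on token hidden states."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DynamicMoE(nn.Module):
    """Mixes always-active shared experts with routed task-specific experts."""
    def __init__(self, d_model: int = 768, d_hidden: int = 3072,
                 n_shared: int = 1, n_task: int = 4, top_k: int = 2):
        super().__init__()
        self.shared = nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_shared)])
        self.task = nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_task)])
        self.router = nn.Linear(d_model, n_task)   # per-token routing logits
        self.top_k = top_k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states from the pretrained VLM backbone
        out = sum(e(h) for e in self.shared)        # shared multimodal features

        logits = self.router(h)                     # (batch, seq, n_task)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        # Sparsely add the selected task-specific experts per token.
        for k in range(self.top_k):
            for e_id, expert in enumerate(self.task):
                mask = (idx[..., k] == e_id).unsqueeze(-1)   # tokens routed to this expert
                if mask.any():
                    out = out + mask * weights[..., k:k + 1] * expert(h)
        return out


if __name__ == "__main__":
    h = torch.randn(2, 16, 768)      # stand-in for VLM hidden states
    print(DynamicMoE()(h).shape)     # torch.Size([2, 16, 768])
```

In this reading, the shared experts preserve the backbone's general multimodal representation while the routed experts absorb robot-task-specific features, which matches the paper's stated goal of retaining VLM capabilities during action fine-tuning.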
arXiv, 2025-05-28T02:48:42Z.
DOI: 10.48550/arXiv.2505.21906
Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge
Abstract:
Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning - the VLA should inherit the knowledge from VLM, i.e., recognize anything that the VLM can recognize, capable of solving math problems, possessing visual-spatial intelligence, 2) Reasoning following - effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-expert VLA model coupled with a specialized three-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.