Literature shared by user Kunji.
1 paper share found.
1. Kunji (2025-02-28 23:59):
#paper, https://arxiv.org/pdf/2410.05273, HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. VLA models depend on VLMs with billions of parameters; while this gives them strong generalization, the high computational cost and slow inference limit their use in dynamic tasks. To address these limitations, the paper proposes HiRT (Hierarchical Robot Transformer framework), which draws on the dual-process theory of human cognition and adopts a dual-system architecture with an asynchronous operation mechanism to balance control frequency against performance. Experiments in both simulated and real-world settings show significant improvements: on static tasks HiRT doubles the control frequency while achieving comparable success rates, and on real-world dynamic manipulation tasks that previous VLA models struggled with, it raises the success rate from 48% to 75%.
arXiv, 2024-09-12.
DOI: 10.48550/arXiv.2410.05273
Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
Abstract:
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Models (VLMs) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off. HiRT keeps VLMs running at low frequencies to capture temporarily invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experiment results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
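The slow-fast hierarchy described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the function names (slow_vlm, fast_policy, control_loop), the stand-in computations inside them, and the refresh period k are all assumptions; in HiRT the slow and fast paths run asynchronously, whereas this sketch approximates that with a periodic refresh inside one loop.

```python
def slow_vlm(image, instruction):
    # Stand-in for a billion-parameter VLM: produces a latent conditioning
    # vector from the current observation and the language instruction.
    return [float(len(instruction)), float(sum(image))]

def fast_policy(image, latent):
    # Stand-in for the lightweight high-frequency visuomotor policy,
    # conditioned on the most recently updated slow latent.
    return [latent[0] + sum(image), latent[1]]

def control_loop(images, instruction, k=5):
    """Run the fast policy at every control step; refresh the slow
    VLM latent only every k steps (low frequency)."""
    actions = []
    latent = None
    for t, img in enumerate(images):
        if t % k == 0:
            latent = slow_vlm(img, instruction)   # slow path, low frequency
        actions.append(fast_policy(img, latent))  # fast path, every step
    return actions
```

The point of the pattern is that the expensive call runs 1/k as often as the cheap one, so overall control frequency is set by the fast policy rather than the VLM.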