Kunji
(2025-02-28 23:59):
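#paper, https://arxiv.org/pdf/2410.05273, HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. VLA models rely on VLMs with billions of parameters; they generalize strongly, but their high computational cost and slow inference limit their use in dynamic tasks. To address these limitations, the paper proposes HiRT (Hierarchical Robot Transformer framework), which draws on the dual-process theory of human cognition and uses a dual-system architecture with an asynchronous operation mechanism to balance control frequency against performance. Experiments in simulation and in real-world settings show significant improvements. In static tasks, the control frequency doubles while the success rate stays comparable. In addition, on real-world dynamic manipulation tasks that earlier VLA models struggled with, HiRT raises the success rate from 48% to 75%.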
arXiv,
2024-09-12T09:18:09Z.
DOI: 10.48550/arXiv.2410.05273
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Abstract:
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Models (VLMs) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off. HiRT keeps VLMs running at low frequencies to capture temporarily invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experiment results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
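To make the slow/fast split described in the abstract concrete, here is a minimal toy sketch of a dual-rate control loop. It is not the HiRT implementation: the class `SlowFastController`, the `slow_every` ratio, and the scalar stand-ins for the VLM and the policy are all hypothetical, and the asynchronous execution is approximated with a simple step counter in a single thread.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SlowFastController:
    """Toy dual-rate loop: an expensive encoder refreshes a cached latent
    every `slow_every` steps, while a cheap policy acts on every step."""
    slow_every: int = 4      # hypothetical ratio of fast rate to slow rate
    latent: float = 0.0      # cached output of the slow module
    slow_calls: int = 0      # how often the expensive module actually ran
    log: List[float] = field(default_factory=list)

    def slow_encode(self, observation: float) -> float:
        # Stand-in for a large VLM forward pass (assumption: scalar output).
        self.slow_calls += 1
        return 10.0 * observation

    def fast_policy(self, observation: float) -> float:
        # Stand-in for a lightweight policy conditioned on the cached latent.
        return self.latent + observation

    def step(self, t: int, observation: float) -> float:
        # Refresh the slow feature only on every `slow_every`-th step.
        if t % self.slow_every == 0:
            self.latent = self.slow_encode(observation)
        # The fast policy runs on every step, reusing the cached latent.
        action = self.fast_policy(observation)
        self.log.append(action)
        return action


def run(num_steps: int = 8, slow_every: int = 4) -> SlowFastController:
    ctrl = SlowFastController(slow_every=slow_every)
    for t in range(num_steps):
        ctrl.step(t, observation=float(t))
    return ctrl
```

Calling `run(num_steps=8, slow_every=4)` produces eight actions while invoking the expensive encoder only twice, which is the frequency-versus-cost trade-off the abstract refers to. A real system would presumably run the two modules concurrently rather than in one loop.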
Related Links: