刘馨云 (2025-05-31 21:32):
#paper https://arxiv.org/pdf/2505.20290 Humans learn new tasks by observing others. Inspired by this, we propose EgoZero, a framework that learns closed-loop robot policies from egocentric (first-person) videos recorded by humans wearing smart glasses. Smart glasses capture a rich, multimodal first-person view of human interaction: RGB video records the surrounding scene, an IMU (inertial measurement unit) provides head-motion information, and microphones capture speech and ambient sound. Our method learns to act purely by observing these egocentric videos, without any robot demonstrations. Given a video of a human completing a task, EgoZero predicts a sequence of intermediate goals and language subgoals, and uses them to execute the task in closed loop on a real robot. EgoZero compresses human observations into morphology-agnostic state representations that can be used for decision-making and closed-loop control. The learned policies generalize well across robot morphologies, environments, and tasks. We validate the approach on a real Franka Panda arm, showing that EgoZero completes seven challenging manipulation tasks with a 70% zero-shot success rate, using only 20 minutes of data collection per task.
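To make the closed-loop idea concrete, here is a minimal sketch: the raw observation is compressed to a morphology-agnostic state (here, just 3D points for the end effector and the object), and a simple policy steps the end effector toward the current goal, re-reading the state every step. All names (`extract_state`, `policy_step`, `run_episode`) and the point-based state are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def extract_state(observation):
    """Compress a raw observation into a morphology-agnostic state:
    here, end-effector and object positions as 3D points."""
    return np.asarray(observation["ee_pos"], dtype=float), \
           np.asarray(observation["obj_pos"], dtype=float)

def policy_step(ee_pos, goal_pos, step_size=0.05):
    """Move the end effector one small step toward the goal point."""
    direction = goal_pos - ee_pos
    dist = np.linalg.norm(direction)
    if dist <= step_size:
        return goal_pos.copy()
    return ee_pos + step_size * direction / dist

def run_episode(observation, max_steps=100, tol=0.01):
    """Closed loop: re-extract state and act until the end effector
    reaches the object, or give up after max_steps."""
    ee, obj = extract_state(observation)
    for _ in range(max_steps):
        if np.linalg.norm(obj - ee) < tol:
            return True
        ee = policy_step(ee, obj)
    return False
```

In the actual system the state would come from egocentric video rather than a simulator dictionary, and the policy would be learned from human data; the loop structure (observe, compress, act) is the part this sketch illustrates.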
arXiv, 2025-05-26T17:59:17Z. DOI: 10.48550/arXiv.2505.20290
EgoZero: Robot Learning from Smart Glasses
Vincent Liu, Ademi Adeniji, Haotian Zhan, Raunaq Bhirangi, Pieter Abbeel, Lerrel Pinto
Abstract:
Despite recent progress in general purpose robotics, robot policies still lag far behind basic human capabilities in the real world. Humans interact constantly with the physical world, yet this rich data resource remains largely untapped in robot learning. We propose EgoZero, a minimal system that learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses, **and zero robot data**. EgoZero enables: (1) extraction of complete, robot-executable actions from in-the-wild, egocentric, human demonstrations, (2) compression of human visual observations into morphology-agnostic state representations, and (3) closed-loop policy learning that generalizes morphologically, spatially, and semantically. We deploy EgoZero policies on a gripper Franka Panda robot and demonstrate zero-shot transfer with 70% success rate over 7 manipulation tasks and only 20 minutes of data collection per task. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning - paving the way toward a future of abundant, diverse, and naturalistic training data for robots. Code and videos are available at https://egozero-robot.github.io.