符毓 (2025-12-31 17:21):
#paper doi: 10.48550/arXiv.2512.16907, 2025, Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Meta presents the EgoMAN dataset, a large-scale egocentric benchmark for 6DoF hand trajectory prediction, together with a corresponding prediction model: a modular reasoning-to-motion framework that aligns high-level intent with physically grounded 6DoF trajectories through a trajectory-token interface and progressive training. Experiments show clear advantages over motion-only and VLM-based baselines: flow matching produces smoother, more stable trajectories; VLM-driven reasoning improves semantic alignment and generalization to novel scenes and intents; and the trajectory-token interface enables efficient inference, coupling intent-based, stage-aware reasoning with precise low-level motion generation. Overall, EgoMAN is a practical step toward context-aware action prediction, supporting applications such as robotic manipulation, language-aware motion synthesis, and intent-aware assistive systems.

A major bottleneck of earlier datasets is the lack of large-scale, high-quality 3D trajectory data. Some datasets provide accurate annotations but limited diversity, while large-scale egocentric video datasets contain rich real-world interactions but have noisy, weakly goal-directed trajectories and little temporal structure. Crucially, they lack explicit interaction stages such as approach and manipulation, which are essential for separating purposeful motion from background and for linking trajectories to intent. Models trained on such raw video typically generalize poorly because the connections between intent, spatial relations, and motion dynamics are missing.
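The summary credits flow matching for the smoother, more stable trajectories. Below is a minimal conditional flow-matching sketch in PyTorch that illustrates the general recipe (learn a velocity field that transports Gaussian noise to trajectories, conditioned on a context embedding such as reasoning tokens from a VLM). This is not the EgoMAN implementation; the names (`TrajectoryFlowNet`), the 16-step horizon, and the xyz+quaternion pose parameterization are all assumptions for illustration.

```python
# Minimal conditional flow-matching sketch for 6DoF trajectory generation (PyTorch).
# NOT the EgoMAN implementation; all names and dimensions here are hypothetical.

import torch
import torch.nn as nn

HORIZON = 16   # number of future steps (assumed)
POSE_DIM = 7   # 6DoF pose as xyz + quaternion (assumed parameterization)


class TrajectoryFlowNet(nn.Module):
    """Predicts the velocity field v(x_t, t, c) over a flattened trajectory."""

    def __init__(self, ctx_dim: int = 256, hidden: int = 512):
        super().__init__()
        traj_dim = HORIZON * POSE_DIM
        self.net = nn.Sequential(
            nn.Linear(traj_dim + 1 + ctx_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x_t, t, ctx):
        # x_t: (B, H*7) noisy trajectory, t: (B, 1) time in [0, 1], ctx: (B, ctx_dim)
        return self.net(torch.cat([x_t, t, ctx], dim=-1))


def flow_matching_loss(model, traj, ctx):
    """Rectified-flow style objective: regress the straight-line velocity x1 - x0."""
    x1 = traj.flatten(1)                      # clean trajectory, (B, H*7)
    x0 = torch.randn_like(x1)                 # Gaussian source sample
    t = torch.rand(x1.size(0), 1)             # random interpolation time
    x_t = (1 - t) * x0 + t * x1               # point on the linear path
    target_v = x1 - x0                        # constant velocity along the path
    pred_v = model(x_t, t, ctx)
    return ((pred_v - target_v) ** 2).mean()


@torch.no_grad()
def sample_trajectory(model, ctx, steps: int = 20):
    """Integrate the learned ODE from noise to a trajectory with Euler steps."""
    x = torch.randn(ctx.size(0), HORIZON * POSE_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.size(0), 1), i * dt)
        x = x + dt * model(x, t, ctx)
    return x.view(-1, HORIZON, POSE_DIM)      # (B, H, 7) 6DoF poses


if __name__ == "__main__":
    model = TrajectoryFlowNet()
    dummy_traj = torch.randn(4, HORIZON, POSE_DIM)   # placeholder ground truth
    dummy_ctx = torch.randn(4, 256)                   # placeholder reasoning tokens
    loss = flow_matching_loss(model, dummy_traj, dummy_ctx)
    loss.backward()
    print(loss.item(), sample_trajectory(model, dummy_ctx).shape)
```

In the paper's framing, the conditioning vector would come from the trajectory-token interface that bridges the VLM's stage-aware reasoning and the motion generator; here it is just a random placeholder.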
arXiv, 2025-12-18T18:59:01Z. DOI: 10.48550/arXiv.2512.16907
Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Abstract:
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.