EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations
Abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but using them directly for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable actions. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video in which the human is replaced by a robot while scene context and temporal alignment are preserved, and (ii) a task-aligned, executable robot action trajectory that satisfies feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations.

EgoEngine Teaser
Method Overview
EgoEngine pipeline overview

EgoEngine is a scalable data engine that converts egocentric human videos into robot demonstrations. Given an egocentric human video, EgoEngine constructs a digital twin and jointly produces (1) a high-fidelity, temporally consistent robot observation video and (2) executable action trajectories aligned with the video. The generated demonstrations serve as training data for downstream visuomotor policies, enabling zero-shot execution.
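To make the pairing of observations and actions concrete, the following is a minimal sketch of how one generated demonstration could be represented and checked for alignment. The class name `RobotDemonstration`, its fields, and the reading of "aligned with the video" as one action per frame are illustrative assumptions; this page does not specify EgoEngine's actual data format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class RobotDemonstration:
    """Hypothetical container for one generated demonstration.

    Field names are assumptions made for illustration only.
    """
    frame_timestamps: List[float]    # seconds, one entry per generated video frame
    actions: List[Sequence[float]]   # one action vector per frame (e.g. joint targets)

    def validate(self) -> None:
        # Assumption: alignment means exactly one action per video frame.
        if len(self.frame_timestamps) != len(self.actions):
            raise ValueError("frames and actions must have the same length")
        # Timestamps must strictly increase to form a consistent sequence.
        pairs = zip(self.frame_timestamps, self.frame_timestamps[1:])
        for earlier, later in pairs:
            if later <= earlier:
                raise ValueError("timestamps must be strictly increasing")
        # Every action vector should share one dimensionality.
        if len({len(a) for a in self.actions}) > 1:
            raise ValueError("all actions must have the same dimension")

    def training_pairs(self) -> List[Tuple[float, Sequence[float]]]:
        """Return (timestamp, action) pairs after validating alignment."""
        self.validate()
        return list(zip(self.frame_timestamps, self.actions))
```

A downstream policy trainer could call `training_pairs()` on each demonstration and reject any record that fails validation before it reaches the training set.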

Visual Generation

We present qualitative results for robot observation generation, followed by dataset-scale examples from TACO and Aria.

Qualitative Comparison

From left to right: human input, EgoMimic, VACE (WAN2.1), Masquerade, and EgoEngine.

Dataset-Scale Visualizations

Each task shows the input egocentric video, the corresponding simulation rollout, and the observation generated by EgoEngine.

Real-World Observation Transfer

Each task contains paired real-world examples comparing the human foreground observations with the robot foreground observations generated by EgoEngine.
Some examples were collected with an earlier robot configuration and Aria Gen 1 glasses.

Action Generation

We present executable action trajectories produced by EgoEngine on both TACO and Aria tasks.

Human video
Refined trajectory replay
Simulation

EgoEngine maps an egocentric human video to a digital-twin rollout and refines it into an executable real-robot trajectory for long-horizon dexterous manipulation.

For safety, we add a z-axis offset and place a sponge beneath the knife to avoid collision with the table.
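The safety offset described above can be sketched as a uniform vertical shift of the end-effector waypoints, followed by a clearance check. The function name `apply_z_offset`, the waypoint format, and the `table_z` parameter are hypothetical; the page does not state the offset value or how it is applied in practice.

```python
from typing import List, Tuple

# Assumed waypoint format: end-effector position (x, y, z) in meters.
Waypoint = Tuple[float, float, float]


def apply_z_offset(
    trajectory: List[Waypoint],
    z_offset: float,
    table_z: float = 0.0,
) -> List[Waypoint]:
    """Raise every waypoint by z_offset and check it stays above the table."""
    shifted = [(x, y, z + z_offset) for x, y, z in trajectory]
    for x, y, z in shifted:
        # Reject the trajectory if any shifted waypoint is still below the table.
        if z < table_z:
            raise ValueError(f"waypoint z={z:.3f} is below table height {table_z:.3f}")
    return shifted
```

Applying the same offset to every waypoint preserves the shape of the motion, which is why a uniform shift is used in this sketch rather than clamping individual points.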

Trajectory Refinement

The top row shows TACO tasks and the bottom row shows Aria tasks.

Policy Evaluation

We present real-world policy rollouts for tasks learned from EgoEngine-generated demonstrations.

Real-World Rollouts

Teleop 3x
EgoEngine 1x

Rollout Performance

Representative quantitative comparison. Additional results are provided in the paper.

Policy success rate comparison on four Aria tasks
Policy success rate (SR) on four Aria tasks. EgoEngine enables zero-shot real-robot policy learning from egocentric human videos. It substantially outperforms direct retargeting from human videos and prior baselines such as Phantom, and it approaches the performance of real-robot teleoperation on several tasks.
Ablation of visual and action generation for zero-shot policy learning
Ablation results show that executable action generation provides the dominant gain, while visual generation offers an additional improvement. Together, the two branches yield the strongest zero-shot policy learning performance.