EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations
EgoEngine Teaser
Abstract

Dexterous manipulation is bottlenecked by the cost of collecting large-scale, physically feasible demonstrations. Egocentric human videos offer abundant manipulation behaviors at scale, but transforming these human videos into robot-executable demonstrations is challenging due to embodiment appearance differences and feasibility constraints. We propose EgoEngine, a scalable framework for human video-conditioned robot data generation that simultaneously addresses the human-to-robot visual gap and action gap. Given an egocentric RGB manipulation video, EgoEngine generates paired robot demonstrations: (i) a high-fidelity robot observation video aligned with the generated actions and scene context, and (ii) an executable action sequence computed via trajectory optimization that tracks task intent while enforcing kinematic, dynamic, and contact constraints. Our experiments show that EgoEngine enables robust large-scale dataset construction from human videos and achieves zero-shot downstream policy learning without any real-robot demonstrations.
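To make the action branch concrete, the sketch below shows the general shape of constraint-aware trajectory tracking: minimize deviation from a human-derived reference while respecting joint limits and velocity bounds. This is a minimal toy illustration on a single joint, not EgoEngine's actual solver; the horizon, limits, and cost weights are all made-up placeholder values.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch: track a reference joint trajectory (a stand-in for retargeted
# human motion) while enforcing joint-limit and per-step velocity constraints.
# All names and numbers are illustrative, not EgoEngine's real formulation.

T = 20                                  # horizon length (timesteps)
reference = np.linspace(0.0, 1.5, T)    # "human" joint angles (rad)
q_min, q_max = -0.5, 1.2                # joint limits (reference exceeds q_max)
v_max = 0.2                             # per-step velocity limit (rad/step)

def tracking_cost(q):
    # Stay close to the human-derived reference, plus a smoothness penalty.
    return np.sum((q - reference) ** 2) + 0.1 * np.sum(np.diff(q) ** 2)

# Velocity constraints |q[t+1] - q[t]| <= v_max, written as two smooth
# inequality vectors (SLSQP expects fun(q) >= 0).
constraints = [
    {"type": "ineq", "fun": lambda q: v_max - np.diff(q)},
    {"type": "ineq", "fun": lambda q: v_max + np.diff(q)},
]
bounds = [(q_min, q_max)] * T

res = minimize(tracking_cost, reference.clip(q_min, q_max),
               bounds=bounds, constraints=constraints, method="SLSQP")
q_opt = res.x  # feasible trajectory: within limits, close to the reference
```

The result deviates from the reference only where the reference is infeasible (here, where it exceeds the joint limit), which is the basic behavior one wants from feasibility-enforcing refinement.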

Visual Branch

We first show the visual fidelity of generated observations, followed by dataset-scale visualizations from TACO and Aria.

Visual Comparison

Comparison across Human, EgoMimic, VACE (WAN2.1), Masquerade, and EgoEngine.

Visual Generation

Scroll to move between tasks. Each task shows Human Videos -> Simulation -> Inpainted.

Real-World Visualization

Slide horizontally to move between tasks. Each slide contains one full task.
Left: original (human foreground). Right: inpainted (robot foreground).
Note: The robot has since been upgraded, and some demos here were collected with Aria Gen 1 glasses.

Action Branch

We next present executable action outputs produced by EgoEngine on both TACO and Aria tasks.

Human video
Refined action replay
Simulation

EgoEngine converts an egocentric human video into a digital-twin rollout and refines it into an executable real-robot trajectory for long-horizon dexterous tool use.

Safety note: we add a z-axis offset and place a sponge under the knife to avoid collision with the table.

Refined Action

TACO on top and Aria below. Each track scrolls independently.

Real World

Finally, we show real-world task results, including paired original human observations and generated robot observations.

Policy rollout

Teleop