Learning to Act from Actionless Videos
through Dense Correspondences

Po-Chen Ko
Jiayuan Mao
Yilun Du
Shao-Hua Sun
Joshua B. Tenenbaum


Framework Overview




Real-World Franka Emika Panda Arm with Bridge Dataset


We train our video generation model on the Bridge dataset (Ebert et al., 2022) and evaluate on a real-world Franka Emika Panda tabletop manipulation environment.

Synthesized Videos


Robot Executions

Task: put apple in plate
Task: put banana in plate
Task: put peach in blue bowl


Meta-World


We train our video generation model on 165 videos spanning 11 tasks and evaluate on robot manipulation tasks in the Meta-World (Yu et al., 2019) simulated benchmark.

Synthesized Videos


Robot Executions

Task: Assembly
Task: Door Open
Task: Hammer
Task: Shelf Place


iTHOR


We train our video generation model on 240 videos covering 12 target objects and evaluate on object navigation tasks in the iTHOR (Kolve et al., 2017) simulated benchmark.

Synthesized Videos


Robot Navigation

Task: Pillow
Task: Soap Bar
Task: Television
Task: Toaster


Cross-Embodiment Learning (Visual Pusher)


We train our video generation model on ~200 actionless human pushing videos and evaluate in the Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) robot environment.

Failed Executions


input image
video plan
execution


input image
video plan
execution


Successful Executions


input image
video plan
execution


input image
video plan
execution



Zero-Shot Generalization to Real-World Scenes with the Bridge Model


We show that our video diffusion model trained on the Bridge data (mostly toy kitchens) can already generalize to complex real-world kitchen scenes. Note that the videos are blurry because the original video resolution is low (48x64).

Task: pick up banana
generated video
Task: put lid on pot
generated video


Task: put pot in sink
generated video


Extended Analysis and Ablation Studies


Comparison of First-Frame Conditioning Strategies

We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when trained on the Bridge dataset. Below we provide qualitative examples of videos synthesized after 40k training steps.
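
To make the comparison concrete, below is a minimal sketch of how the first frame could be attached to the noisy video tensor under each strategy; the tensor shapes, function names, and PyTorch framing are illustrative assumptions rather than our exact training code.

import torch

# Sketch of the two first-frame conditioning strategies (assumed shapes):
#   video: (B, T, C, H, W) noisy video frames fed to the diffusion model
#   first: (B, C, H, W)    clean first frame used as the condition

def cat_c(video: torch.Tensor, first: torch.Tensor) -> torch.Tensor:
    """Channel-wise conditioning (cat_c): tile the first frame across time
    and concatenate it with every noisy frame along the channel axis."""
    B, T, C, H, W = video.shape
    cond = first.unsqueeze(1).expand(B, T, C, H, W)       # (B, T, C, H, W)
    return torch.cat([video, cond], dim=2)                # (B, T, 2C, H, W)

def cat_t(video: torch.Tensor, first: torch.Tensor) -> torch.Tensor:
    """Frame-wise conditioning (cat_t): prepend the first frame as an
    extra frame along the temporal axis."""
    return torch.cat([first.unsqueeze(1), video], dim=1)  # (B, T+1, C, H, W)

if __name__ == "__main__":
    video = torch.randn(2, 8, 3, 48, 64)   # 8-frame video at 48x64
    first = torch.randn(2, 3, 48, 64)
    print(cat_c(video, first).shape)       # torch.Size([2, 8, 6, 48, 64])
    print(cat_t(video, first).shape)       # torch.Size([2, 9, 3, 48, 64])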


Improving Inference Efficiency with Denoising Diffusion Implicit Models

This section investigates accelerating the sampling process using Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). To this end, instead of iteratively denoising for 100 steps, as reported in the main paper, we experiment with different numbers of denoising steps (e.g., 25, 10, 5, 3) using DDIM. We find that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps), enabling applications where running time is critical. We present the videos synthesized with 25, 10, 5, and 3 denoising steps below.

DDIM 25 steps: The quality of the synthesized videos is satisfactory despite minor temporal inconsistencies (the gripper or objects occasionally disappear or are duplicated) compared to the DDPM (100 steps) videos reported in the previous section.


DDIM 10 steps: The quality of the synthesized videos is similar to that of the videos generated with 25 steps.


DDIM 5 steps: The temporal inconsistency issue is more severe with only 5 denoising steps.


DDIM 3 steps: The temporal inconsistency issue is even more pronounced, and some objects are blurry.
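
As a reference, below is a minimal sketch of deterministic DDIM sampling (eta = 0) with a reduced number of denoising steps; the noise-predictor interface, variable names, and noise schedule are illustrative assumptions and not the exact implementation in our codebase.

import torch

@torch.no_grad()
def ddim_sample(eps_model, x_T, alphas_cumprod, num_steps=10):
    """Deterministic DDIM sampling with far fewer steps than training.

    eps_model(x_t, t) is assumed to predict the noise in x_t at timestep t;
    alphas_cumprod holds the cumulative products of (1 - beta_t) from training.
    """
    T = alphas_cumprod.shape[0]
    # Evenly spaced subset of the training timesteps, e.g. 10 out of 100.
    timesteps = torch.linspace(T - 1, 0, num_steps).long()

    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)

        eps = eps_model(x, t)
        # Predict the clean sample x_0 from the current noisy sample.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Jump directly to the previous timestep in the reduced schedule.
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

if __name__ == "__main__":
    # Dummy noise predictor standing in for the video diffusion model.
    dummy_model = lambda x, t: torch.zeros_like(x)
    alphas_cumprod = torch.linspace(0.999, 0.01, 100)   # assumed noise schedule
    x_T = torch.randn(1, 8, 3, 48, 64)                  # (B, T, C, H, W) Gaussian noise
    print(ddim_sample(dummy_model, x_T, alphas_cumprod, num_steps=10).shape)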


BibTex

                
@article{Ko2023Learning,
  title={{Learning to Act from Actionless Videos through Dense Correspondences}},
  author={Ko, Po-Chen and Mao, Jiayuan and Du, Yilun and Sun, Shao-Hua and Tenenbaum, Joshua B},
  journal={arXiv:2310.08576},
  year={2023},
}