Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

ECCV 2024

¹University of Maryland, ²Adobe Research, ³Yonsei University

Customize-A-Video customizes a pre-trained video diffusion model with a new motion concept learned from a single reference video, and transfers that motion to new subjects and scenes with both motion fidelity and motion diversity.

Generation Results

Abstract

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, the temporal counterpart of image customization, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variety. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and applied at inference in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization, and multiple-motion combination.
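To make the low-rank adaptation idea in the abstract concrete, here is a minimal sketch of a LoRA-augmented linear projection in plain Python. It is an illustration only, not the authors' implementation: the class name `LoRALinear`, the helper `matmul`, the toy weight `W`, and the `alpha / rank` scaling are assumptions chosen to mirror the standard LoRA formulation, in which a frozen weight W receives a trainable update B·A of rank at most r.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight              # frozen pre-trained projection
        self.scale = alpha / rank         # assumed scaling convention
        # A gets a small random init and B starts at zero,
        # so the layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)    # out x in, rank at most r
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """x holds input row vectors (n x in); returns n x out."""
        w_t = [list(col) for col in zip(*self.effective_weight())]  # in x out
        return matmul(x, w_t)

# Toy 2x3 matrix standing in for one temporal attention projection.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(W, rank=1)
x = [[1.0, 2.0, 3.0]]
y = layer.forward(x)   # matches the frozen projection until B is trained
```

Only `A` and `B` would be updated during motion learning in such a setup, which keeps the number of trainable parameters small relative to the frozen model.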

Comparison Results

Applications

DDIM Inverted Latent Input

Customize-A-Video can be combined with additional control signals, such as DDIM-inverted latents, to produce precise per-frame video edits.
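For readers unfamiliar with DDIM inversion, the sketch below shows the deterministic DDIM update run toward higher noise levels, which maps a clean latent to a noisy one that can later be denoised again. It is a generic illustration under stated assumptions, not code from this work: latents are short lists of floats, `alphas_cumprod` is a hypothetical decreasing noise schedule, and `toy_eps` is a constant stand-in for the diffusion model's noise predictor.

```python
import math

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha levels.

    Moving to a smaller alpha (more noise) is inversion; moving to a
    larger alpha is ordinary denoising. x and eps are lists of floats.
    """
    x0 = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
          for xi, ei in zip(x, eps)]
    return [math.sqrt(a_to) * x0i + math.sqrt(1.0 - a_to) * ei
            for x0i, ei in zip(x0, eps)]

def ddim_invert(x0, eps_fn, alphas_cumprod):
    """Map a clean latent to a noisy one by stepping toward higher noise."""
    x = list(x0)
    levels = [1.0] + list(alphas_cumprod)   # 1.0 means "no noise"
    for t in range(len(alphas_cumprod)):
        eps = eps_fn(x, t)                  # noise predicted at the current latent
        x = ddim_step(x, eps, levels[t], levels[t + 1])
    return x

alphas = [0.9, 0.5, 0.1]                    # hypothetical schedule
toy_eps = lambda x, t: [0.5] * len(x)       # hypothetical noise predictor
latent = [1.0, -2.0]
noisy = ddim_invert(latent, toy_eps, alphas)
```

With a real noise predictor the inversion is only approximate, because the predicted noise depends on the latent at each step; with the constant toy predictor above, running the same steps backwards recovers the input exactly.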

Video Appearance Customization

Customize-A-Video is compatible with image customization modules, so videos can be customized both spatially and temporally.

Multiple Reference Motion Combination

Multiple Customize-A-Video modules can collaborate to generate videos with multiple target motions combined.
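The page does not spell out how the modules are combined. One common way to compose several LoRA modules, shown here purely as an assumption, is to add each module's low-rank update to the same frozen weight with its own strength. The function names `lora_delta` and `combine_motions`, the toy matrices, and the strengths are all hypothetical.

```python
def lora_delta(B, A, scale=1.0):
    """Low-rank update scale * (B @ A), with matrices as lists of rows."""
    return [[scale * sum(b * a for b, a in zip(b_row, a_col)) for a_col in zip(*A)]
            for b_row in B]

def combine_motions(weight, modules, strengths):
    """Add several LoRA updates to one frozen weight, each with its own strength."""
    merged = [row[:] for row in weight]     # copy so the frozen weight is untouched
    for (B, A), s in zip(modules, strengths):
        delta = lora_delta(B, A, s)
        for i, d_row in enumerate(delta):
            for j, d in enumerate(d_row):
                merged[i][j] += d
    return merged

frozen = [[1.0, 0.0], [0.0, 1.0]]
motion_a = ([[1.0], [0.0]], [[0.0, 1.0]])   # (B, A) of a first toy motion module
motion_b = ([[0.0], [1.0]], [[1.0, 0.0]])   # (B, A) of a second toy motion module
merged = combine_motions(frozen, [motion_a, motion_b], [0.5, 0.25])
```

In such a scheme the per-module strengths would trade off how strongly each reference motion shows up in the generated video.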

Third-Party Appearance Absorbers

Customize-A-Video features a staged training pipeline, so its Temporal LoRA can be trained on top of third-party appearance absorbers that were tuned on other image data, provided those images share a similar appearance with the reference video.
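The staged pipeline can be pictured as a small schedule of which add-on modules are loaded and which are trainable at each step. The sketch below is a reading of the description on this page, not the released code: the stage names and the `Module` and `configure` helpers are hypothetical, and dropping the absorber at inference is an assumption based on the "plug-and-play" wording.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    loaded: bool = False
    trainable: bool = False

def configure(stage, absorber, temporal_lora):
    """Set which add-on modules are loaded and trainable for a given stage."""
    if stage == "absorber_training":    # stage 1: fit appearance on still frames
        absorber.loaded, absorber.trainable = True, True
        temporal_lora.loaded, temporal_lora.trainable = False, False
    elif stage == "motion_training":    # stage 2: absorber frozen, motion learned
        absorber.loaded, absorber.trainable = True, False
        temporal_lora.loaded, temporal_lora.trainable = True, True
    elif stage == "inference":          # assumed: only the motion module is kept
        absorber.loaded, absorber.trainable = False, False
        temporal_lora.loaded, temporal_lora.trainable = True, False
    else:
        raise ValueError(f"unknown stage: {stage}")
    return absorber, temporal_lora

# With a third-party absorber, stage 1 is skipped and training starts at stage 2.
absorber = Module("third_party_absorber")
t_lora = Module("temporal_lora")
configure("motion_training", absorber, t_lora)
```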

Concurrent Work

There's a lot of excellent work that was introduced around the same time as ours.

Customizing Motion in Text-to-Video Diffusion Models finetunes temporal layers in place with special tokens, in a DreamBooth-style manner.

VMC also finetunes temporal layers in place, replacing the training loss with one defined on frame residual vectors.

MotionCrafter employs two parallel UNets and tunes one of them with an additional appearance normalization loss.

DreamVideo adds adapters on top of the temporal attention layers, conditioned on a frame, to decouple motion from appearance.

MotionDirector applies dual-path LoRAs on spatial and temporal attentions and trains them jointly with appearance-debiased temporal losses.

BibTeX

@article{ren2024customize,
  title={Customize-a-video: One-shot motion customization of text-to-video diffusion models},
  author={Ren, Yixuan and Zhou, Yang and Yang, Jimei and Shi, Jing and Liu, Difan and Liu, Feng and Kwon, Mingi and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2402.14780},
  year={2024}
}