Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

1University of Maryland 2Adobe Research 3Yonsei University

Customize-A-Video learns a new motion concept from a single reference video and transfers it to new subjects and scenes with both motion fidelity and motion diversity.

Generation Results

Abstract

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variety. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model to the specific motion in the reference video. To disentangle spatial and temporal information during training, we introduce the novel concept of appearance absorbers, which detach the original appearance from the single reference video prior to motion learning. The proposed modules are trained in a staged pipeline and applied in a plug-and-play fashion at inference, enabling easy extension of our method to various downstream tasks such as custom video generation and editing, video appearance customization, and multiple motion combination.
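
The abstract mentions attaching LoRA to the temporal attention layers of a frozen pre-trained T2V model. As a rough illustration of that idea only, the PyTorch sketch below wraps the projections of a toy temporal attention block with low-rank adapters so that just the LoRA weights remain trainable. The module and attribute names (ToyTemporalAttention, to_q/to_k/to_v/to_out) are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pre-trained T2V weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # LoRA starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class ToyTemporalAttention(nn.Module):
    """Stand-in for one temporal self-attention block of a T2V UNet."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.to_q, self.to_k, self.to_v, self.to_out = (nn.Linear(dim, dim) for _ in range(4))

    def forward(self, x):                      # x: (batch*height*width, frames, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)

def add_temporal_lora(block: nn.Module, rank: int = 4):
    """Wrap the q/k/v/out projections of a temporal attention block with LoRA."""
    for name in ("to_q", "to_k", "to_v", "to_out"):
        setattr(block, name, LoRALinear(getattr(block, name), rank=rank))
    return block

block = add_temporal_lora(ToyTemporalAttention())
print([n for n, p in block.named_parameters() if p.requires_grad])  # only LoRA down/up weights
```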

Comparison Results

Applications

DDIM Inverted Latent Input

Customize-A-Video can take DDIM-inverted latents of a source video as an additional control signal to perform precise per-frame video editing.
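
As a rough sketch of how such a DDIM-inverted latent input could be obtained, the snippet below runs the deterministic DDIM update in reverse to map the source video latents back to noise, which can then serve as the initial latent for sampling with the motion module. The noise predictor, schedule, and shapes are illustrative assumptions, not the released pipeline.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, eps_model, alphas_cumprod, timesteps, cond=None):
    """Map clean video latents to noise by running the DDIM update in reverse."""
    x = latents
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):        # ascending noise levels
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev, cond)                         # noise predicted at the current level
        x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()     # implied clean latent
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps             # deterministic step to the next level
    return x

# Toy usage with a dummy noise predictor; real use would call the frozen T2V UNet.
alphas = torch.linspace(0.9999, 0.05, 1000)                      # illustrative cumulative alphas
video_latents = torch.randn(1, 4, 16, 32, 32)                    # (batch, channels, frames, h, w)
inverted = ddim_invert(video_latents, lambda x, t, c: torch.zeros_like(x), alphas, list(range(0, 1000, 50)))
```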

Video Appearance Customization

Customize-A-Video is compatible with image customization modules, enabling videos to be customized both spatially and temporally.
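
A minimal sketch of this plug-and-play combination, assuming both the image customization module and the motion module are LoRA-style weight deltas: the appearance LoRA targets spatial attention projections while the Temporal LoRA targets temporal ones, so they can be merged into the same UNet state dict with independent scales. The key names below are made up for illustration.

```python
import torch

def merge_lora(state_dict, lora_deltas, scale=1.0):
    """Return a copy of state_dict with W += scale * (up @ down) for matching keys."""
    merged = {k: v.clone() for k, v in state_dict.items()}
    for key, (down, up) in lora_deltas.items():
        merged[key] += scale * (up @ down)
    return merged

# Toy state dict with one spatial and one temporal projection weight.
weights = {"spatial_attn.to_q.weight": torch.randn(64, 64),
           "temporal_attn.to_q.weight": torch.randn(64, 64)}
appearance_lora = {"spatial_attn.to_q.weight": (torch.randn(4, 64), torch.randn(64, 4))}   # new subject
motion_lora = {"temporal_attn.to_q.weight": (torch.randn(4, 64), torch.randn(64, 4))}      # reference motion
weights = merge_lora(merge_lora(weights, appearance_lora, scale=0.8), motion_lora, scale=1.0)
```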

Multiple Reference Motion Combination

Multiple Customize-A-Video modules can work together to generate videos that combine multiple target motions.
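
A sketch of one way such a combination could work, assuming each trained motion module contributes a low-rank (down, up) pair to the same frozen temporal projection, with per-motion scales balancing their relative strengths. This is illustrative composition logic, not the authors' inference code.

```python
import torch
import torch.nn as nn

class MultiMotionLinear(nn.Module):
    """A frozen base projection plus a weighted sum of several low-rank residuals."""
    def __init__(self, base: nn.Linear, loras, scales):
        super().__init__()                         # loras: list of (down: (r, in), up: (out, r)) pairs
        self.base = base.requires_grad_(False)
        self.loras = [(d.clone(), u.clone()) for d, u in loras]
        self.scales = list(scales)

    def forward(self, x):
        y = self.base(x)
        for (down, up), s in zip(self.loras, self.scales):
            y = y + s * (x @ down.t() @ up.t())    # low-rank residual from one motion module
        return y

# Toy usage: two single-motion LoRAs combined on one temporal projection.
dim, rank = 64, 4
lora_a = (torch.randn(rank, dim), torch.randn(dim, rank))
lora_b = (torch.randn(rank, dim), torch.randn(dim, rank))
proj = MultiMotionLinear(nn.Linear(dim, dim), [lora_a, lora_b], scales=[1.0, 0.7])
out = proj(torch.randn(2, 16, dim))                # (batch*pixels, frames, dim)
```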

Third-Party Appearance Absorbers

Thanks to its staged training pipeline, the Temporal LoRA of Customize-A-Video can be trained on top of third-party appearance absorbers tuned on other image data, provided they share a similar appearance with the reference video.
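
A rough sketch of what the motion-learning stage could look like under this setup: the third-party appearance absorber is loaded and kept frozen inside the T2V UNet, and only the Temporal LoRA parameters receive gradients from the usual noise-prediction loss on the reference video. All names, shapes, and the toy stand-in network are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def temporal_lora_step(unet, alphas_cumprod, latents, text_emb, optimizer):
    """One denoising-loss step; the optimizer holds only Temporal LoRA parameters."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, len(alphas_cumprod), (latents.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise         # forward diffusion on video latents
    loss = F.mse_loss(unet(noisy, t, text_emb), noise)          # appearance absorber already loaded and frozen
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage with a stand-in network; real use would pass the T2V UNet whose
# appearance absorber is frozen and whose Temporal LoRA is trainable.
toy_unet = torch.nn.Conv3d(4, 4, kernel_size=1)
opt = torch.optim.AdamW(toy_unet.parameters(), lr=1e-4)
alphas = torch.linspace(0.9999, 0.05, 1000)
latents = torch.randn(2, 4, 8, 16, 16)                          # (batch, channels, frames, h, w)
temporal_lora_step(lambda x, t, c: toy_unet(x), alphas, latents, None, opt)
```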

Concurrent Work

There's a lot of excellent work that was introduced around the same time as ours.

Customizing Motion in Text-to-Video Diffusion Models finetunes temporal layers in place with special tokens, in a DreamBooth manner.

VMC also finetunes temporal layers in place, replacing the standard objective with a frame residual vector loss.

MotionCrafter employs two parallel UNets and tunes one of them with an additional appearance normalization loss.

DreamVideo adds specially designed adapters over temporal attention layers, conditioned on a single frame, to decouple pure motion from appearance.

MotionDirector applies dual-path LoRAs to spatial and temporal attention layers and trains them jointly with an appearance-debiased temporal loss.