Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

1University of Maryland 2Adobe Research 3Yonsei University

Customize-A-Video learns a new motion concept from a single reference video and transfers it to new subjects and scenes with both motion fidelity and motion diversity.

Generation Results

Abstract

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variety. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model to the specific motion in the reference video. To disentangle spatial and temporal information during training, we introduce the novel concept of appearance absorbers, which detach the original appearance from the single reference video prior to motion learning. The proposed modules are trained in a staged pipeline and applied in a plug-and-play fashion at inference, enabling easy extension of our method to various downstream tasks such as custom video generation and editing, video appearance customization, and multiple motion combination.
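
The abstract mentions attaching LoRA to the temporal attention layers of a frozen pre-trained T2V model. As a rough illustration of that idea only, the PyTorch sketch below wraps the projections of a toy temporal attention block with low-rank adapters so that just the LoRA weights remain trainable. The module and attribute names (ToyTemporalAttention, to_q/to_k/to_v/to_out) are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pre-trained T2V weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # LoRA starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class ToyTemporalAttention(nn.Module):
    """Stand-in for one temporal self-attention block of a T2V UNet."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.to_q, self.to_k, self.to_v, self.to_out = (nn.Linear(dim, dim) for _ in range(4))

    def forward(self, x):                      # x: (batch*height*width, frames, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)

def add_temporal_lora(block: nn.Module, rank: int = 4):
    """Wrap the q/k/v/out projections of a temporal attention block with LoRA."""
    for name in ("to_q", "to_k", "to_v", "to_out"):
        setattr(block, name, LoRALinear(getattr(block, name), rank=rank))
    return block

block = add_temporal_lora(ToyTemporalAttention())
print([n for n, p in block.named_parameters() if p.requires_grad])  # only LoRA down/up weights
```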

Comparison Results

Applications

DDIM Inverted Latent Input

Customize-A-Video can take DDIM-inverted latents of a source video as an additional control signal to perform precise per-frame video editing.
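
As a rough sketch of how such a DDIM-inverted latent input could be obtained, the snippet below runs the deterministic DDIM update in reverse to map the source video latents back to noise, which can then serve as the initial latent for sampling with the motion module. The noise predictor, schedule, and shapes are illustrative assumptions, not the released pipeline.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, eps_model, alphas_cumprod, timesteps, cond=None):
    """Map clean video latents to noise by running the DDIM update in reverse."""
    x = latents
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):        # ascending noise levels
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev, cond)                         # noise predicted at the current level
        x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()     # implied clean latent
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps             # deterministic step to the next level
    return x

# Toy usage with a dummy noise predictor; real use would call the frozen T2V UNet.
alphas = torch.linspace(0.9999, 0.05, 1000)                      # illustrative cumulative alphas
video_latents = torch.randn(1, 4, 16, 32, 32)                    # (batch, channels, frames, h, w)
inverted = ddim_invert(video_latents, lambda x, t, c: torch.zeros_like(x), alphas, list(range(0, 1000, 50)))
```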

Video Appearance Customization

Customize-A-Video is compatible with image customization modules, enabling videos to be customized both spatially and temporally.
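
A minimal sketch of this plug-and-play combination, assuming both the image customization module and the motion module are LoRA-style weight deltas: the appearance LoRA targets spatial attention projections while the Temporal LoRA targets temporal ones, so they can be merged into the same UNet state dict with independent scales. The key names below are made up for illustration.

```python
import torch

def merge_lora(state_dict, lora_deltas, scale=1.0):
    """Return a copy of state_dict with W += scale * (up @ down) for matching keys."""
    merged = {k: v.clone() for k, v in state_dict.items()}
    for key, (down, up) in lora_deltas.items():
        merged[key] += scale * (up @ down)
    return merged

# Toy state dict with one spatial and one temporal projection weight.
weights = {"spatial_attn.to_q.weight": torch.randn(64, 64),
           "temporal_attn.to_q.weight": torch.randn(64, 64)}
appearance_lora = {"spatial_attn.to_q.weight": (torch.randn(4, 64), torch.randn(64, 4))}   # new subject
motion_lora = {"temporal_attn.to_q.weight": (torch.randn(4, 64), torch.randn(64, 4))}      # reference motion
weights = merge_lora(merge_lora(weights, appearance_lora, scale=0.8), motion_lora, scale=1.0)
```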

Multiple Reference Motion Combination

Multiple Customize-A-Video modules can work together to generate videos that combine multiple target motions.
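
A sketch of one way such a combination could work, assuming each trained motion module contributes a low-rank (down, up) pair to the same frozen temporal projection, with per-motion scales balancing their relative strengths. This is illustrative composition logic, not the authors' inference code.

```python
import torch
import torch.nn as nn

class MultiMotionLinear(nn.Module):
    """A frozen base projection plus a weighted sum of several low-rank residuals."""
    def __init__(self, base: nn.Linear, loras, scales):
        super().__init__()                         # loras: list of (down: (r, in), up: (out, r)) pairs
        self.base = base.requires_grad_(False)
        self.loras = [(d.clone(), u.clone()) for d, u in loras]
        self.scales = list(scales)

    def forward(self, x):
        y = self.base(x)
        for (down, up), s in zip(self.loras, self.scales):
            y = y + s * (x @ down.t() @ up.t())    # low-rank residual from one motion module
        return y

# Toy usage: two single-motion LoRAs combined on one temporal projection.
dim, rank = 64, 4
lora_a = (torch.randn(rank, dim), torch.randn(dim, rank))
lora_b = (torch.randn(rank, dim), torch.randn(dim, rank))
proj = MultiMotionLinear(nn.Linear(dim, dim), [lora_a, lora_b], scales=[1.0, 0.7])
out = proj(torch.randn(2, 16, dim))                # (batch*pixels, frames, dim)
```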

Third-Party Appearance Absorbers

Thanks to its staged training pipeline, the Temporal LoRA of Customize-A-Video can be trained on top of third-party appearance absorbers tuned on other image data, provided they share a similar appearance with the reference video.
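
A rough sketch of what the motion-learning stage could look like under this setup: the third-party appearance absorber is loaded and kept frozen inside the T2V UNet, and only the Temporal LoRA parameters receive gradients from the usual noise-prediction loss on the reference video. All names, shapes, and the toy stand-in network are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def temporal_lora_step(unet, alphas_cumprod, latents, text_emb, optimizer):
    """One denoising-loss step; the optimizer holds only Temporal LoRA parameters."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, len(alphas_cumprod), (latents.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise         # forward diffusion on video latents
    loss = F.mse_loss(unet(noisy, t, text_emb), noise)          # appearance absorber already loaded and frozen
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage with a stand-in network; real use would pass the T2V UNet whose
# appearance absorber is frozen and whose Temporal LoRA is trainable.
toy_unet = torch.nn.Conv3d(4, 4, kernel_size=1)
opt = torch.optim.AdamW(toy_unet.parameters(), lr=1e-4)
alphas = torch.linspace(0.9999, 0.05, 1000)
latents = torch.randn(2, 4, 8, 16, 16)                          # (batch, channels, frames, h, w)
temporal_lora_step(lambda x, t, c: toy_unet(x), alphas, latents, None, opt)
```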

Concurrent Work

There's a lot of excellent work that was introduced around the same time as ours.

Customizing Motion in Text-to-Video Diffusion Models finetunes temporal layers in place with special tokens, in a DreamBooth manner.

VMC also finetunes temporal layers in place, replacing the standard objective with a frame residual vector loss.

MotionCrafter employs two parallel UNets and tunes one of them with an additional appearance normalization loss.

DreamVideo adds specially designed adapters over temporal attention layers, conditioned on a single frame, to decouple pure motion from appearance.

MotionDirector applies dual-path LoRAs to spatial and temporal attention layers and trains them jointly with an appearance-debiased temporal loss.