Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variety. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for modeling the specific motion of the reference video. To disentangle spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the single reference video prior to motion learning. The proposed modules are trained in a staged pipeline and applied in a plug-and-play fashion at inference, enabling easy extension of our method to various downstream tasks such as custom video generation and editing, video appearance customization, and multiple motion combination.
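To make the core idea concrete, below is a minimal sketch of the Temporal LoRA stage in PyTorch. It assumes a diffusers-style T2V UNet whose temporal attention blocks expose `to_q`/`to_k`/`to_v` linear projections; the module-name heuristic, rank, and scaling are illustrative assumptions rather than the released implementation.

```python
# Minimal Temporal LoRA sketch, assuming a diffusers-style T2V UNet whose temporal
# attention blocks expose to_q / to_k / to_v nn.Linear projections. The name
# heuristic, rank, and scaling below are illustrative, not the released code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank residual: y = Wx + s * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.down.weight, std=1e-4)
        nn.init.zeros_(self.up.weight)                   # start as an identity residual
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale

def inject_temporal_lora(unet: nn.Module, rank: int = 4):
    """Wrap q/k/v projections of temporal attention blocks; return the trainable params."""
    blocks = [m for n, m in unet.named_modules()
              if "temp" in n and hasattr(m, "to_q")]     # heuristic match on temporal attention
    params = []
    for block in blocks:
        for proj in ("to_q", "to_k", "to_v"):
            wrapped = LoRALinear(getattr(block, proj), rank=rank)
            setattr(block, proj, wrapped)
            params += [wrapped.down.weight, wrapped.up.weight]
    return params  # only these are optimized against the reference video
```

In the staged pipeline, the appearance absorber is tuned first on the reference frames as image data so that it soaks up the original appearance, and only the temporal low-rank parameters above are then optimized to capture the motion.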
Customize-A-Video can cooperate with additional control signals to perform precise per-frame video editing.
Customize-A-Video is compatible with image customization modules to customize videos both spatially and temporally.
Multiple Customize-A-Video modules can collaborate to generate videos that combine multiple target motions, as sketched below.
Customize-A-Video features a staged training pipeline, so its Temporal LoRA can be trained with third-party appearance absorbers tuned on other image data, as long as they share similar appearances.
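As a rough illustration of the plug-and-play usage above, the sketch below folds one or more saved low-rank deltas into the frozen projection weights at inference time. The file names, state-dict key layout, and merge routine are assumptions for illustration, not the released interface.

```python
# Plug-and-play composition sketch: fold saved LoRA deltas into the frozen UNet.
# Assumes each module is saved as {"<layer>.down.weight": A, "<layer>.up.weight": B};
# file names and key layout are illustrative.
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(unet: nn.Module, lora_state: dict, weight: float = 1.0):
    """Add weight * (B @ A) onto every projection that has a matching LoRA entry."""
    for name, module in unet.named_modules():
        down = lora_state.get(f"{name}.down.weight")
        up = lora_state.get(f"{name}.up.weight")
        if down is not None and up is not None and isinstance(module, nn.Linear):
            module.weight += weight * (up @ down)        # W <- W + s * B A

# Example: combine two motion modules and an image-customization LoRA, each scaled
# independently (paths are hypothetical).
# merge_lora(unet, torch.load("motion_a_temporal_lora.pt"), weight=0.8)
# merge_lora(unet, torch.load("motion_b_temporal_lora.pt"), weight=0.8)
# merge_lora(unet, torch.load("subject_spatial_lora.pt"), weight=1.0)
```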
A number of excellent concurrent works were introduced around the same time as ours.
Customizing Motion in Text-to-Video Diffusion Models finetunes temporal layers in place with special tokens, in a DreamBooth-style manner.
VMC also finetunes temporal layers in place, using a frame residual vector loss in place of the standard objective.
MotionCrafter employs two parallel UNets and tunes one of them with an additional appearance normalization loss.
DreamVideo adds adapters over temporal attention layers, conditioned on a frame, to decompose motion from appearance.
MotionDirector applies dual-path LoRAs on spatial and temporal attention layers and trains them jointly with an appearance-debiased temporal loss.
@article{ren2024customize,
title={Customize-a-video: One-shot motion customization of text-to-video diffusion models},
author={Ren, Yixuan and Zhou, Yang and Yang, Jimei and Shi, Jing and Liu, Difan and Liu, Feng and Kwon, Mingi and Shrivastava, Abhinav},
journal={arXiv preprint arXiv:2402.14780},
year={2024}
}