Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, however, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variety. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model to the specific motion in the reference video. To disentangle spatial and temporal information during training, we introduce the novel concept of appearance absorbers, which detach the original appearance from the single reference video prior to motion learning. The proposed modules are trained in a staged pipeline and applied at inference in a plug-and-play fashion, enabling easy extension of our method to various downstream tasks such as custom video generation and editing, video appearance customization, and multiple motion combination.
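To illustrate the LoRA-on-temporal-attention idea, here is a minimal PyTorch sketch of a low-rank residual wrapped around a frozen attention projection. The class name, rank, and wrapping pattern are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA adapter: a frozen base linear layer plus a
    trainable low-rank residual (down-projection then up-projection)."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are tuned
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: no change to the base output at start
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Hypothetical usage on a T2V UNet's temporal attention projections:
# attn.to_q = LoRALinear(attn.to_q, rank=4)
```

Because only the small `down`/`up` factors are trainable, tuning them on a single reference video adapts the temporal layers without touching the pre-trained backbone.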
Customize-A-Video can cooperate with additional control signals to perform precise per-frame video editing.
Customize-A-Video is compatible with image customization modules, enabling videos to be customized both spatially and temporally.
Multiple Customize-A-Video modules can collaborate to generate videos with multiple target motions combined.
Customize-A-Video features a staged training pipeline, so its Temporal LoRA can be trained while loading third-party appearance absorbers tuned on other image data, provided they share similar appearances.
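The multi-motion combination above can be sketched as merging several low-rank deltas onto one frozen base weight, each with its own scale. The function name and interface here are hypothetical, intended only to show the plug-and-play arithmetic.

```python
import torch

def merge_lora_deltas(base_weight, loras, scales):
    """Hypothetical plug-and-play merge: add several low-rank deltas
    (up @ down) onto one frozen base weight, each with its own scale.
    A sketch of combining multiple LoRA modules, not the paper's
    exact mechanism."""
    merged = base_weight.clone()  # leave the pre-trained weight untouched
    for (up, down), s in zip(loras, scales):
        merged += s * (up @ down)  # rank-r update from one motion module
    return merged
```

Because each delta is additive, motion modules trained separately can be loaded, scaled, and combined at inference without retraining.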
Several excellent works on motion customization were introduced around the same time as ours.
Customizing Motion in Text-to-Video Diffusion Models finetunes temporal layers in place with special tokens, in a DreamBooth manner.
VMC also finetunes temporal layers in place, replacing the standard objective with a frame residual vector loss.
MotionCrafter employs two parallel UNets and tunes one of them with an additional appearance normalization loss.
DreamVideo adds specially designed adapters over temporal attention layers, conditioned on a single frame, to decompose pure motion from its appearance.
MotionDirector applies dual-path LoRAs to spatial and temporal attention layers and trains them jointly with an appearance-debiased temporal loss.