Diffuse Implicit Neural Representation for Video Synthesis

We present NeRV-Diffusion, an implicit latent diffusion model that generates neural network weights. The generated weights can be rearranged into a convolutional neural network that forms an implicit neural representation (INR) and decodes videos from time indices given as input. Our framework consists of two stages: 1) A hypernetwork-based encoder that compresses raw videos from pixel space to parameter space, where the output parameters serve as the weights of the INRs used for decoding. This stage is trained following the paradigm of a variational autoencoder (VAE). 2) A diffusion transformer that performs the denoising process on the encoded parametric latent, mapping random noise to neural network weights. Unlike traditional latent video diffusion models that work on frame-wise feature maps, NeRV-Diffusion generates a video as a unified neural network, enabling efficient decoding and flexible temporal interpolation. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the VAE bottleneck latent across all NeRV layers and reformulate its weight assignment and input coordinates. Moreover, we introduce an inverse SNR loss weight and implement scheduled sampling for effective training of the implicit diffusion model. Our model reaches superior video generation quality compared to previous INR-based generative models, with a compact INR size and a smooth, interpolable INR weight space.
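To make the "weights as video" idea concrete, here is a minimal sketch of how a flat weight vector could be rearranged into a small network that maps a time index to a frame. It is an illustration under loud assumptions, not the NeRV-Diffusion implementation: the layer shapes, the sin/cos time embedding, the tanh activation, and every function name are hypothetical; fully connected layers stand in for the convolutional layers described above; and the weights are plain Gaussian samples rather than the output of a trained diffusion transformer.

```python
import math
import random

# Hypothetical layer shapes for a toy INR: (fan_in, fan_out) per layer.
# The model described above is convolutional; fully connected layers
# are used here only to keep the sketch short.
LAYER_SHAPES = [(4, 8), (8, 6)]  # 4 time features -> 8 hidden -> 6 "pixels"


def num_params(shapes):
    """Total number of weights and biases across all layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


def rearrange(flat, shapes):
    """Split a flat weight vector into per-layer (weight matrix, bias) pairs."""
    assert len(flat) == num_params(shapes)
    layers, pos = [], 0
    for fan_in, fan_out in shapes:
        weight = [flat[pos + r * fan_in: pos + (r + 1) * fan_in]
                  for r in range(fan_out)]
        pos += fan_in * fan_out
        bias = flat[pos: pos + fan_out]
        pos += fan_out
        layers.append((weight, bias))
    return layers


def time_features(t):
    """Embed a normalized time index t in [0, 1] with sin/cos features."""
    return [math.sin(math.pi * t), math.cos(math.pi * t),
            math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)]


def decode_frame(layers, t):
    """Run the toy INR at time t and return a flat list of pixel values."""
    x = time_features(t)
    for i, (weight, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weight, bias)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]  # nonlinearity on hidden layers only
    return x


# Stand-in for a denoised latent: Gaussian values, NOT a trained sample.
rng = random.Random(0)
flat_weights = [rng.gauss(0.0, 1.0) for _ in range(num_params(LAYER_SHAPES))]
inr = rearrange(flat_weights, LAYER_SHAPES)

# Decode 4 frames on a regular time grid, plus one at an in-between index.
frames = [decode_frame(inr, k / 3) for k in range(4)]
between = decode_frame(inr, 0.5)
```

Because the network takes a continuous time index, `decode_frame` can be queried at any `t`, including values between the training grid points; this is the property that makes temporal interpolation a matter of choosing input coordinates rather than generating extra latent frames.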

NeRV-Diffusion: Diffuse Implicit Neural Representation for Video Synthesis

tl;dr: NeRV-Diffusion synthesizes videos by generating implicit neural representation weights from Gaussian noise.

Abstract

Unconditional Generation

Sky Time-lapse

TaiChi-HD

FaceForensics

Class-conditioned Generation

UCF-101