Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization

Abstract

We present Video4DGen, a novel framework that excels at generating 4D representations from single or multiple generated videos, as well as at generating 4D-guided videos. This capability is pivotal for creating high-fidelity virtual content that maintains both spatial and temporal coherence. The 4D outputs of Video4DGen are represented with our proposed Dynamic Gaussian Surfels (DGS), which optimize time-varying warping functions that transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. We design warped-state geometric regularization and refinements on the Gaussian surfels to preserve structural integrity and fine-grained appearance details, respectively. Additionally, to perform 4D generation from multiple videos and to effectively capture representations across the spatial, temporal, and pose dimensions, we design multi-video alignment, root pose optimization, and pose-guided frame sampling strategies. Leveraging continuous warping fields further enables a precise depiction of pose, motion, and deformation in every video frame. Finally, to improve overall fidelity as observed from all camera poses, Video4DGen performs novel-view video generation guided by the 4D content, using the proposed confidence-filtered DGS to enhance the quality of the generated sequences. In summary, Video4DGen produces dynamic 4D generations that handle diverse subject movements while preserving detail in both geometry and appearance, and it generates 4D-guided videos with high spatial and temporal coherence.
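The abstract describes time-varying warping functions that carry Gaussian surfels from a static canonical state to a dynamically warped state at each time step. The following is a minimal sketch of that idea, assuming a locally rigid warp: the surfel attribute layout, the `warp_fn` interface, and the `toy_warp` example are illustrative assumptions for exposition, not the paper's actual parameterization (which is a learned continuous warping field).

```python
import numpy as np

class GaussianSurfel:
    """Canonical (static-state) Gaussian surfel: a flat, oriented 2D
    Gaussian disk embedded in 3D. Attribute layout is an assumption."""
    def __init__(self, center, rotation, scale_uv, opacity, color):
        self.center = np.asarray(center, float)      # (3,) disk center
        self.rotation = np.asarray(rotation, float)  # (3,3) frame; column 2 = disk normal
        self.scale_uv = np.asarray(scale_uv, float)  # (2,) in-plane extents
        self.opacity = float(opacity)
        self.color = np.asarray(color, float)

    @property
    def normal(self):
        return self.rotation[:, 2]

def warp_surfel(surfel, warp_fn, t):
    """Move a surfel from its static state to the warped state at time t.
    `warp_fn(x, t) -> (R, T)` stands in for the learned time-varying
    warping field; a locally rigid warp is assumed here."""
    R, T = warp_fn(surfel.center, t)
    return GaussianSurfel(
        center=R @ surfel.center + T,
        rotation=R @ surfel.rotation,  # the disk normal follows the warp
        scale_uv=surfel.scale_uv,      # a rigid local warp preserves extents
        opacity=surfel.opacity,
        color=surfel.color,
    )

# Usage: a toy warp that slowly rotates about z and lifts along z over time.
def toy_warp(x, t):
    a = 0.1 * t
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    T = np.array([0.0, 0.0, 0.05 * t])
    return R, T

surfel = GaussianSurfel(center=[1.0, 0.0, 0.0], rotation=np.eye(3),
                        scale_uv=[0.02, 0.02], opacity=0.9,
                        color=[0.5, 0.5, 0.5])
warped = warp_surfel(surfel, toy_warp, t=3.0)
print(warped.center, warped.normal)
```

In practice the warp would be optimized per scene (e.g., by backpropagating a rendering loss through the warped surfels), which is why the paper additionally regularizes the warped-state geometry to keep the deformed surfels on a coherent surface.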

4D Generation from Single or Multiple Generated Videos

4D Novel View Synthesis

Multi-camera Video Generation from 4D Guidance

[Gallery: single input images and the corresponding 4D-guided multi-camera generated videos]

360° Video Generation (Static) from 4D Guidance