SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix
[Peng Dai], [Feitong Tan*], [Qiangeng Xu*], [David Futschik], [Ruofei Du], [Sean Fanello], [Xiaojuan Qi], [Yinda Zhang]
Paper, Project_page
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
```
conda create -n svg python=3.8
conda activate svg
# we use torch 2.4; other versions may also work
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
- Please download the datasets used in this paper. The layout looks like this:

```
SVG
├── Videos
│   ├── An_astronaut_in_full_space_suit_riding_a_horse_320x576
│   │   └── images
│   │       ├── 00000.jpg
│   │       └── 00001.jpg
│   └── Obama_is_speaking_320x576
│       └── images
│           ├── 00000.jpg
│           └── 00001.jpg
```
- Depth prediction. Predict depth for each frame. Run this step under the `depthanything_v1` environment.

```
cd third_party/Depth-Anything/
python svg_run.py --encoder vitl --img-path ./Videos/A_teddybear_dancing_in_the_snow_320x576/images/ --outdir ./Videos/A_teddybear_dancing_in_the_snow_320x576/depths/
```

- Depth stabilization. We stabilize the depth changes along the time axis. Run this step under the RAFT environment, and install the guided filter by running `pip install opencv-contrib-python`.

```
cd third_party/RAFT/
python svg_enhance_depth.py --path=./Videos/A_teddybear_dancing_in_the_snow_320x576/images/ --depth_path=./Videos/A_teddybear_dancing_in_the_snow_320x576/depths/ --out_dir=./Videos/A_teddybear_dancing_in_the_snow_320x576/depths_refined/
```

- Construct the frame matrix.

```
python create_frame_matrix.py --img_path=./Videos/A_teddybear_dancing_in_the_snow_320x576/ --depth_path=./Videos/A_teddybear_dancing_in_the_snow_320x576/ --output_root=./Videos/frame_matrix/A_teddybear_dancing_in_the_snow/ --num_frames 16 --width 576 --height 320 --max_baseline 0.08 --num_training_views 8 --left2right
```

- Stereoscopic video generation.

```
python svg.py --prompt 'A teddybear dancing in the snow' --init_image ./Videos/frame_matrix/A_teddybear_dancing_in_the_snow --output_root ./results/A_teddybear_dancing_in_the_snow/ --num_frames 16 --width 576 --height 320 --latent_w 72 --latent_h 40
```

- Multi-view video generation.

```
python svg.py --prompt 'A teddybear dancing in the snow' --init_image ./Videos/frame_matrix/A_teddybear_dancing_in_the_snow --output_root ./results/A_teddybear_dancing_in_the_snow/ --num_frames 16 --width 576 --height 320 --latent_w 72 --latent_h 40 --frame_matrix_end 0 --fix_last_view_begin 600
```

For more useful commands, please refer to `run.sh`.
We use estimated relative depth and normalize it into the 1~10m range. Note that the closer the foreground content is, the more disoccluded area needs to be inpainted, making the task more challenging. You can modify the depth values to achieve the stereoscopic effect you prefer. Alternatively, metric depth is a good choice if available.
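The remapping described above can be sketched as follows. This is an illustrative snippet, not code from the repository: the function name `normalize_depth` and the `near`/`far` defaults are ours, and it assumes the input is already depth-like (invert it first if your predictor outputs disparity, where larger values mean closer).

```python
import numpy as np

def normalize_depth(rel_depth, near=1.0, far=10.0):
    """Map a relative depth map into the [near, far] metric range (meters).

    Bringing `near` closer (or widening the range) increases disparity
    and enlarges the disoccluded regions that must be inpainted.
    """
    d = rel_depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # rescale to [0, 1]
    return near + d * (far - near)

# Example: the nearest pixel lands at ~1 m, the farthest at ~10 m.
depth_m = normalize_depth(np.array([0.0, 0.5, 1.0]))
```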
We use two parallel cameras (FOV: 80°) with a 7cm distance between them (objects will appear in front of the screen). Depending on your preference, you can change the FOV or toe-in the two cameras (so that their optical axes converge at a point).
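As a rough sanity check of this stereo setup, the screen disparity induced by the parallel cameras can be estimated with a standard pinhole model. The function name and defaults below are illustrative assumptions, not part of the codebase:

```python
import math

def pixel_disparity(depth_m, baseline_m=0.07, fov_deg=80.0, width_px=576):
    """Horizontal disparity (in pixels) of a point at `depth_m` meters,
    seen by two parallel cameras with the given baseline and horizontal FOV.

    With parallel cameras all disparities share one sign, so content
    appears in front of the screen plane; larger disparity = closer.
    """
    focal_px = (width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    return focal_px * baseline_m / depth_m
```

For example, with the defaults above a point 1 m away shifts by roughly 24 pixels between the two views, while a point 10 m away shifts by only about 2.4 pixels.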
We place 8 cameras between the left and right views. You can reduce the number of cameras to speed up generation when the scene is easy or the disoccluded regions are small; in practice, we found that 4 cameras also produce competitive results.
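The intermediate views can be pictured as a set of baselines between the two eye positions. The even spacing below is an assumption for illustration (the actual placement is determined by `create_frame_matrix.py` via `--max_baseline` and `--num_training_views`):

```python
def intermediate_baselines(max_baseline=0.08, num_views=8):
    """Evenly spaced camera offsets strictly between the left view (0)
    and the right view (`max_baseline`), in meters.

    Fewer intermediate views shrink the frame matrix and speed up
    denoising, at the cost of larger per-view disocclusions.
    """
    step = max_baseline / (num_views + 1)
    return [step * i for i in range(1, num_views + 1)]
```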
The current implementation is based on a text2video model (i.e., ZeroScope) and therefore requires text prompts as inputs, which influence the final results. Moreover, applying frame matrix denoising to other advanced video generation models with superior backbones (e.g., DiT) is worth trying.
Please consider starring this repository and citing the following paper if you find this repository useful.
@inproceedings{
dai2025svg,
title={{SVG}: 3D Stereoscopic Video Generation via Denoising Frame Matrix},
author={Peng Dai and Feitong Tan and Qiangeng Xu and David Futschik and Ruofei Du and Sean Fanello and Xiaojuan Qi and Yinda Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=sx2jXZuhIx}
}
If you have any questions, you can email me (daipengwa@gmail.com).