SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix
[Peng Dai], [Feitong Tan*], [Qiangeng Xu*], [David Futschik], [Ruofei Du], [Sean Fanello], [Xiaojuan Qi], [Yinda Zhang]
Paper, Project_page
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
```
conda create -n svg python=3.8
conda activate svg
# we use torch 2.4; other versions may also work
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
- Please download the datasets used in this paper. The layout looks like this:

```
SVG
├── Videos
│   ├── An_astronaut_in_full_space_suit_riding_a_horse_320x576
│   │   └── images
│   │       ├── 00000.jpg
│   │       └── 00001.jpg
│   └── Obama_is_speaking_320x576
│       └── images
│           ├── 00000.jpg
│           └── 00001.jpg
```
- Depth prediction. Predict depth for each frame. Run this step under the `depthanything_v1` environment.

```
cd third_party/Depth-Anything/
python svg_run.py --encoder vitl --img-path ./Videos/A_teddybear_dancing_in_the_snow_320x576/images/ --outdir ./Videos/A_teddybear_dancing_in_the_snow_320x576/depths/
```

- Depth stabilization. We stabilize the depth changes along the time axis. Run this step under the RAFT environment, and install the guided filter by running `pip install opencv-contrib-python`.

```
cd third_party/RAFT/
python svg_enhance_depth.py --path=./Videos/A_teddybear_dancing_in_the_snow_320x576/images/ --depth_path=./Videos/A_teddybear_dancing_in_the_snow_320x576/depths/ --out_dir=./Videos/A_teddybear_dancing_in_the_snow_320x576/depths_refined/
```

- Construct the frame matrix.

```
python create_frame_matrix.py --img_path=./Videos/A_teddybear_dancing_in_the_snow_320x576/ --depth_path=./Videos/A_teddybear_dancing_in_the_snow_320x576/ --output_root=./Videos/frame_matrix/A_teddybear_dancing_in_the_snow/ --num_frames 16 --width 576 --height 320 --max_baseline 0.08 --num_training_views 8 --left2right
```

- Stereoscopic video generation.

```
python svg.py --prompt 'A teddybear dancing in the snow' --init_image ./Videos/frame_matrix/A_teddybear_dancing_in_the_snow --output_root ./results/A_teddybear_dancing_in_the_snow/ --num_frames 16 --width 576 --height 320 --latent_w 72 --latent_h 40
```

- Multi-view video generation.

```
python svg.py --prompt 'A teddybear dancing in the snow' --init_image ./Videos/frame_matrix/A_teddybear_dancing_in_the_snow --output_root ./results/A_teddybear_dancing_in_the_snow/ --num_frames 16 --width 576 --height 320 --latent_w 72 --latent_h 40 --frame_matrix_end 0 --fix_last_view_begin 600
```

For more useful commands, please refer to `run.sh`.
We use estimated relative depth and normalize it into the 1~10m range. Note that the closer the foreground content is, the more disoccluded area needs to be inpainted, making the task more challenging. You can modify the depth values to achieve the stereoscopic effect you prefer. Alternatively, metric depth is a good choice if available.
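The remapping described above can be sketched as follows. This is an illustrative snippet, not code from the repository: the function name `normalize_depth` and the `near`/`far` defaults are ours, and it assumes the input is already depth-like (invert it first if your predictor outputs disparity, where larger values mean closer).

```python
import numpy as np

def normalize_depth(rel_depth, near=1.0, far=10.0):
    """Map a relative depth map into the [near, far] metric range (meters).

    Bringing `near` closer (or widening the range) increases disparity
    and enlarges the disoccluded regions that must be inpainted.
    """
    d = rel_depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # rescale to [0, 1]
    return near + d * (far - near)

# Example: the nearest pixel lands at ~1 m, the farthest at ~10 m.
depth_m = normalize_depth(np.array([0.0, 0.5, 1.0]))
```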
We use two parallel cameras (FOV: 80°) with a 7cm distance between them (objects will appear in front of the screen). Depending on your preference, you can change the FOV or toe-in the two cameras (so that their optical axes converge at a point).
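As a rough sanity check of this stereo setup, the screen disparity induced by the parallel cameras can be estimated with a standard pinhole model. The function name and defaults below are illustrative assumptions, not part of the codebase:

```python
import math

def pixel_disparity(depth_m, baseline_m=0.07, fov_deg=80.0, width_px=576):
    """Horizontal disparity (in pixels) of a point at `depth_m` meters,
    seen by two parallel cameras with the given baseline and horizontal FOV.

    With parallel cameras all disparities share one sign, so content
    appears in front of the screen plane; larger disparity = closer.
    """
    focal_px = (width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    return focal_px * baseline_m / depth_m
```

For example, with the defaults above a point 1 m away shifts by roughly 24 pixels between the two views, while a point 10 m away shifts by only about 2.4 pixels.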
We place 8 cameras between the left and right views. You can reduce the number of cameras to speed up generation when the scene is easy or the disoccluded regions are small; in practice, we found that 4 cameras also produce competitive results.
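The intermediate views can be pictured as a set of baselines between the two eye positions. The even spacing below is an assumption for illustration (the actual placement is determined by `create_frame_matrix.py` via `--max_baseline` and `--num_training_views`):

```python
def intermediate_baselines(max_baseline=0.08, num_views=8):
    """Evenly spaced camera offsets strictly between the left view (0)
    and the right view (`max_baseline`), in meters.

    Fewer intermediate views shrink the frame matrix and speed up
    denoising, at the cost of larger per-view disocclusions.
    """
    step = max_baseline / (num_views + 1)
    return [step * i for i in range(1, num_views + 1)]
```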
The current implementation is based on a text2video model (i.e., ZeroScope) and therefore requires text prompts as inputs, which influence the final results. Moreover, applying frame matrix denoising to other advanced video generation models with superior backbones (e.g., DiT) is worth trying.
Please consider starring this repository and citing the following paper if you find this repository useful.
@inproceedings{
dai2025svg,
title={{SVG}: 3D Stereoscopic Video Generation via Denoising Frame Matrix},
author={Peng Dai and Feitong Tan and Qiangeng Xu and David Futschik and Ruofei Du and Sean Fanello and Xiaojuan Qi and Yinda Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=sx2jXZuhIx}
}
If you have any questions, you can email me (daipengwa@gmail.com).