Euphonium is a novel framework for steering video flow matching via process reward gradient guided stochastic dynamics. It employs a Dual-Reward Group Relative Policy Optimization algorithm that combines latent process rewards for efficient credit assignment and pixel-level outcome rewards for visual fidelity, significantly accelerating training convergence.
- Release training scripts
- Release inference scripts
- Release reward model checkpoints
While online Reinforcement Learning has emerged as a crucial technique for aligning flow matching models with human preferences, current approaches are hindered by inefficient exploration during training rollouts. Relying on undirected stochasticity and sparse outcome rewards, these methods struggle to discover high-reward samples, resulting in data-inefficient and slow optimization. To address these limitations, we propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled Stochastic Differential Equation that explicitly incorporates the gradient of a Process Reward Model into the flow drift. This design enables dense, step-by-step steering toward high-reward regions, advancing beyond the unguided exploration in prior works, and theoretically encompasses existing sampling methods (e.g., Flow-GRPO, DanceGRPO) as special cases. We further derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model. We instantiate this framework with a Dual-Reward Group Relative Policy Optimization algorithm, combining latent process rewards for efficient credit assignment with pixel-level outcome rewards for final visual fidelity. Experiments on text-to-video generation show that Euphonium achieves better alignment compared to existing methods while accelerating training convergence by 1.66x.
git clone --recursive https://github.com/zerzerzerz/Euphonium.git
cd Euphonium

Note: The --recursive flag is required to properly initialize the submodules.
This repository contains the following submodules in third_party/:
| Submodule | Description | Repository |
|---|---|---|
| Latent_PRM | Latent-space Process Reward Model for training rollout reward gradient guidance | GitHub |
| SoliReward | Pixel-space Outcome Reward Model for visual fidelity | GitHub |
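A quick way to confirm that the clone picked up both submodules is to check that their directories are populated. The sketch below is only an illustration: check_submodules is a hypothetical helper that is not part of the repository, and it assumes each submodule lives at third_party/<name> as the table suggests.

```shell
#!/usr/bin/env bash
# Illustrative only: report whether each submodule directory has been populated.
# check_submodules is a hypothetical helper; the third_party/<name> layout is
# assumed from the table above.
check_submodules() {
  local repo_root="$1" name missing=0
  for name in Latent_PRM SoliReward; do
    # An absent or empty directory means the submodule was not initialized.
    if [ -z "$(ls -A "${repo_root}/third_party/${name}" 2>/dev/null)" ]; then
      echo "missing: third_party/${name}"
      missing=1
    fi
  done
  return "${missing}"
}

# Run from the repository root. If anything is missing, the standard git fix is:
#   git submodule update --init --recursive
check_submodules "." || echo "Some submodules are not initialized."
```

If the check reports a missing directory, running git submodule update --init --recursive inside the repository fetches the submodules without re-cloning.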
- CUDA 12.4 is required for optimal performance and compatibility.
# Run the environment setup script to handle all dependencies
bash scripts/env_setup.sh

The following models are required:
- HunyuanVideo: Main generation model
  - Download from Hugging Face
- Reward Models (Required):
  - SoliReward (pixel-space ORM, e.g., InternVL3-1B) - Required
  - Latent PRM (latent-space Process Reward Model) - Required
  - VideoAlign reward model - Optional
The reward models are available at HuggingFace. To train Euphonium, you only need the physics deformity ORM and the latent PRM. You can download them using the following commands:
# Download Physics Deformity ORM
huggingface-cli download Yukino271828/SoliReward --include "pixel_orm/physics-deformity-HPQA-InternVL3-1B/*" --local-dir checkpoints/SoliReward
# Download Latent PRM
huggingface-cli download Yukino271828/SoliReward --include "latent_prm/*" --local-dir checkpoints/SoliReward
Data preprocessing converts raw text prompts into precomputed text embeddings to accelerate training. The output index uses absolute paths for direct use across different working directories.
Input Example (prompts.txt):
A dog running in the park.
A cat sleeping on the couch.
A person playing basketball.
Output Structure:
output_dir/
├── prompt_embed/ # Text embedding vectors (.pt files)
├── prompt_attention_mask/ # Attention masks (.pt files)
└── videos2caption.json # Data index file (contains absolute paths)
- Prepare your prompt file: Create a text file with one prompt per line.
- Configure and Run: Edit scripts/preprocess_hunyuan_text_embeddings.sh to set WORKDIR, MODEL_PATH, and PROMPT_PATH, then run:

bash scripts/preprocess_hunyuan_text_embeddings.sh
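As a minimal sketch, the prompt file from the input example can be written with a here-document and sanity-checked before preprocessing. The temporary directory and the count_prompts helper are illustrative assumptions, not part of the repository. In practice, PROMPT_PATH in the preprocessing script would point at your own file.

```shell
#!/usr/bin/env bash
# Illustrative only: write the example prompts to a file, one prompt per line.
# PROMPT_DIR, PROMPT_FILE and count_prompts are hypothetical names.
PROMPT_DIR="$(mktemp -d)"
PROMPT_FILE="${PROMPT_DIR}/prompts.txt"

cat > "${PROMPT_FILE}" <<'EOF'
A dog running in the park.
A cat sleeping on the couch.
A person playing basketball.
EOF

# Count non-empty lines; each one is treated as a single prompt.
count_prompts() {
  grep -c -v '^[[:space:]]*$' "$1"
}

echo "Prepared $(count_prompts "${PROMPT_FILE}") prompts in ${PROMPT_FILE}"
```

Counting non-empty lines up front makes it easy to compare against the number of entries that later appear in videos2caption.json.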
Edit scripts/train_grpo_hunyuan.sh to configure variables:
- HUNYUAN_VIDEO_PATH: Path to the base model.
- TRL_MODEL_PATH: Path to your reward model.
- data_json_path: Path to videos2caption.json from the preprocessing step.
Recommended: Use pssh (parallel-ssh) to launch training across multiple nodes.
- Create a hostfile: List one IP per line.

192.168.1.1
192.168.1.2

- Run on each node:

export hostfile=/path/to/hostfile
bash scripts/train_grpo_hunyuan.sh

Or use pssh for parallel launch:

pssh -h hostfile -i "cd /path/to/Euphonium && bash scripts/train_grpo_hunyuan.sh"
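Before launching, it can help to confirm that the hostfile really contains one IP per line. The following is a rough sketch under stated assumptions: is_valid_hostfile is a hypothetical helper, the hostfile is written to a temporary path, and the pattern only checks that each line looks like a dotted IPv4 address (it does not bound each octet to 0-255).

```shell
#!/usr/bin/env bash
# Illustrative only: write a hostfile and check that every line looks like an IPv4 address.
# is_valid_hostfile is a hypothetical helper, not part of the repository.
is_valid_hostfile() {
  # grep -v selects lines that do NOT match; the file is valid when there are none.
  ! grep -q -v -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' "$1"
}

hostfile="$(mktemp)"
printf '%s\n' 192.168.1.1 192.168.1.2 > "${hostfile}"

if is_valid_hostfile "${hostfile}"; then
  export hostfile
  echo "hostfile OK: $(wc -l < "${hostfile}") node(s)"
else
  echo "hostfile contains a line that is not an IPv4 address" >&2
fi
```

A stray blank line or trailing comment in the hostfile fails this check, so formatting mistakes surface before any node starts training.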
- TensorBoard: Checkpoints and logs are saved in ${output_base_dir}/tensorboard/.
- Logs: Detailed execution logs are in ${output_base_dir}/logs/.
The following environment variables control the reward models and RGG (Reward Gradient Guidance) behavior:
- TRL_CORE_ENABLED: Enable pixel-space ORM (SoliReward)
- LATENT_REWARD_IN_TRL_REWARD_ENABLED: Enable latent-space PRM
- LATENT_REWARD_IN_TRL_REWARD_COEF: Coefficient for latent reward in total reward calculation (default: 0, computed separately in PRM advantage)
- PROCESS_LATENT_REWARD_ENABLED: Enable process latent reward model computation
- PROCESS_LATENT_REWARD_SAMPLING_ENABLED: Enable RGG (Reward Gradient Guidance) during sampling
- PROCESS_LATENT_REWARD_TRAINING_ENABLED: ⚠️ Deprecated - Not used
- USE_REWARD_GUIDED_MEAN_FOR_LOGPROB: Use RGG to compute mean when calculating log probability during sampling
- USE_REWARD_GUIDED_MEAN_FOR_LOGPROB_TRAINING: Use RGG to compute mean when calculating log probability during training
- PROCESS_LATENT_REWARD_GUIDANCE_SCALE: Guidance scale coefficient for RGG strength
- USE_DELTA_T_FOR_GRADIENT_SCALING: Whether to multiply RGG coefficient by delta_t (step size)
- PROCESS_REWARD_ADVANTAGE_MODE: Mode for computing dual-reward advantage
  - both: Use both PRM and ORM
  - none: Use only ORM (Outcome Reward Model)
  - only: Use only PRM (Process Reward Model)
- SPSA_REWARD_ENABLED: Use SPSA-estimated ORM gradient for latents
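Since PROCESS_REWARD_ADVANTAGE_MODE accepts only the three values listed above, a small pre-launch check can catch typos early. This is a sketch, not the repository's own validation: check_advantage_mode is a hypothetical helper, and falling back to both when the variable is unset is an assumption made here for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: validate PROCESS_REWARD_ADVANTAGE_MODE before launching training.
# check_advantage_mode is a hypothetical helper, not part of the repository.
check_advantage_mode() {
  case "$1" in
    both) echo "PRM + ORM" ;;
    none) echo "ORM only" ;;
    only) echo "PRM only" ;;
    *)    echo "unsupported mode: $1" >&2; return 1 ;;
  esac
}

# Falling back to "both" when the variable is unset is an assumption of this sketch.
mode="${PROCESS_REWARD_ADVANTAGE_MODE:-both}"
check_advantage_mode "${mode}"
```

Note that none and only describe how the PRM is used: none drops the PRM and keeps the ORM, while only keeps the PRM alone.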
Edit scripts/vis_hunyuanvideo.sh to set your model and data paths, then run:
bash scripts/vis_hunyuanvideo.sh

If you use Euphonium, please cite our paper:
@article{zhong2026euphonium,
title={Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics},
author={Zhong, Ruizhe and Lian, Jiesong and Mi, Xiaoyue and Zhou, Zixiang and Zhou, Yuan and Lu, Qinglin and Yan, Junchi},
journal={arXiv preprint arXiv:2602.04928},
year={2026}
}

Apache License 2.0
We would like to thank the following projects for their contributions:
