🚀 Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion


PyTorch implementation of Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

📚 Quick Start

Run the following script to start inference:

```shell
bash run.sh
```

🧾 Arguments

| Argument | Description |
| --- | --- |
| `--save_dir` | Directory to save the generated results |
| `--device` | Device to use for inference (`cuda`, `cpu`, etc.) |
| `--num_images_per_prompt` | Number of images to generate per prompt |
| `--num_inference_steps` | Total number of denoising steps; higher values improve quality but slow generation |
| `--model_path` | Path to the diffusion model (e.g., HunyuanDiT) |
| `--vl_model_path` | Path to the vision-language model used for evaluation (e.g., Qwen2.5-VL) |
| `--seed` | Random seed for reproducibility |
| `--process_steps` | Number of self-reflection optimization steps |
| `--process_steps_interval` | Interval (in denoising steps) between self-reflection steps during generation |
| `--process_start` | Fraction of inference steps after which self-reflection starts (e.g., `0.1`) |
| `--process_end` | Fraction of inference steps at which self-reflection stops (e.g., `0.9`) |
| `--use_self_reflection` | Whether to enable self-reflection during image generation |
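Taken together, `--process_start`, `--process_end`, and `--process_steps_interval` define a schedule over the denoising trajectory. A minimal sketch of how these flags might combine (the function name and exact firing rule here are assumptions for illustration, not taken from this repo):

```python
# Illustrative only: sketches how the self-reflection window flags could
# translate into concrete denoising-step indices.
def reflection_steps(num_inference_steps, process_start, process_end,
                     process_steps_interval):
    """Return the step indices at which a self-reflection pass would run,
    assuming it fires every `process_steps_interval` steps within the
    [process_start, process_end] fraction of the trajectory."""
    start = int(num_inference_steps * process_start)
    end = int(num_inference_steps * process_end)
    return [t for t in range(start, end + 1)
            if (t - start) % process_steps_interval == 0]

# With 50 steps, a 0.1-0.9 window, and an interval of 10:
print(reflection_steps(50, 0.1, 0.9, 10))  # [5, 15, 25, 35, 45]
```

Under this reading, a wider window or a smaller interval trades speed for more frequent correction passes.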

📁 Output Structure

Generated outputs are saved under the directory specified by `--save_dir`. Example structure:

```
results/
├── image_0.png
├── image_1.png
├── ...
└── prompts.txt
```

📦 Requirements

Make sure the following libraries are installed:

- `diffusers`
- `transformers`
- `torch`

Alternatively, install all dependencies from `requirements.txt`.