sdxl: add Neuron 2K/4K high-res via img2img upscale (57.94s/2K, 142.62s/4K) #1

Open
jimburtoft wants to merge 1 commit into xniwangaws:main from jimburtoft:sdxl-highres-img2img


@jimburtoft

Summary

Adds a working approach for SDXL high-resolution generation on Neuron that bypasses the monolithic compilation blockers (host RAM overflow at 2K, instruction count limit at 4K).

  • 2048x2048: 57.94s +/- 0.02s (10/10 seeds pass)
  • 4096x4096: 142.62s +/- 0.01s (3/3 seeds pass)
  • Neuron beats L4 FP8+compile: 1.29x faster at 2K, 3.86x faster at 4K
  • At 4K, Neuron is 22% cheaper than H100 BF16 ($0.089 vs $0.113/image)

Approach

Generate at 1024x1024 (proven compiled NEFFs) -> bicubic upscale -> tiled VAE encode -> add partial noise (strength=0.35, 18/50 steps) -> tiled denoise refinement -> tiled VAE decode.

This is the standard SDXL high-res workflow (the same pattern as the SDXL Refiner). The 1K generation establishes global coherence; the tiled refinement adds only local high-frequency detail.
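
The step count and tile layout implied by this pipeline can be sketched in plain Python. This is a minimal sketch, not the code in benchmark_img2img.py: the helper names, the 128-cell latent tile (a 1024-pixel tile at the SDXL VAE's 8x downscale), and the 16-cell overlap are all assumptions made for illustration. Rounding 0.35 x 50 = 17.5 up is assumed only because it matches the 18 steps reported above; the script's actual rounding rule is not shown here.

```python
import math

VAE_SCALE = 8  # the SDXL VAE downsamples each spatial dimension by 8


def refinement_steps(strength: float, total_steps: int) -> int:
    """Denoising steps run in the img2img pass (assumption: round up)."""
    return math.ceil(strength * total_steps)


def tile_starts(size: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles that cover [0, size) with at least `overlap` shared cells."""
    if size <= tile:
        return [0]
    stride = tile - overlap
    count = math.ceil((size - tile) / stride) + 1
    # Clamp the last tile so it ends exactly at the boundary.
    return [min(i * stride, size - tile) for i in range(count)]


def tile_boxes(image_px: int, tile: int = 128, overlap: int = 16):
    """(top, left, bottom, right) latent-space boxes for a square image."""
    latent = image_px // VAE_SCALE
    starts = tile_starts(latent, tile, overlap)
    return [(y, x, y + tile, x + tile) for y in starts for x in starts]


if __name__ == "__main__":
    print("steps:", refinement_steps(0.35, 50))
    for px in (2048, 4096):
        print(px, "->", len(tile_boxes(px)), "tiles")
```

Under these assumed parameters, a 2048x2048 image maps to a 256x256 latent covered by a 3x3 grid of tiles, and a 4096x4096 image to a 5x5 grid. Each tile keeps the 128x128 latent shape that a 1024x1024 compiled model expects, which is the property that would let the existing NEFFs be reused.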

Files Added

  • sdxl-benchmark/highres_img2img/benchmark_img2img.py -- Self-contained script (compile + run + benchmark modes)
  • sdxl-benchmark/highres_img2img/README.md -- Approach explanation, results, usage
  • sdxl-benchmark/highres_img2img/results.json -- Structured benchmark data
  • sdxl-benchmark/highres_img2img/results_2048/seed42.png -- Sample 2K output
  • sdxl-benchmark/highres_img2img/results_4096/seed42.png -- Sample 4K output

README.en.md Updates

  • Updated 2K/4K tables (Neuron no longer shows "compile blocked")
  • Added Neuron img2img rows with latency, $/image, and comparison ratios
  • Updated conclusions section (Neuron beats L4 at all resolutions)
  • Added reproduction instructions for img2img approach

Key Technical Notes

  • scale_model_input() is critical for EulerDiscreteScheduler correctness
  • Naive tiled diffusion (MultiDiffusion from pure noise) does NOT work for SDXL -- self-attention needs global context
  • The img2img upscale approach works because the 1K base provides global coherence
  • 7 compiled NEFFs are needed: the standard 5 plus the VAE encoder and quant_conv
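
The first note can be illustrated with the two formulas that EulerDiscreteScheduler applies in diffusers: noise is added as latents + sigma * noise, and the sample is divided by sqrt(sigma^2 + 1) before it is passed to the UNet. The sketch below restates them on plain Python lists. The function names mirror the diffusers methods, but this is standalone illustrative code, not the scheduler itself and not the PR's script.

```python
import math


def add_noise(latents, noise, sigma):
    """Euler-style noising: x_sigma = x_0 + sigma * noise."""
    return [x + sigma * n for x, n in zip(latents, noise)]


def scale_model_input(sample, sigma):
    """Divide by sqrt(sigma^2 + 1) so the UNet sees roughly unit-variance input."""
    factor = math.sqrt(sigma ** 2 + 1.0)
    return [x / factor for x in sample]


if __name__ == "__main__":
    latents = [0.5, -1.0, 0.25]   # illustrative values, not real latents
    noise = [1.0, 0.0, -1.0]
    sigma = 2.0                   # illustrative noise level
    noisy = add_noise(latents, noise, sigma)
    print("unscaled:", noisy)
    print("scaled:  ", scale_model_input(noisy, sigma))
```

If the scaling step is skipped in a hand-written denoising loop, the UNet receives inputs whose magnitude grows with sigma, which is a plausible reason the note flags it as critical.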

…2s/4K)

Adds a working approach for SDXL high-resolution generation on Neuron that
bypasses the monolithic compilation blockers (host RAM overflow at 2K,
instruction count limit at 4K).

Approach: Generate at 1024x1024 -> upscale -> tiled VAE encode -> partial
noise (strength=0.35) -> tiled denoise refinement (18 steps) -> tiled VAE
decode. Uses existing 1K compiled NEFFs with no recompilation needed.

Results:
- 2048x2048: 57.94s +/- 0.02s (10/10 seeds pass)
- 4096x4096: 142.62s +/- 0.01s (3/3 seeds pass)

Neuron beats L4 FP8+compile at both resolutions:
- 2K: 1.29x faster (57.94s vs 74.85s)
- 4K: 3.86x faster (142.62s vs 550.21s)
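
The speedup figures follow directly from the latencies quoted above. A minimal check, in which the function and dictionary names are illustrative only:

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Latencies in seconds, as quoted in this PR.
L4_FP8_COMPILE = {"2K": 74.85, "4K": 550.21}
NEURON_IMG2IMG = {"2K": 57.94, "4K": 142.62}

for res in ("2K", "4K"):
    ratio = speedup(L4_FP8_COMPILE[res], NEURON_IMG2IMG[res])
    print(f"{res}: {ratio:.2f}x faster")
```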