sdxl: add Neuron 2K/4K high-res via img2img upscale (57.94s/2K, 142.62s/4K) by jimburtoft · Pull Request #1 · xniwangaws/NeuronStuff

jimburtoft · 2026-05-23T15:48:44Z

Summary

Adds a working approach for SDXL high-resolution generation on Neuron that bypasses the monolithic compilation blockers (host RAM overflow at 2K, instruction count limit at 4K).

2048x2048: 57.94s +/- 0.02s (10/10 seeds pass)
4096x4096: 142.62s +/- 0.01s (3/3 seeds pass)
Neuron beats L4 FP8+compile: 1.29x faster at 2K, 3.86x faster at 4K
At 4K, Neuron is 22% cheaper than H100 BF16 ($0.089 vs $0.113/image)
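
The speedup ratios quoted above can be checked against the L4 latencies given in the commit message (74.85s at 2K, 550.21s at 4K). A minimal sketch of that arithmetic follows; the variable and function names are illustrative and are not taken from the benchmark script.

```python
# Mean latencies in seconds, copied from this PR's description and commit message.
neuron_latency_s = {"2K": 57.94, "4K": 142.62}
l4_latency_s = {"2K": 74.85, "4K": 550.21}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Speedup of Neuron over L4 FP8+compile, rounded to two decimals as in the PR text.
speedups = {
    res: round(speedup(l4_latency_s[res], neuron_latency_s[res]), 2)
    for res in neuron_latency_s
}

for res, ratio in speedups.items():
    print(f"{res}: Neuron is {ratio:.2f}x faster than L4 FP8+compile")
```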

Approach

Generate at 1024x1024 (using the proven compiled NEFFs) -> bicubic upscale -> tiled VAE encode -> add partial noise (strength=0.35, 18 of 50 steps) -> tiled denoise refinement -> tiled VAE decode.

This is the standard SDXL high-res workflow (the same pattern as the SDXL Refiner). The 1K generation establishes global coherence; the tiled refinement only adds local high-frequency detail.
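
To illustrate the tiled stages, here is a minimal NumPy sketch of running a fixed-size operation over overlapping latent tiles and averaging the overlaps. It assumes 128x128 latent tiles (1024 pixels at the SDXL VAE's 8x downsampling), a 32-latent overlap, and uniform averaging. The overlap, the blending rule, and the names `tile_starts` and `tiled_apply` are hypothetical; `benchmark_img2img.py` may tile and blend differently.

```python
import numpy as np


def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles of size `tile` that cover `length` with the given overlap."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile is flush with the far edge
    return starts


def tiled_apply(latents: np.ndarray, fn, tile: int = 128, overlap: int = 32) -> np.ndarray:
    """Apply `fn` to each (C, tile, tile) patch and average the overlapping regions.

    `fn` stands in for a fixed-shape compiled model call (assumption) and must
    return an array with the same shape as its input patch.
    """
    _, height, width = latents.shape
    out = np.zeros(latents.shape, dtype=np.float64)
    weight = np.zeros((1, height, width), dtype=np.float64)
    for y in tile_starts(height, tile, overlap):
        for x in tile_starts(width, tile, overlap):
            patch = latents[:, y:y + tile, x:x + tile]
            out[:, y:y + tile, x:x + tile] += fn(patch)
            weight[:, y:y + tile, x:x + tile] += 1.0
    return out / weight


# A 2048x2048 image corresponds to a 4-channel 256x256 latent.
latents_2k = np.random.default_rng(0).standard_normal((4, 256, 256))
result = tiled_apply(latents_2k, lambda patch: patch)  # identity stand-in for the model
```

With an identity function the output equals the input, which is a quick way to confirm that the tiles cover every latent position exactly once after averaging.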

Files Added

sdxl-benchmark/highres_img2img/benchmark_img2img.py -- Self-contained script (compile + run + benchmark modes)
sdxl-benchmark/highres_img2img/README.md -- Approach explanation, results, usage
sdxl-benchmark/highres_img2img/results.json -- Structured benchmark data
sdxl-benchmark/highres_img2img/results_2048/seed42.png -- Sample 2K output
sdxl-benchmark/highres_img2img/results_4096/seed42.png -- Sample 4K output

README.en.md Updates

Updated 2K/4K tables (Neuron no longer shows "compile blocked")
Added Neuron img2img rows with latency, $/image, and comparison ratios
Updated conclusions section (Neuron beats L4 at all resolutions)
Added reproduction instructions for img2img approach

Key Technical Notes

scale_model_input() is critical for EulerDiscreteScheduler correctness
Naive tiled diffusion (MultiDiffusion from pure noise) does NOT work for SDXL -- self-attention needs global context
The img2img upscale approach works because the 1K base provides global coherence
7 compiled NEFFs are needed (the standard 5 plus the VAE encoder and quant_conv)
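
The first note concerns the input scaling that an Euler-style scheduler applies before each UNet call. The sketch below shows the arithmetic in standalone NumPy, assuming the usual Euler formulation: noising adds `sigma * noise` to the latents, and the model input is divided by `sqrt(sigma**2 + 1)`. The functions are illustrative stand-ins rather than the PR's code, and the sigma value is arbitrary.

```python
import numpy as np


def add_partial_noise(latents: np.ndarray, noise: np.ndarray, sigma: float) -> np.ndarray:
    """Noise clean latents up to the noise level `sigma` (Euler-style formulation)."""
    return latents + sigma * noise


def scale_model_input(sample: np.ndarray, sigma: float) -> np.ndarray:
    """Rescale a noisy sample so the UNet sees roughly unit-variance input."""
    return sample / np.sqrt(sigma ** 2 + 1.0)


rng = np.random.default_rng(0)
latents = rng.standard_normal(200_000)  # stand-in for unit-variance encoded latents
noise = rng.standard_normal(200_000)
sigma = 1.5  # arbitrary illustrative noise level

noisy = add_partial_noise(latents, noise, sigma)
scaled = scale_model_input(noisy, sigma)

# Without scaling, the standard deviation grows to about sqrt(1 + sigma^2);
# with scaling it returns to about 1.
print(f"unscaled std: {noisy.std():.3f}, scaled std: {scaled.std():.3f}")
```

Skipping the scaling step would feed the UNet inputs whose magnitude grows with sigma, which is one plausible reason the note calls it critical for correctness.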

…2s/4K)

Adds a working approach for SDXL high-resolution generation on Neuron that bypasses the monolithic compilation blockers (host RAM overflow at 2K, instruction count limit at 4K).

Approach: Generate at 1024x1024 -> upscale -> tiled VAE encode -> partial noise (strength=0.35) -> tiled denoise refinement (18 steps) -> tiled VAE decode. Uses existing 1K compiled NEFFs with no recompilation needed.

Results:
- 2048x2048: 57.94s +/- 0.02s (10/10 seeds pass)
- 4096x4096: 142.62s +/- 0.01s (3/3 seeds pass)

Neuron beats L4 FP8+compile at both resolutions:
- 2K: 1.29x faster (57.94s vs 74.85s)
- 4K: 3.86x faster (142.62s vs 550.21s)

jimburtoft wants to merge 1 commit into xniwangaws:main from jimburtoft:sdxl-highres-img2img
