
Persistent thread pool for multi-GPU CFG splitting#13329

Merged
Kosinkadink merged 1 commit into worksplit-multigpu from worksplit-multigpu-wip
Apr 8, 2026

Conversation

@Kosinkadink
Member

@Kosinkadink Kosinkadink commented Apr 8, 2026

Summary

Replaces the per-step thread create/destroy pattern in _calc_cond_batch_multigpu with a persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device() once at startup, preserving compiled kernel caches (inductor/triton) across diffusion steps.
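The worker lifecycle described above can be sketched in pure Python. This is an illustrative sketch, not the code in comfy/multigpu.py; the torch.cuda.set_device call is stubbed out as a comment so it runs without a GPU:

```python
import queue
import threading

def gpu_worker(device, tasks, results):
    # Per the PR: the device is set once, when the thread starts, so
    # compiled-kernel state associated with this persistent thread survives
    # across diffusion steps instead of being rebuilt every step.
    # torch.cuda.set_device(device)  # stubbed: no GPU needed for this sketch
    while True:
        task = tasks.get()
        if task is None:  # sentinel tells the persistent worker to exit
            break
        fn, args = task
        results.put(fn(*args))

tasks, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=gpu_worker,
                          args=("cuda:1", tasks, results), daemon=True)
worker.start()

# Many "steps" reuse the same thread; no thread create/destroy per step.
for step in range(3):
    tasks.put((lambda s: s * s, (step,)))
step_outputs = [results.get() for _ in range(3)]

tasks.put(None)  # shut the worker down
worker.join()
```

Because a single worker drains one FIFO queue, results come back in submission order here; with one queue per device (as in the pool), ordering only holds per device.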

Motivation

For models like SD3.5 large (fp8), thread creation/destruction every diffusion step causes cold compilation caches, leading to significant GPU idle time. Persistent threads keep the CUDA context and compiled kernels warm.

Changes

  • comfy/multigpu.py: Added MultiGPUThreadPool class with one persistent worker per extra GPU, using queue.Queue for dispatch and result collection
  • comfy/samplers.py:
    • Pool created in CFGGuider.outer_sample() and stored in model_options["multigpu_thread_pool"]
    • Pool shut down in the finally block alongside cleanup
    • _calc_cond_batch_multigpu submits extra GPU work to pool, main thread handles its own device directly (parallel execution)
    • Falls back gracefully if no pool is available
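The dispatch pattern in the list above can be sketched as a minimal CPU-only pool. The class name and queue.Queue dispatch mirror the PR description, but the method names (submit, collect, shutdown) are assumptions, and the CUDA call is again stubbed out:

```python
import queue
import threading

class MultiGPUThreadPool:
    """Sketch of the persistent pool: one long-lived worker per extra GPU,
    fed via queue.Queue, with results funneled back through a shared queue."""
    _SHUTDOWN = object()

    def __init__(self, devices):
        self._tasks = {d: queue.Queue() for d in devices}
        self._results = queue.Queue()
        self._workers = []
        for d in devices:
            w = threading.Thread(target=self._worker, args=(d,), daemon=True)
            w.start()
            self._workers.append(w)

    def _worker(self, device):
        # torch.cuda.set_device(device)  # once per thread, per the PR
        q = self._tasks[device]
        while True:
            task = q.get()
            if task is self._SHUTDOWN:
                return
            fn, args = task
            try:
                self._results.put((device, fn(*args)))
            except Exception as exc:  # surface worker errors to the caller
                self._results.put((device, exc))

    def submit(self, device, fn, *args):
        self._tasks[device].put((fn, args))

    def collect(self, n):
        # Results arrive in completion order, keyed by device.
        return dict(self._results.get() for _ in range(n))

    def shutdown(self):
        for q in self._tasks.values():
            q.put(self._SHUTDOWN)
        for w in self._workers:
            w.join()
```

In the pattern the PR describes, extra-GPU batches would go through submit() while the main thread runs its own device's batch directly, then collect() gathers the extra results; shutdown() belongs in the finally block alongside the other cleanup.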

Testing

Benchmarked with NetaYumev35 (Lumina2) on 2x RTX 4090, 1024x1024, 30 steps:

| Metric | Before (thread-per-step) | After (persistent pool) | Δ     |
|--------|--------------------------|-------------------------|-------|
| it/s   | 5.55                     | 5.73                    | +3.2% |
| Time   | 5.83 s                   | 5.65 s                  | -3.1% |

5 consecutive runs with zero errors.

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

@Kosinkadink Kosinkadink force-pushed the worksplit-multigpu-wip branch 2 times, most recently from 3578a80 to c0a7e10 on April 8, 2026 at 12:26
Replace per-step thread create/destroy in _calc_cond_batch_multigpu with a
persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device()
once at startup, preserving compiled kernel caches across diffusion steps.

- Add MultiGPUThreadPool class in comfy/multigpu.py
- Create pool in CFGGuider.outer_sample(), shut down in finally block
- Main thread handles its own device batch directly for zero overhead
- Falls back to sequential execution if no pool is available

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d3f5c-28c5-72c9-abed-34681f1b54ba
@Kosinkadink Kosinkadink force-pushed the worksplit-multigpu-wip branch from c0a7e10 to 9e4749a on April 8, 2026 at 12:27
@Kosinkadink Kosinkadink merged commit 4b93c43 into worksplit-multigpu Apr 8, 2026
9 checks passed
