
Persistent thread pool for multi-GPU CFG splitting#13329

Merged
Kosinkadink merged 1 commit into worksplit-multigpu from worksplit-multigpu-wip
Apr 8, 2026

Conversation

@Kosinkadink
Member

@Kosinkadink Kosinkadink commented Apr 8, 2026

Summary

Replaces the per-step thread create/destroy pattern in _calc_cond_batch_multigpu with a persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device() once at startup, preserving compiled kernel caches (inductor/triton) across diffusion steps.
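The worker lifecycle described above can be sketched in pure Python. This is an illustrative sketch, not the code in comfy/multigpu.py; the torch.cuda.set_device call is stubbed out as a comment so it runs without a GPU:

```python
import queue
import threading

def gpu_worker(device, tasks, results):
    # Per the PR: the device is set once, when the thread starts, so
    # compiled-kernel state associated with this persistent thread survives
    # across diffusion steps instead of being rebuilt every step.
    # torch.cuda.set_device(device)  # stubbed: no GPU needed for this sketch
    while True:
        task = tasks.get()
        if task is None:  # sentinel tells the persistent worker to exit
            break
        fn, args = task
        results.put(fn(*args))

tasks, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=gpu_worker,
                          args=("cuda:1", tasks, results), daemon=True)
worker.start()

# Many "steps" reuse the same thread; no thread create/destroy per step.
for step in range(3):
    tasks.put((lambda s: s * s, (step,)))
step_outputs = [results.get() for _ in range(3)]

tasks.put(None)  # shut the worker down
worker.join()
```

Because a single worker drains one FIFO queue, results come back in submission order here; with one queue per device (as in the pool), ordering only holds per device.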

Motivation

For models like SD3.5 large (fp8), thread creation/destruction every diffusion step causes cold compilation caches, leading to significant GPU idle time. Persistent threads keep the CUDA context and compiled kernels warm.

Changes

  • comfy/multigpu.py: Added MultiGPUThreadPool class with one persistent worker per extra GPU, using queue.Queue for dispatch and result collection
  • comfy/samplers.py:
    • Pool created in CFGGuider.outer_sample() and stored in model_options["multigpu_thread_pool"]
    • Pool shut down in the finally block alongside cleanup
    • _calc_cond_batch_multigpu submits extra GPU work to pool, main thread handles its own device directly (parallel execution)
    • Falls back gracefully if no pool is available
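The dispatch pattern in the list above can be sketched as a minimal CPU-only pool. The class name and queue.Queue dispatch mirror the PR description, but the method names (submit, collect, shutdown) are assumptions, and the CUDA call is again stubbed out:

```python
import queue
import threading

class MultiGPUThreadPool:
    """Sketch of the persistent pool: one long-lived worker per extra GPU,
    fed via queue.Queue, with results funneled back through a shared queue."""
    _SHUTDOWN = object()

    def __init__(self, devices):
        self._tasks = {d: queue.Queue() for d in devices}
        self._results = queue.Queue()
        self._workers = []
        for d in devices:
            w = threading.Thread(target=self._worker, args=(d,), daemon=True)
            w.start()
            self._workers.append(w)

    def _worker(self, device):
        # torch.cuda.set_device(device)  # once per thread, per the PR
        q = self._tasks[device]
        while True:
            task = q.get()
            if task is self._SHUTDOWN:
                return
            fn, args = task
            try:
                self._results.put((device, fn(*args)))
            except Exception as exc:  # surface worker errors to the caller
                self._results.put((device, exc))

    def submit(self, device, fn, *args):
        self._tasks[device].put((fn, args))

    def collect(self, n):
        # Results arrive in completion order, keyed by device.
        return dict(self._results.get() for _ in range(n))

    def shutdown(self):
        for q in self._tasks.values():
            q.put(self._SHUTDOWN)
        for w in self._workers:
            w.join()
```

In the pattern the PR describes, extra-GPU batches would go through submit() while the main thread runs its own device's batch directly, then collect() gathers the extra results; shutdown() belongs in the finally block alongside the other cleanup.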

Testing

Benchmarked with NetaYumev35 (Lumina2) on 2x RTX 4090, 1024x1024, 30 steps:

| Metric | Before (thread-per-step) | After (persistent pool) | Δ     |
|--------|--------------------------|-------------------------|-------|
| it/s   | 5.55                     | 5.73                    | +3.2% |
| Time   | 5.83 s                   | 5.65 s                  | -3.1% |

5 consecutive runs with zero errors.

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

@Kosinkadink Kosinkadink force-pushed the worksplit-multigpu-wip branch 2 times, most recently from 3578a80 to c0a7e10 on April 8, 2026 at 12:26
Replace per-step thread create/destroy in _calc_cond_batch_multigpu with a
persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device()
once at startup, preserving compiled kernel caches across diffusion steps.

- Add MultiGPUThreadPool class in comfy/multigpu.py
- Create pool in CFGGuider.outer_sample(), shut down in finally block
- Main thread handles its own device batch directly for zero overhead
- Falls back to sequential execution if no pool is available

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d3f5c-28c5-72c9-abed-34681f1b54ba
@Kosinkadink Kosinkadink force-pushed the worksplit-multigpu-wip branch from c0a7e10 to 9e4749a on April 8, 2026 at 12:27
@Kosinkadink Kosinkadink merged commit 4b93c43 into worksplit-multigpu Apr 8, 2026
9 checks passed
