Persistent thread pool for multi-GPU CFG splitting #13329
Merged
Kosinkadink merged 1 commit into worksplit-multigpu on Apr 8, 2026
Conversation
Force-pushed from 3578a80 to c0a7e10
Replace per-step thread create/destroy in _calc_cond_batch_multigpu with a persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device() once at startup, preserving compiled kernel caches across diffusion steps.

- Add MultiGPUThreadPool class in comfy/multigpu.py
- Create pool in CFGGuider.outer_sample(), shut down in finally block
- Main thread handles its own device batch directly for zero overhead
- Falls back to sequential execution if no pool is available

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d3f5c-28c5-72c9-abed-34681f1b54ba
Force-pushed from c0a7e10 to 9e4749a
Summary
Replaces the per-step thread create/destroy pattern in _calc_cond_batch_multigpu with a persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device() once at startup, preserving compiled kernel caches (inductor/triton) across diffusion steps.

Motivation
For models like SD3.5 large (fp8), creating and destroying threads on every diffusion step leaves the compilation caches cold, causing significant GPU idle time. Persistent threads keep the CUDA context and compiled kernels warm.
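The persistent-worker pattern from the summary can be sketched as follows. This is a hypothetical reconstruction of the idea, not the actual ComfyUI implementation: the class name and method names mirror the PR text, but the body is an assumption, and the real pool would call torch.cuda.set_device() where the comment indicates.

```python
import queue
import threading

class MultiGPUThreadPool:
    """Sketch: one persistent worker thread per extra device."""

    def __init__(self, device_ids):
        self.task_queues = {}
        self.result_queue = queue.Queue()
        self.threads = []
        for dev in device_ids:
            q = queue.Queue()
            self.task_queues[dev] = q
            t = threading.Thread(target=self._worker, args=(dev, q), daemon=True)
            t.start()
            self.threads.append(t)

    def _worker(self, dev, q):
        # The real pool would call torch.cuda.set_device(dev) here, once per
        # thread, so compiled-kernel caches stay warm across diffusion steps.
        while True:
            task = q.get()
            if task is None:  # shutdown sentinel
                return
            fn, args = task
            try:
                self.result_queue.put((dev, fn(*args)))
            except Exception as e:
                self.result_queue.put((dev, e))

    def submit(self, dev, fn, *args):
        self.task_queues[dev].put((fn, args))

    def collect(self, n):
        # Block until n results arrive; order across devices is not guaranteed.
        return [self.result_queue.get() for _ in range(n)]

    def shutdown(self):
        for q in self.task_queues.values():
            q.put(None)
        for t in self.threads:
            t.join()
```

Because the threads outlive a single step, anything thread-local (the device binding, compiled-kernel caches) survives from one dispatch to the next, which is the whole point of the change.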
Changes
- comfy/multigpu.py: Added MultiGPUThreadPool class with one persistent worker per extra GPU, using queue.Queue for dispatch and result collection
- comfy/samplers.py: The pool is created in CFGGuider.outer_sample() and stored in model_options["multigpu_thread_pool"]; it is shut down in the finally block alongside existing cleanup. _calc_cond_batch_multigpu submits extra-GPU work to the pool while the main thread handles its own device batch directly (parallel execution)

Testing
Benchmarked with NetaYumev35 (Lumina2) on 2x RTX 4090 at 1024x1024, 30 steps: 5 consecutive runs with zero errors.
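The create-once, dispatch-per-step, shut-down-in-finally lifecycle described in the changes can be approximated with the standard library alone. The sketch below is an assumption about the shape of that lifecycle (function names and the "cuda:0"/"cuda:1" labels are hypothetical); it uses one single-worker ThreadPoolExecutor per extra device so each worker thread can bind its device once in an initializer, analogous to calling torch.cuda.set_device() at worker startup.

```python
from concurrent.futures import ThreadPoolExecutor

def bind_device(dev):
    # Placeholder: real code would call torch.cuda.set_device(dev) here,
    # exactly once per worker thread.
    pass

def run_steps(extra_devices, steps, work_fn, main_batch, extra_batches):
    # One persistent single-worker pool per extra device, created once per run.
    pools = {
        dev: ThreadPoolExecutor(max_workers=1, initializer=bind_device, initargs=(dev,))
        for dev in extra_devices
    }
    outputs = []
    try:
        for step in range(steps):
            # Dispatch extra-GPU work, then let the main thread process its
            # own device batch in parallel rather than sitting idle.
            futures = [pools[dev].submit(work_fn, dev, batch, step)
                       for dev, batch in extra_batches.items()]
            main = work_fn("cuda:0", main_batch, step)
            outputs.append([main] + [f.result() for f in futures])
    finally:
        # Mirrors the finally-block shutdown described in the PR.
        for p in pools.values():
            p.shutdown(wait=True)
    return outputs
```

The initializer/initargs hook (available since Python 3.7) is what makes the workers persistent and device-bound, in contrast to spawning fresh threads each step.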