Conversation
📝 Walkthrough

Adds full CogVideoX support across the codebase: a new latent format, a Comfy-native 3D transformer (CogVideoXTransformer3DModel) with patch/positional/rotary utilities, two CogVideoX-style VAEs (including a streaming/causal implementation), a V_PREDICTION_DDPM sampling class, checkpoint-detection logic for CogVideoX VAEs/UNets, a new ModelType and CogVideoX BaseModel wrapper, two supported model entries (CogVideoX_T2V and CogVideoX_I2V), and a minor extras import formatting tweak.

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae.py (2)
166-193: Use `ops.Conv2d` for consistency with ComfyUI's dtype handling.

`Downsample3D` uses `nn.Conv2d` directly (line 170) while the rest of the module uses `ops.Conv3d` from `comfy.ops.disable_weight_init`. This bypasses ComfyUI's weight casting mechanism, which could cause dtype mismatches when running in mixed precision modes (e.g., fp16 input with fp32 weights).

Suggested fix

```diff
 class Downsample3D(nn.Module):
     """3D downsampling with optional temporal compression."""

     def __init__(self, in_channels, out_channels, kernel_size=3, stride=2, padding=0, compress_time=False):
         super().__init__()
-        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
+        self.conv = ops.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
         self.compress_time = compress_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/cogvideo/vae.py` around lines 166 - 193, Downsample3D currently constructs self.conv with nn.Conv2d which bypasses ComfyUI's weight casting; replace that instantiation in Downsample3D.__init__ with ops.Conv2d (from comfy.ops.disable_weight_init) using the same arguments (in_channels, out_channels, kernel_size, stride, padding) so weights participate in ComfyUI dtype handling; ensure the module imports or references comfy.ops.disable_weight_init.ops (or the same ops alias used elsewhere in the file) and keep the attribute name self.conv unchanged.
196-226: Same issue: use `ops.Conv2d` for consistency.

`Upsample3D` also uses `nn.Conv2d` directly (line 200). Apply the same fix for dtype handling consistency.

Suggested fix

```diff
 class Upsample3D(nn.Module):
     """3D upsampling with optional temporal decompression."""

     def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, compress_time=False):
         super().__init__()
-        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
+        self.conv = ops.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
         self.compress_time = compress_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/cogvideo/vae.py` around lines 196 - 226, The Upsample3D class currently constructs self.conv with nn.Conv2d which bypasses the project's dtype-aware conv wrapper; replace that instantiation with ops.Conv2d (i.e., change the constructor call in Upsample3D.__init__ from nn.Conv2d(...) to ops.Conv2d(...)) so the module uses the same dtype/dispatch behavior as other layers; ensure ops is imported/available in the module and no other call sites in Upsample3D (e.g., references to self.conv in forward) need changes.
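The dtype concern behind both nitpicks is easiest to see with a minimal sketch of the casting pattern that wrappers like `comfy.ops` implement. The class below is illustrative only (it is not ComfyUI's actual `ops.Conv2d`): the stored weights are cast to the runtime input's dtype at forward time, so a mixed-precision input does not collide with fp32-stored weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CastConv2d(nn.Conv2d):
    """Illustrative stand-in for a dtype-casting conv wrapper (not ComfyUI's ops.Conv2d)."""

    def forward(self, x):
        # Cast stored parameters to the runtime input dtype before convolving.
        weight = self.weight.to(dtype=x.dtype, device=x.device)
        bias = self.bias.to(dtype=x.dtype, device=x.device) if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding, self.dilation, self.groups)

conv = CastConv2d(3, 8, kernel_size=3, padding=1)          # weights stored in fp32
x = torch.randn(1, 3, 16, 16, dtype=torch.bfloat16)        # mixed-precision input
y = conv(x)  # a plain nn.Conv2d would raise a dtype-mismatch error here
```

This is why swapping `nn.Conv2d` for the project's `ops.Conv2d` matters even though the constructor arguments are identical: the difference is entirely in forward-time weight handling.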
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_cogvideox.py`:
- Around line 98-105: The code treats any ref_guidance_scale != 1.0 as a boolean
and always uses zeros for the negative conditioning; change this so the negative
(and optionally positive) conditioning actually reflects ref_guidance_scale
instead of always zeroing: in the block that checks ref_guidance_scale and
ref_frames, keep positive = node_helpers.conditioning_set_values(positive,
{"concat_latent_image": ref_latent}) but set negative =
node_helpers.conditioning_set_values(negative, {"concat_latent_image":
ref_latent * (1.0 - float(ref_guidance_scale))}) (or use a clamped interpolation
between torch.zeros_like(ref_latent) and ref_latent based on ref_guidance_scale)
so varying ref_guidance_scale produces proportional conditioning; reference
symbols: ref_guidance_scale, ref_frames, ref_latent, positive, negative,
node_helpers.conditioning_set_values.
- Around line 52-56: The created latent uses UI batch_size (variable latent) but
the function later returns lq_latent produced by vae.encode (symbols: latent,
lq_latent, vae.encode, batch_size, video), so UI batch_size is ignored; fix by
ensuring lq_latent matches batch_size before returning—either repeat/expand
lq_latent along dim 0 to batch_size (or repeat the input video before calling
vae.encode) when lq_latent.shape[0] != batch_size so downstream receives a
tensor with the requested batch size.
In `@comfy/ldm/cogvideo/model.py`:
- Around line 131-156: forward() currently rebuilds sin/cos positional
embeddings instead of using the stored buffer, so learned positional checkpoints
in self.pos_embedding are ignored; update forward() to prefer reading the
registered buffer self.pos_embedding when use_learned_positional_embeddings (or
when self.pos_embedding exists) and only fall back to calling
_get_positional_embeddings when the buffer is missing or shape-mismatched.
Ensure the buffer registered in register_buffer("pos_embedding", ...) (and the
joint_pos_embedding layout with self.max_text_seq_length) is used unchanged (or
interpolated if spatial/temporal scales differ) so learned positions persist
across loads; also apply the same fix to the analogous code path referenced at
lines 181-198 that rebuilds sin/cos embeddings.
- Line 259: The forward method uses a mutable default parameter
transformer_options={} which can leak state; change the signature of forward to
use transformer_options=None and inside the method set transformer_options = {}
if it's None (or create a shallow copy when provided) so each call gets its own
dict; update any type hints (e.g., Optional[dict]) and ensure subsequent code
(and the patcher_extension that calls .setdefault()) operates on this per-call
dict rather than a shared default.
In `@comfy/model_base.py`:
- Around line 1975-1995: concat_cond currently only matches H/W and batch but
drops any temporal dimension and doesn't ensure the returned concat tensor has
exactly extra_channels, which breaks video latents and causes channel-mismatch
in _apply_model(); fix by preserving/aligning a time axis (if present) the same
way BaseModel.concat_cond does (trim or pad the temporal dimension to match
noise), apply utils.common_upscale and process_latent_in per-latent-chunk as you
already do, then normalize the channel count to extra_channels by trimming or
repeating/tiling processed latent chunks (using latent_format.latent_channels
and process_latent_in) so the final tensor shape matches (batch, extra_channels,
[time,] H, W), ensure dtype/layout/device match noise, and keep using
utils.resize_to_batch_size to match batch size before returning.
In `@comfy/supported_models.py`:
- Around line 1797-1799: CogVideoX_I2V.get_model currently constructs
model_base.CogVideoX directly and skips the patch_size_t handling implemented in
CogVideoX_T2V.get_model, causing 1.5 checkpoints to use the transformer's
fallback geometry; copy the same logic from CogVideoX_T2V.get_model into
CogVideoX_I2V.get_model so that you detect 1.5-style checkpoints and set the
patch_size_t / sample_height/sample_width/sample_frames defaults (the 96/170/81
geometry) before instantiating model_base.CogVideoX (or pass the computed
patch_size_t into model_base.CogVideoX), ensuring both I2V and T2V paths create
the same positional geometry.
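The mutable-default issue flagged above for `forward(transformer_options={})` is a general Python hazard: the default dict is created once at function definition and shared by every call that omits the argument. A minimal sketch of the bug and of the `None` idiom the review asks for (the function names here are illustrative, not the PR's actual signatures):

```python
def forward_buggy(x, transformer_options={}):
    # The same dict object is reused on every call that omits the argument.
    transformer_options.setdefault("calls", 0)
    transformer_options["calls"] += 1
    return transformer_options["calls"]

def forward_fixed(x, transformer_options=None):
    # Each call gets its own dict; a caller-supplied dict is shallow-copied
    # so per-call mutations (e.g. via .setdefault()) don't leak back out.
    transformer_options = {} if transformer_options is None else dict(transformer_options)
    transformer_options.setdefault("calls", 0)
    transformer_options["calls"] += 1
    return transformer_options["calls"]

assert forward_buggy(1) == 1
assert forward_buggy(1) == 2   # state leaked across calls
assert forward_fixed(1) == 1
assert forward_fixed(1) == 1   # fresh dict every call
```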
---
Nitpick comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 166-193: Downsample3D currently constructs self.conv with
nn.Conv2d which bypasses ComfyUI's weight casting; replace that instantiation in
Downsample3D.__init__ with ops.Conv2d (from comfy.ops.disable_weight_init) using
the same arguments (in_channels, out_channels, kernel_size, stride, padding) so
weights participate in ComfyUI dtype handling; ensure the module imports or
references comfy.ops.disable_weight_init.ops (or the same ops alias used
elsewhere in the file) and keep the attribute name self.conv unchanged.
- Around line 196-226: The Upsample3D class currently constructs self.conv with
nn.Conv2d which bypasses the project's dtype-aware conv wrapper; replace that
instantiation with ops.Conv2d (i.e., change the constructor call in
Upsample3D.__init__ from nn.Conv2d(...) to ops.Conv2d(...)) so the module uses
the same dtype/dispatch behavior as other layers; ensure ops is
imported/available in the module and no other call sites in Upsample3D (e.g.,
references to self.conv in forward) need changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: faae2daf-0395-4cd1-a6bd-bb690298ec44
📒 Files selected for processing (13)
- comfy/latent_formats.py
- comfy/ldm/cogvideo/__init__.py
- comfy/ldm/cogvideo/model.py
- comfy/ldm/cogvideo/vae.py
- comfy/ldm/cogvideo/vae_backup.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/model_sampling.py
- comfy/sd.py
- comfy/supported_models.py
- comfy_extras/nodes_cogvideox.py
- convert_sparkvsr_to_comfy.py
- nodes.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/model_base.py`:
- Around line 2018-2023: The caller-provided ofs from kwargs may be a scalar
(int/float) and is passed unchanged to comfy.conds.CONDRegular, causing later
tensor operations to fail; in the branch inside the if
self.diffusion_model.ofs_proj_dim is not None block, coerce/convert
kwargs.get("ofs") into a torch tensor matching the batch size/device/dtype of
noise (like using torch.full or torch.tensor with device=noise.device and
dtype=noise.dtype and shape (noise.shape[0],)), before assigning out['ofs'] =
comfy.conds.CONDRegular(ofs); ensure you still handle None by falling back to
the existing default creation path and preserve dtype/device from noise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7611f40f-e37d-45d2-96e5-7141a7870101
📒 Files selected for processing (5)
- comfy/ldm/cogvideo/model.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/sd.py
- comfy/supported_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- comfy/model_detection.py
```python
if self.diffusion_model.ofs_proj_dim is not None:
    ofs = kwargs.get("ofs", None)
    if ofs is None:
        noise = kwargs.get("noise", None)
        ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
    out['ofs'] = comfy.conds.CONDRegular(ofs)
```
Convert caller-provided ofs values to a tensor before wrapping them.
The default path creates a tensor, but Line 2019 accepts raw kwargs unchanged. If a workflow passes ofs as a Python float/int, CONDRegular later treats it like a tensor during batching/concat and that override path breaks.
Proposed fix

```diff
 if self.diffusion_model.ofs_proj_dim is not None:
+    noise = kwargs.get("noise", None)
     ofs = kwargs.get("ofs", None)
     if ofs is None:
-        noise = kwargs.get("noise", None)
         ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
+    else:
+        ofs = torch.as_tensor(ofs, device=noise.device, dtype=noise.dtype).reshape(-1)
     out['ofs'] = comfy.conds.CONDRegular(ofs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_base.py` around lines 2018 - 2023, The caller-provided ofs from
kwargs may be a scalar (int/float) and is passed unchanged to
comfy.conds.CONDRegular, causing later tensor operations to fail; in the branch
inside the if self.diffusion_model.ofs_proj_dim is not None block, coerce/of
convert kwargs.get("ofs") into a torch tensor matching the batch
size/device/dtype of noise (like using torch.full or torch.tensor with
device=noise.device and dtype=noise.dtype and shape (noise.shape[0],)), before
assigning out['ofs'] = comfy.conds.CONDRegular(ofs); ensure you still handle
None by falling back to the existing default creation path and preserve
dtype/device from noise.
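The coercion this review asks for can be sketched as a small standalone helper. The function name `coerce_ofs` is hypothetical (the real fix lives inline in `extra_conds`); the point is that `None`, a Python scalar, and an existing tensor all normalize to a 1-D tensor aligned with `noise`'s batch size, device, and dtype:

```python
import torch

def coerce_ofs(ofs, noise, default=2.0):
    """Hypothetical helper: normalize `ofs` to a 1-D tensor aligned with `noise`."""
    if ofs is None:
        # Default path: one value per batch element, on noise's device/dtype.
        return torch.full((noise.shape[0],), default, device=noise.device, dtype=noise.dtype)
    # Accept Python scalars, lists, or tensors uniformly.
    ofs = torch.as_tensor(ofs, device=noise.device, dtype=noise.dtype).reshape(-1)
    if ofs.numel() == 1 and noise.shape[0] > 1:
        ofs = ofs.expand(noise.shape[0]).clone()  # broadcast a scalar to the batch
    return ofs

noise = torch.zeros(4, 16, 8, 8)
assert coerce_ofs(None, noise).shape == (4,)            # default path
assert coerce_ofs(2.0, noise).shape == (4,)             # scalar from a workflow
assert coerce_ofs(torch.ones(4), noise).shape == (4,)   # already batched
```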
```diff
@@ -0,0 +1,485 @@
+# CogVideoX VAE - ported to ComfyUI native ops
```
@kijai I assume this backup file can also just be removed?
Actionable comments posted: 3
♻️ Duplicate comments (1)
comfy/model_base.py (1)
2018-2023: ⚠️ Potential issue | 🟠 Major

Coerce caller-provided `ofs` to a tensor before wrapping it.

The default branch builds a tensor, but a user-supplied `ofs` can still be a Python `int`/`float` here. That value then reaches `CogVideoXTransformer3DModel.forward()` unchanged, where the OFS path feeds it into timestep embedding and can fail at runtime. Normalizing it to a tensor on `noise.device`/`noise.dtype` keeps the new conditioning input compatible with scalar workflow values.

🛠️ Proposed fix

```diff
 def extra_conds(self, **kwargs):
     out = super().extra_conds(**kwargs)
     # OFS embedding (CogVideoX 1.5 I2V), default 2.0 as used by SparkVSR
     if self.diffusion_model.ofs_proj_dim is not None:
+        noise = kwargs.get("noise", None)
         ofs = kwargs.get("ofs", None)
         if ofs is None:
-            noise = kwargs.get("noise", None)
             ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
+        else:
+            ofs = torch.as_tensor(ofs, device=noise.device, dtype=noise.dtype).reshape(-1)
         out['ofs'] = comfy.conds.CONDRegular(ofs)
     return out
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/model_base.py` around lines 2018 - 2023, The caller-provided ofs may be a Python scalar and needs coercion to a tensor before wrapping in comfy.conds.CONDRegular to avoid runtime failures in CogVideoXTransformer3DModel.forward(); update the block that reads ofs (when self.diffusion_model.ofs_proj_dim is not None) so that if kwargs.get("ofs") is not None you convert it to a torch tensor on noise.device and noise.dtype (use torch.as_tensor or torch.tensor), and if it is a scalar expand it to shape (noise.shape[0],) to match the default branch, then assign out['ofs'] = comfy.conds.CONDRegular(ofs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 453-467: The encode method accepts arbitrary frame counts which
can break the 1+4k temporal stride contract; before any batching logic in encode
(function comfy/ldm/cogvideo/vae.py::encode), clamp/truncate the input tensor x
along the temporal dimension to the largest prefix that satisfies valid_t = 1 +
4*k where k = (t-1)//4 (ensure at least 1 frame), e.g. compute t = x.shape[2],
valid_t = 1 + 4*((t-1)//4) and if valid_t < t replace x with x[:, :, :valid_t]
so downstream chunking/decoding round-trips cleanly; keep the rest of the encode
flow (frame batching, conv_cache, enc accumulation) unchanged.
In `@comfy/model_sampling.py`:
- Around line 62-76: calculate_input currently returns raw noise but must
construct DDPM x_t in normalized space; change calculate_input to return
latent_image + sigma * noise (using reshape_sigma as needed). Fix noise_scaling
so it only scales the noise by sigma (do not use sqrt(1+sigma**2)) and only add
latent_image when producing x_t; remove the max_denoise branch applying
torch.sqrt(1.0 + sigma**2). Leave calculate_denoised as-is (it expects
model_input to be DDPM x_t) and ensure all references use the reshaped sigma
variable.
In `@nodes.py`:
- Around line 2460-2461: init_builtin_extra_nodes() currently lists
"nodes_cogvideox.py" which doesn't exist and causes CI/startup failures; either
remove the missing entry from the built-in extra nodes list (so only keep
existing entries like "nodes_rtdetr.py") or add the actual
comfy_extras/nodes_cogvideox.py module with its comfy_entrypoint()/extension
registration in the same PR; update the list referenced by
init_builtin_extra_nodes() to only include files that are present or ensure the
new module is added before registering it.
---
Duplicate comments:
In `@comfy/model_base.py`:
- Around line 2018-2023: The caller-provided ofs may be a Python scalar and
needs coercion to a tensor before wrapping in comfy.conds.CONDRegular to avoid
runtime failures in CogVideoXTransformer3DModel.forward(); update the block that
reads ofs (when self.diffusion_model.ofs_proj_dim is not None) so that if
kwargs.get("ofs") is not None you convert it to a torch tensor on noise.device
and noise.dtype (use torch.as_tensor or torch.tensor), and if it is a scalar
expand it to shape (noise.shape[0],) to match the default branch, then assign
out['ofs'] = comfy.conds.CONDRegular(ofs).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a55b5298-f1ec-4fc9-af41-c44255925312
📒 Files selected for processing (11)
- comfy/latent_formats.py
- comfy/ldm/cogvideo/__init__.py
- comfy/ldm/cogvideo/model.py
- comfy/ldm/cogvideo/vae.py
- comfy/ldm/cogvideo/vae_backup.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/model_sampling.py
- comfy/sd.py
- comfy/supported_models.py
- nodes.py
✅ Files skipped from review due to trivial changes (1)
- comfy/latent_formats.py
🚧 Files skipped from review as they are similar to previous changes (1)
- comfy/sd.py
```python
def encode(self, x):
    t = x.shape[2]
    frame_batch = self.num_sample_frames_batch_size
    num_batches = max(t // frame_batch, 1)
    conv_cache = None
    enc = []
    for i in range(num_batches):
        remaining = t % frame_batch
        start = frame_batch * i + (0 if i == 0 else remaining)
        end = frame_batch * (i + 1) + remaining
        chunk, conv_cache = self.encoder(x[:, :, start:end], conv_cache=conv_cache)
        enc.append(chunk.to(x.device))
    enc = torch.cat(enc, dim=2)
    mean, _ = enc.chunk(2, dim=1)
    return mean
```
Clamp input videos to the VAE’s valid temporal stride before encoding.
The active CogVideoX wrapper in comfy/sd.py assumes a 1 + 4k temporal contract (downscale_ratio=(a+3)//4, upscale_ratio=a*4-3), but encode() currently accepts arbitrary frame counts. That means non-conformant clips can produce latent lengths that do not round-trip cleanly through decode/tiled paths.
Suggested fix

```diff
 def encode(self, x):
+    valid_frames = max(
+        1,
+        1 + ((x.shape[2] - 1) // self.temporal_compression_ratio) * self.temporal_compression_ratio,
+    )
+    x = x[:, :, :valid_frames]
     t = x.shape[2]
     frame_batch = self.num_sample_frames_batch_size
```

Based on learnings: In comfy/ldm/lightricks/vae/causal_video_autoencoder.py, replacing hard failures with silent truncation to the valid temporal-stride prefix was intentional and strictly better than exception-raising.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/ldm/cogvideo/vae.py` around lines 453 - 467, The encode method accepts
arbitrary frame counts which can break the 1+4k temporal stride contract; before
any batching logic in encode (function comfy/ldm/cogvideo/vae.py::encode),
clamp/truncate the input tensor x along the temporal dimension to the largest
prefix that satisfies valid_t = 1 + 4*k where k = (t-1)//4 (ensure at least 1
frame), e.g. compute t = x.shape[2], valid_t = 1 + 4*((t-1)//4) and if valid_t <
t replace x with x[:, :, :valid_t] so downstream chunking/decoding round-trips
cleanly; keep the rest of the encode flow (frame batching, conv_cache, enc
accumulation) unchanged.
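The `1 + 4k` contract is simple enough to check in isolation. A sketch of the proposed truncation, generalized over the temporal compression ratio (which is 4 for CogVideoX); the helper name is illustrative:

```python
def clamp_temporal(num_frames, temporal_ratio=4):
    """Largest prefix length of the form 1 + ratio*k (k >= 0); always keeps >= 1 frame."""
    return max(1, 1 + ((num_frames - 1) // temporal_ratio) * temporal_ratio)

# Conformant lengths pass through unchanged; anything else truncates down.
assert clamp_temporal(1) == 1
assert clamp_temporal(9) == 9    # 1 + 4*2
assert clamp_temporal(12) == 9   # 10, 11, 12 all truncate to 9
assert clamp_temporal(49) == 49  # 1 + 4*12
```

Applying this to `x.shape[2]` before the frame-batching loop guarantees the encoded latent round-trips through `decode` without off-by-one temporal drift.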
```python
def calculate_input(self, sigma, noise):
    return noise

def calculate_denoised(self, sigma, model_output, model_input):
    sigma = reshape_sigma(sigma, model_output.ndim)
    return model_input / (sigma ** 2 + 1.0) ** 0.5 - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5

def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
    sigma = reshape_sigma(sigma, noise.ndim)
    if max_denoise:
        noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
    else:
        noise = noise * sigma
    noise += latent_image
    return noise
```
Construct DDPM x_t in normalized space, not K-diffusion space.
calculate_input()/calculate_denoised() assume model_input is raw DDPM x_t, but noise_scaling() still returns latent_image + sigma * noise. That injects an extra sqrt(1 + sigma^2) factor, so CogVideoX gets the wrong input distribution and high-sigma steps are badly mis-scaled.
Suggested fix

```diff
 def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
     sigma = reshape_sigma(sigma, noise.ndim)
-    if max_denoise:
-        noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
-    else:
-        noise = noise * sigma
-    noise += latent_image
-    return noise
+    return (latent_image + noise * sigma) / torch.sqrt(1.0 + sigma ** 2.0)
```

Because CogVideoX defaults to ModelType.V_PREDICTION_DDPM, this one propagates straight into the new model path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_sampling.py` around lines 62 - 76, calculate_input currently
returns raw noise but must construct DDPM x_t in normalized space; change
calculate_input to return latent_image + sigma * noise (using reshape_sigma as
needed). Fix noise_scaling so it only scales the noise by sigma (do not use
sqrt(1+sigma**2)) and only add latent_image when producing x_t; remove the
max_denoise branch applying torch.sqrt(1.0 + sigma**2). Leave calculate_denoised
as-is (it expects model_input to be DDPM x_t) and ensure all references use the
reshaped sigma variable.
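The sqrt(1 + sigma^2) factor comes from the standard change of variables between the k-diffusion and DDPM parameterizations: with alpha_bar related to sigma by sigma^2 = (1 - alpha_bar) / alpha_bar, dividing the k-diffusion sample latent + sigma * noise by sqrt(1 + sigma^2) recovers the DDPM forward-process sample x_t = sqrt(alpha_bar) * latent + sqrt(1 - alpha_bar) * noise. A quick numerical check of that identity:

```python
import math

def kdiff_sample(latent, noise, sigma):
    # k-diffusion / EDM-style noised sample
    return latent + sigma * noise

def ddpm_xt(latent, noise, alpha_bar):
    # DDPM forward-process sample
    return math.sqrt(alpha_bar) * latent + math.sqrt(1.0 - alpha_bar) * noise

for alpha_bar in (0.05, 0.3, 0.9):
    sigma = math.sqrt((1.0 - alpha_bar) / alpha_bar)
    latent, noise = 0.7, -1.2
    normalized = kdiff_sample(latent, noise, sigma) / math.sqrt(1.0 + sigma ** 2)
    assert abs(normalized - ddpm_xt(latent, noise, alpha_bar)) < 1e-12
```

This is the identity the suggested `noise_scaling` fix exploits: normalizing once at sample construction keeps `calculate_denoised`, which already divides by sqrt(sigma^2 + 1), consistent with the input the model actually sees.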
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/cogvideo/model.py`:
- Around line 531-542: The temporal RoPE currently always uses torch.arange(t)
causing sample_frames to be ignored in 1.5 "slice" mode; update the branch where
self.patch_size_t is not None to mirror the spatial logic: read the trained base
temporal length from self.patch_embed (e.g. self.patch_embed.sample_frames or
the appropriate attribute), compute max_t = max(base_t, t), build grid_t =
torch.arange(max_t, device=device, dtype=torch.float32) and then slice grid_t =
grid_t[:t] (or otherwise index down to runtime t), and apply the same change in
the second occurrence mentioned around the 556-570 block so temporal positions
follow the trained base timeline before slicing.
In `@comfy/model_detection.py`:
- Around line 531-537: The current logic treats a missing pos_embedding as
evidence to enable rotary embeddings; instead implement a three-way decision: if
pos_key ('{}patch_embed.pos_embedding'.format(key_prefix)) is present set
dit_config["use_learned_positional_embeddings"]=True and use_rotary=False; else
detect a temporal-patching / 1.5-style layout by scanning state_dict_keys for
known temporal/1.5 markers (e.g. any key under the same patch_embed prefix
containing 'temporal', 'time', or 'temporal_patch' or other
CogVideoXPatchEmbed-specific temporal names) and in that case set
use_rotary_positional_embeddings=True and
use_learned_positional_embeddings=False; otherwise set both
use_rotary_positional_embeddings=False and
use_learned_positional_embeddings=False. Reference pos_key, state_dict_keys,
dit_config and CogVideoXPatchEmbed when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2a1f5479-f3e5-409d-bda1-70b3eab086b7
📒 Files selected for processing (6)
- comfy/ldm/cogvideo/model.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/sd.py
- comfy/supported_models.py
- nodes.py
✅ Files skipped from review due to trivial changes (1)
- nodes.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy/sd.py
- comfy/model_base.py
```python
if self.patch_size_t is not None:
    # CogVideoX 1.5: "slice" mode — positions are simple integer indices
    # Compute at max(sample_size, actual_size) then slice to actual
    base_h = self.patch_embed.sample_height // self.patch_size
    base_w = self.patch_embed.sample_width // self.patch_size
    max_h = max(base_h, h)
    max_w = max(base_w, w)

    grid_h = torch.arange(max_h, device=device, dtype=torch.float32)
    grid_w = torch.arange(max_w, device=device, dtype=torch.float32)
    grid_t = torch.arange(t, device=device, dtype=torch.float32)
else:
```
sample_frames is ignored in 1.5 rotary mode.
Lines 531-542 preserve the trained base grid for height/width, but temporal positions always come from torch.arange(t). That means the sample_frames=81 defaults set for 1.5 checkpoints in comfy/supported_models.py never affect temporal RoPE, so shorter clips get different temporal indexing than the model was trained with.
Please make the temporal axis follow the same slice-mode rule as H/W: build the temporal grid from the configured 1.5 base timeline, then slice down to the runtime clip length. Right now only the spatial axes keep the trained base geometry.
Also applies to: 556-570
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/ldm/cogvideo/model.py` around lines 531 - 542, The temporal RoPE
currently always uses torch.arange(t) causing sample_frames to be ignored in 1.5
"slice" mode; update the branch where self.patch_size_t is not None to mirror
the spatial logic: read the trained base temporal length from self.patch_embed
(e.g. self.patch_embed.sample_frames or the appropriate attribute), compute
max_t = max(base_t, t), build grid_t = torch.arange(max_t, device=device,
dtype=torch.float32) and then slice grid_t = grid_t[:t] (or otherwise index down
to runtime t), and apply the same change in the second occurrence mentioned
around the 556-570 block so temporal positions follow the trained base timeline
before slicing.
```python
pos_key = '{}patch_embed.pos_embedding'.format(key_prefix)
if pos_key in state_dict_keys:
    dit_config["use_learned_positional_embeddings"] = True
    dit_config["use_rotary_positional_embeddings"] = False
else:
    dit_config["use_learned_positional_embeddings"] = False
    dit_config["use_rotary_positional_embeddings"] = True
```
Don’t infer RoPE from a missing patch_embed.pos_embedding key.
This fallback conflates two different cases. In comfy/ldm/cogvideo/model.py, CogVideoXPatchEmbed only persists pos_embedding when use_learned_positional_embeddings is true, so fixed sin/cos checkpoints also omit that key. With the current else, those 1.0-style checkpoints get loaded as use_rotary_positional_embeddings=True, which changes token positions at inference.
A safer split here is:
- `pos_embedding` present → learned positional embeddings
- temporal patching / 1.5-style layout detected → rotary positional embeddings
- otherwise → fixed sin/cos (`use_rotary_positional_embeddings=False`, `use_learned_positional_embeddings=False`)
Suggested fix

```diff
 pos_key = '{}patch_embed.pos_embedding'.format(key_prefix)
 if pos_key in state_dict_keys:
     dit_config["use_learned_positional_embeddings"] = True
     dit_config["use_rotary_positional_embeddings"] = False
-else:
+elif dit_config.get("patch_size_t") is not None:
     dit_config["use_learned_positional_embeddings"] = False
     dit_config["use_rotary_positional_embeddings"] = True
+else:
+    dit_config["use_learned_positional_embeddings"] = False
+    dit_config["use_rotary_positional_embeddings"] = False
```
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_detection.py` around lines 531 - 537, The current logic treats a
missing pos_embedding as evidence to enable rotary embeddings; instead implement
a three-way decision: if pos_key
('{}patch_embed.pos_embedding'.format(key_prefix)) is present set
dit_config["use_learned_positional_embeddings"]=True and use_rotary=False; else
detect a temporal-patching / 1.5-style layout by scanning state_dict_keys for
known temporal/1.5 markers (e.g. any key under the same patch_embed prefix
containing 'temporal', 'time', or 'temporal_patch' or other
CogVideoXPatchEmbed-specific temporal names) and in that case set
use_rotary_positional_embeddings=True and
use_learned_positional_embeddings=False; otherwise set both
use_rotary_positional_embeddings=False and
use_learned_positional_embeddings=False. Reference pos_key, state_dict_keys,
dit_config and CogVideoXPatchEmbed when making the change.
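The decision table in the suggested fix can be stated as a tiny pure function. This is a sketch of the detection logic only, not the actual `model_detection.py` code; the return labels are illustrative:

```python
def positional_mode(state_dict_keys, patch_size_t, key_prefix=""):
    """Three-way positional-embedding detection for CogVideoX-style checkpoints."""
    pos_key = "{}patch_embed.pos_embedding".format(key_prefix)
    if pos_key in state_dict_keys:
        return "learned"        # checkpoint persisted a learned position table
    if patch_size_t is not None:
        return "rotary"         # 1.5-style temporal patching implies RoPE
    return "fixed_sincos"       # 1.0-style: neither learned nor rotary

assert positional_mode({"patch_embed.pos_embedding"}, None) == "learned"
assert positional_mode(set(), 2) == "rotary"
assert positional_mode(set(), None) == "fixed_sincos"
```

The third branch is the one the original two-way `if`/`else` could not express, which is exactly why 1.0-style fixed sin/cos checkpoints were being misclassified as rotary.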
```python
out = model_base.CogVideoX(self, image_to_video=True, device=device)
return out


models = [LotusD, Stable_Zero123, SD15_instructpix2pix, SD15, SD20, SD21UnclipL, SD21UnclipH, SDXL_instructpix2pix, SDXLRefiner, SDXL, SSD1B, KOALA_700M, KOALA_1B, Segmind_Vega, SD_X4Upscaler, Stable_Cascade_C, Stable_Cascade_B, SV3D_u, SV3D_p, SD3, StableAudio, AuraFlow, PixArtAlpha, PixArtSigma, HunyuanDiT, HunyuanDiT1, FluxInpaint, Flux, LongCatImage, FluxSchnell, GenmoMochi, LTXV, LTXAV, HunyuanVideo15_SR_Distilled, HunyuanVideo15, HunyuanImage21Refiner, HunyuanImage21, HunyuanVideoSkyreelsI2V, HunyuanVideoI2V, HunyuanVideo, CosmosT2V, CosmosI2V, CosmosT2IPredict2, CosmosI2VPredict2, ZImagePixelSpace, ZImage, Lumina2, WAN22_T2V, WAN21_T2V, WAN21_I2V, WAN21_FunControl2V, WAN21_Vace, WAN21_Camera, WAN22_Camera, WAN22_S2V, WAN21_HuMo, WAN22_Animate, WAN21_FlowRVS, WAN21_SCAIL, Hunyuan3Dv2mini, Hunyuan3Dv2, Hunyuan3Dv2_1, HiDream, Chroma, ChromaRadiance, ACEStep, ACEStep15, Omnigen2, QwenImage, Flux2, Kandinsky5Image, Kandinsky5, Anima, RT_DETR_v4, CogVideoX_I2V, CogVideoX_T2V]
```
@comfyanonymous @Kosinkadink I think this list should be formatted with each element on a separate line; that way the diff stays clear. Right now, with every change, the whole array shows as changed.
Why do I mention this? When rebasing or merging with master, one could easily remove a model that was already added if not careful.
What do you think?
|
Closing for this one: #13402
This is a prerequisite PR for CORE-38, Netflix's VOID model.
It was written by @kijai and rebased by me. I've tested it only with the VOID model and it works well.