Conversation
📝 Walkthrough

Adds full CogVideoX support across the codebase: a new latent format, a Comfy-native 3D transformer (CogVideoXTransformer3DModel) with patch/positional/rotary utilities, two CogVideoX-style VAEs (including a streaming/causal implementation), a V_PREDICTION_DDPM sampling class, checkpoint-detection logic for CogVideoX VAEs/UNets, a new ModelType and CogVideoX BaseModel wrapper, two supported model entries (CogVideoX_T2V and CogVideoX_I2V), and a minor extras import formatting tweak.

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae.py (2)
166-193: Use `ops.Conv2d` for consistency with ComfyUI's dtype handling.

`Downsample3D` uses `nn.Conv2d` directly (line 170) while the rest of the module uses `ops.Conv3d` from `comfy.ops.disable_weight_init`. This bypasses ComfyUI's weight casting mechanism, which could cause dtype mismatches when running in mixed precision modes (e.g., fp16 input with fp32 weights).

Suggested fix

```diff
 class Downsample3D(nn.Module):
     """3D downsampling with optional temporal compression."""

     def __init__(self, in_channels, out_channels, kernel_size=3, stride=2, padding=0, compress_time=False):
         super().__init__()
-        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
+        self.conv = ops.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
         self.compress_time = compress_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/cogvideo/vae.py` around lines 166 - 193, Downsample3D currently constructs self.conv with nn.Conv2d which bypasses ComfyUI's weight casting; replace that instantiation in Downsample3D.__init__ with ops.Conv2d (from comfy.ops.disable_weight_init) using the same arguments (in_channels, out_channels, kernel_size, stride, padding) so weights participate in ComfyUI dtype handling; ensure the module imports or references comfy.ops.disable_weight_init.ops (or the same ops alias used elsewhere in the file) and keep the attribute name self.conv unchanged.
196-226: Same issue: use `ops.Conv2d` for consistency.

`Upsample3D` also uses `nn.Conv2d` directly (line 200). Apply the same fix for dtype handling consistency.

Suggested fix

```diff
 class Upsample3D(nn.Module):
     """3D upsampling with optional temporal decompression."""

     def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, compress_time=False):
         super().__init__()
-        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
+        self.conv = ops.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
         self.compress_time = compress_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/cogvideo/vae.py` around lines 196 - 226, The Upsample3D class currently constructs self.conv with nn.Conv2d which bypasses the project's dtype-aware conv wrapper; replace that instantiation with ops.Conv2d (i.e., change the constructor call in Upsample3D.__init__ from nn.Conv2d(...) to ops.Conv2d(...)) so the module uses the same dtype/dispatch behavior as other layers; ensure ops is imported/available in the module and no other call sites in Upsample3D (e.g., references to self.conv in forward) need changes.
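The dtype concern behind both nitpicks is easiest to see with a minimal sketch of the casting pattern that wrappers like `comfy.ops` implement. The class below is illustrative only (it is not ComfyUI's actual `ops.Conv2d`): the stored weights are cast to the runtime input's dtype at forward time, so a mixed-precision input does not collide with fp32-stored weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CastConv2d(nn.Conv2d):
    """Illustrative stand-in for a dtype-casting conv wrapper (not ComfyUI's ops.Conv2d)."""

    def forward(self, x):
        # Cast stored parameters to the runtime input dtype before convolving.
        weight = self.weight.to(dtype=x.dtype, device=x.device)
        bias = self.bias.to(dtype=x.dtype, device=x.device) if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding, self.dilation, self.groups)

conv = CastConv2d(3, 8, kernel_size=3, padding=1)          # weights stored in fp32
x = torch.randn(1, 3, 16, 16, dtype=torch.bfloat16)        # mixed-precision input
y = conv(x)  # a plain nn.Conv2d would raise a dtype-mismatch error here
```

This is why swapping `nn.Conv2d` for the project's `ops.Conv2d` matters even though the constructor arguments are identical: the difference is entirely in forward-time weight handling.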
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_cogvideox.py`:
- Around line 98-105: The code treats any ref_guidance_scale != 1.0 as a boolean
and always uses zeros for the negative conditioning; change this so the negative
(and optionally positive) conditioning actually reflects ref_guidance_scale
instead of always zeroing: in the block that checks ref_guidance_scale and
ref_frames, keep positive = node_helpers.conditioning_set_values(positive,
{"concat_latent_image": ref_latent}) but set negative =
node_helpers.conditioning_set_values(negative, {"concat_latent_image":
ref_latent * (1.0 - float(ref_guidance_scale))}) (or use a clamped interpolation
between torch.zeros_like(ref_latent) and ref_latent based on ref_guidance_scale)
so varying ref_guidance_scale produces proportional conditioning; reference
symbols: ref_guidance_scale, ref_frames, ref_latent, positive, negative,
node_helpers.conditioning_set_values.
- Around line 52-56: The created latent uses UI batch_size (variable latent) but
the function later returns lq_latent produced by vae.encode (symbols: latent,
lq_latent, vae.encode, batch_size, video), so UI batch_size is ignored; fix by
ensuring lq_latent matches batch_size before returning—either repeat/expand
lq_latent along dim 0 to batch_size (or repeat the input video before calling
vae.encode) when lq_latent.shape[0] != batch_size so downstream receives a
tensor with the requested batch size.
In `@comfy/ldm/cogvideo/model.py`:
- Around line 131-156: forward() currently rebuilds sin/cos positional
embeddings instead of using the stored buffer, so learned positional checkpoints
in self.pos_embedding are ignored; update forward() to prefer reading the
registered buffer self.pos_embedding when use_learned_positional_embeddings (or
when self.pos_embedding exists) and only fall back to calling
_get_positional_embeddings when the buffer is missing or shape-mismatched.
Ensure the buffer registered in register_buffer("pos_embedding", ...) (and the
joint_pos_embedding layout with self.max_text_seq_length) is used unchanged (or
interpolated if spatial/temporal scales differ) so learned positions persist
across loads; also apply the same fix to the analogous code path referenced at
lines 181-198 that rebuilds sin/cos embeddings.
- Line 259: The forward method uses a mutable default parameter
transformer_options={} which can leak state; change the signature of forward to
use transformer_options=None and inside the method set transformer_options = {}
if it's None (or create a shallow copy when provided) so each call gets its own
dict; update any type hints (e.g., Optional[dict]) and ensure subsequent code
(and the patcher_extension that calls .setdefault()) operates on this per-call
dict rather than a shared default.
In `@comfy/model_base.py`:
- Around line 1975-1995: concat_cond currently only matches H/W and batch but
drops any temporal dimension and doesn't ensure the returned concat tensor has
exactly extra_channels, which breaks video latents and causes channel-mismatch
in _apply_model(); fix by preserving/aligning a time axis (if present) the same
way BaseModel.concat_cond does (trim or pad the temporal dimension to match
noise), apply utils.common_upscale and process_latent_in per-latent-chunk as you
already do, then normalize the channel count to extra_channels by trimming or
repeating/tiling processed latent chunks (using latent_format.latent_channels
and process_latent_in) so the final tensor shape matches (batch, extra_channels,
[time,] H, W), ensure dtype/layout/device match noise, and keep using
utils.resize_to_batch_size to match batch size before returning.
In `@comfy/supported_models.py`:
- Around line 1797-1799: CogVideoX_I2V.get_model currently constructs
model_base.CogVideoX directly and skips the patch_size_t handling implemented in
CogVideoX_T2V.get_model, causing 1.5 checkpoints to use the transformer's
fallback geometry; copy the same logic from CogVideoX_T2V.get_model into
CogVideoX_I2V.get_model so that you detect 1.5-style checkpoints and set the
patch_size_t / sample_height/sample_width/sample_frames defaults (the 96/170/81
geometry) before instantiating model_base.CogVideoX (or pass the computed
patch_size_t into model_base.CogVideoX), ensuring both I2V and T2V paths create
the same positional geometry.
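The mutable-default issue flagged above for `forward(transformer_options={})` is a general Python hazard: the default dict is created once at function definition and shared by every call that omits the argument. A minimal sketch of the bug and of the `None` idiom the review asks for (the function names here are illustrative, not the PR's actual signatures):

```python
def forward_buggy(x, transformer_options={}):
    # The same dict object is reused on every call that omits the argument.
    transformer_options.setdefault("calls", 0)
    transformer_options["calls"] += 1
    return transformer_options["calls"]

def forward_fixed(x, transformer_options=None):
    # Each call gets its own dict; a caller-supplied dict is shallow-copied
    # so per-call mutations (e.g. via .setdefault()) don't leak back out.
    transformer_options = {} if transformer_options is None else dict(transformer_options)
    transformer_options.setdefault("calls", 0)
    transformer_options["calls"] += 1
    return transformer_options["calls"]

assert forward_buggy(1) == 1
assert forward_buggy(1) == 2   # state leaked across calls
assert forward_fixed(1) == 1
assert forward_fixed(1) == 1   # fresh dict every call
```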
---
Nitpick comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 166-193: Downsample3D currently constructs self.conv with
nn.Conv2d which bypasses ComfyUI's weight casting; replace that instantiation in
Downsample3D.__init__ with ops.Conv2d (from comfy.ops.disable_weight_init) using
the same arguments (in_channels, out_channels, kernel_size, stride, padding) so
weights participate in ComfyUI dtype handling; ensure the module imports or
references comfy.ops.disable_weight_init.ops (or the same ops alias used
elsewhere in the file) and keep the attribute name self.conv unchanged.
- Around line 196-226: The Upsample3D class currently constructs self.conv with
nn.Conv2d which bypasses the project's dtype-aware conv wrapper; replace that
instantiation with ops.Conv2d (i.e., change the constructor call in
Upsample3D.__init__ from nn.Conv2d(...) to ops.Conv2d(...)) so the module uses
the same dtype/dispatch behavior as other layers; ensure ops is
imported/available in the module and no other call sites in Upsample3D (e.g.,
references to self.conv in forward) need changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: faae2daf-0395-4cd1-a6bd-bb690298ec44
📒 Files selected for processing (13)
- comfy/latent_formats.py
- comfy/ldm/cogvideo/__init__.py
- comfy/ldm/cogvideo/model.py
- comfy/ldm/cogvideo/vae.py
- comfy/ldm/cogvideo/vae_backup.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/model_sampling.py
- comfy/sd.py
- comfy/supported_models.py
- comfy_extras/nodes_cogvideox.py
- convert_sparkvsr_to_comfy.py
- nodes.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/model_base.py`:
- Around line 2018-2023: The caller-provided ofs from kwargs may be a scalar
(int/float) and is passed unchanged to comfy.conds.CONDRegular, causing later
tensor operations to fail; in the branch inside the if
self.diffusion_model.ofs_proj_dim is not None block, coerce/convert
kwargs.get("ofs") into a torch tensor matching the batch size/device/dtype of
noise (like using torch.full or torch.tensor with device=noise.device and
dtype=noise.dtype and shape (noise.shape[0],)), before assigning out['ofs'] =
comfy.conds.CONDRegular(ofs); ensure you still handle None by falling back to
the existing default creation path and preserve dtype/device from noise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7611f40f-e37d-45d2-96e5-7141a7870101
📒 Files selected for processing (5)
- comfy/ldm/cogvideo/model.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/sd.py
- comfy/supported_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- comfy/model_detection.py
```python
if self.diffusion_model.ofs_proj_dim is not None:
    ofs = kwargs.get("ofs", None)
    if ofs is None:
        noise = kwargs.get("noise", None)
        ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
    out['ofs'] = comfy.conds.CONDRegular(ofs)
```
Convert caller-provided ofs values to a tensor before wrapping them.
The default path creates a tensor, but Line 2019 accepts raw kwargs unchanged. If a workflow passes ofs as a Python float/int, CONDRegular later treats it like a tensor during batching/concat and that override path breaks.
Proposed fix

```diff
 if self.diffusion_model.ofs_proj_dim is not None:
+    noise = kwargs.get("noise", None)
     ofs = kwargs.get("ofs", None)
     if ofs is None:
-        noise = kwargs.get("noise", None)
         ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
+    else:
+        ofs = torch.as_tensor(ofs, device=noise.device, dtype=noise.dtype).reshape(-1)
     out['ofs'] = comfy.conds.CONDRegular(ofs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_base.py` around lines 2018 - 2023, The caller-provided ofs from
kwargs may be a scalar (int/float) and is passed unchanged to
comfy.conds.CONDRegular, causing later tensor operations to fail; in the branch
inside the if self.diffusion_model.ofs_proj_dim is not None block, coerce/of
convert kwargs.get("ofs") into a torch tensor matching the batch
size/device/dtype of noise (like using torch.full or torch.tensor with
device=noise.device and dtype=noise.dtype and shape (noise.shape[0],)), before
assigning out['ofs'] = comfy.conds.CONDRegular(ofs); ensure you still handle
None by falling back to the existing default creation path and preserve
dtype/device from noise.
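The coercion this review asks for can be sketched as a small standalone helper. The function name `coerce_ofs` is hypothetical (the real fix lives inline in `extra_conds`); the point is that `None`, a Python scalar, and an existing tensor all normalize to a 1-D tensor aligned with `noise`'s batch size, device, and dtype:

```python
import torch

def coerce_ofs(ofs, noise, default=2.0):
    """Hypothetical helper: normalize `ofs` to a 1-D tensor aligned with `noise`."""
    if ofs is None:
        # Default path: one value per batch element, on noise's device/dtype.
        return torch.full((noise.shape[0],), default, device=noise.device, dtype=noise.dtype)
    # Accept Python scalars, lists, or tensors uniformly.
    ofs = torch.as_tensor(ofs, device=noise.device, dtype=noise.dtype).reshape(-1)
    if ofs.numel() == 1 and noise.shape[0] > 1:
        ofs = ofs.expand(noise.shape[0]).clone()  # broadcast a scalar to the batch
    return ofs

noise = torch.zeros(4, 16, 8, 8)
assert coerce_ofs(None, noise).shape == (4,)            # default path
assert coerce_ofs(2.0, noise).shape == (4,)             # scalar from a workflow
assert coerce_ofs(torch.ones(4), noise).shape == (4,)   # already batched
```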
```diff
@@ -0,0 +1,485 @@
+# CogVideoX VAE - ported to ComfyUI native ops
```
@kijai I assume this backup file can also just be removed?
Actionable comments posted: 3
♻️ Duplicate comments (1)
comfy/model_base.py (1)
2018-2023: ⚠️ Potential issue | 🟠 Major

Coerce caller-provided `ofs` to a tensor before wrapping it.

The default branch builds a tensor, but a user-supplied `ofs` can still be a Python `int`/`float` here. That value then reaches `CogVideoXTransformer3DModel.forward()` unchanged, where the OFS path feeds it into timestep embedding and can fail at runtime. Normalizing it to a tensor on `noise.device`/`noise.dtype` keeps the new conditioning input compatible with scalar workflow values.

🛠️ Proposed fix

```diff
 def extra_conds(self, **kwargs):
     out = super().extra_conds(**kwargs)
     # OFS embedding (CogVideoX 1.5 I2V), default 2.0 as used by SparkVSR
     if self.diffusion_model.ofs_proj_dim is not None:
+        noise = kwargs.get("noise", None)
         ofs = kwargs.get("ofs", None)
         if ofs is None:
-            noise = kwargs.get("noise", None)
             ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
+        else:
+            ofs = torch.as_tensor(ofs, device=noise.device, dtype=noise.dtype).reshape(-1)
         out['ofs'] = comfy.conds.CONDRegular(ofs)
     return out
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/model_base.py` around lines 2018 - 2023, The caller-provided ofs may be a Python scalar and needs coercion to a tensor before wrapping in comfy.conds.CONDRegular to avoid runtime failures in CogVideoXTransformer3DModel.forward(); update the block that reads ofs (when self.diffusion_model.ofs_proj_dim is not None) so that if kwargs.get("ofs") is not None you convert it to a torch tensor on noise.device and noise.dtype (use torch.as_tensor or torch.tensor), and if it is a scalar expand it to shape (noise.shape[0],) to match the default branch, then assign out['ofs'] = comfy.conds.CONDRegular(ofs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 453-467: The encode method accepts arbitrary frame counts which
can break the 1+4k temporal stride contract; before any batching logic in encode
(function comfy/ldm/cogvideo/vae.py::encode), clamp/truncate the input tensor x
along the temporal dimension to the largest prefix that satisfies valid_t = 1 +
4*k where k = (t-1)//4 (ensure at least 1 frame), e.g. compute t = x.shape[2],
valid_t = 1 + 4*((t-1)//4) and if valid_t < t replace x with x[:, :, :valid_t]
so downstream chunking/decoding round-trips cleanly; keep the rest of the encode
flow (frame batching, conv_cache, enc accumulation) unchanged.
In `@comfy/model_sampling.py`:
- Around line 62-76: calculate_input currently returns raw noise but must
construct DDPM x_t in normalized space; change calculate_input to return
latent_image + sigma * noise (using reshape_sigma as needed). Fix noise_scaling
so it only scales the noise by sigma (do not use sqrt(1+sigma**2)) and only add
latent_image when producing x_t; remove the max_denoise branch applying
torch.sqrt(1.0 + sigma**2). Leave calculate_denoised as-is (it expects
model_input to be DDPM x_t) and ensure all references use the reshaped sigma
variable.
In `@nodes.py`:
- Around line 2460-2461: init_builtin_extra_nodes() currently lists
"nodes_cogvideox.py" which doesn't exist and causes CI/startup failures; either
remove the missing entry from the built-in extra nodes list (so only keep
existing entries like "nodes_rtdetr.py") or add the actual
comfy_extras/nodes_cogvideox.py module with its comfy_entrypoint()/extension
registration in the same PR; update the list referenced by
init_builtin_extra_nodes() to only include files that are present or ensure the
new module is added before registering it.
---
Duplicate comments:
In `@comfy/model_base.py`:
- Around line 2018-2023: The caller-provided ofs may be a Python scalar and
needs coercion to a tensor before wrapping in comfy.conds.CONDRegular to avoid
runtime failures in CogVideoXTransformer3DModel.forward(); update the block that
reads ofs (when self.diffusion_model.ofs_proj_dim is not None) so that if
kwargs.get("ofs") is not None you convert it to a torch tensor on noise.device
and noise.dtype (use torch.as_tensor or torch.tensor), and if it is a scalar
expand it to shape (noise.shape[0],) to match the default branch, then assign
out['ofs'] = comfy.conds.CONDRegular(ofs).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a55b5298-f1ec-4fc9-af41-c44255925312
📒 Files selected for processing (11)
- comfy/latent_formats.py
- comfy/ldm/cogvideo/__init__.py
- comfy/ldm/cogvideo/model.py
- comfy/ldm/cogvideo/vae.py
- comfy/ldm/cogvideo/vae_backup.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/model_sampling.py
- comfy/sd.py
- comfy/supported_models.py
- nodes.py
✅ Files skipped from review due to trivial changes (1)
- comfy/latent_formats.py
🚧 Files skipped from review as they are similar to previous changes (1)
- comfy/sd.py
```python
def encode(self, x):
    t = x.shape[2]
    frame_batch = self.num_sample_frames_batch_size
    num_batches = max(t // frame_batch, 1)
    conv_cache = None
    enc = []
    for i in range(num_batches):
        remaining = t % frame_batch
        start = frame_batch * i + (0 if i == 0 else remaining)
        end = frame_batch * (i + 1) + remaining
        chunk, conv_cache = self.encoder(x[:, :, start:end], conv_cache=conv_cache)
        enc.append(chunk.to(x.device))
    enc = torch.cat(enc, dim=2)
    mean, _ = enc.chunk(2, dim=1)
    return mean
```
Clamp input videos to the VAE’s valid temporal stride before encoding.
The active CogVideoX wrapper in comfy/sd.py assumes a 1 + 4k temporal contract (downscale_ratio=(a+3)//4, upscale_ratio=a*4-3), but encode() currently accepts arbitrary frame counts. That means non-conformant clips can produce latent lengths that do not round-trip cleanly through decode/tiled paths.
Suggested fix

```diff
 def encode(self, x):
+    valid_frames = max(
+        1,
+        1 + ((x.shape[2] - 1) // self.temporal_compression_ratio) * self.temporal_compression_ratio,
+    )
+    x = x[:, :, :valid_frames]
     t = x.shape[2]
     frame_batch = self.num_sample_frames_batch_size
```

Based on learnings: In comfy/ldm/lightricks/vae/causal_video_autoencoder.py, replacing hard failures with silent truncation to the valid temporal-stride prefix was intentional and strictly better than exception-raising.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/ldm/cogvideo/vae.py` around lines 453 - 467, The encode method accepts
arbitrary frame counts which can break the 1+4k temporal stride contract; before
any batching logic in encode (function comfy/ldm/cogvideo/vae.py::encode),
clamp/truncate the input tensor x along the temporal dimension to the largest
prefix that satisfies valid_t = 1 + 4*k where k = (t-1)//4 (ensure at least 1
frame), e.g. compute t = x.shape[2], valid_t = 1 + 4*((t-1)//4) and if valid_t <
t replace x with x[:, :, :valid_t] so downstream chunking/decoding round-trips
cleanly; keep the rest of the encode flow (frame batching, conv_cache, enc
accumulation) unchanged.
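The `1 + 4k` contract is simple enough to check in isolation. A sketch of the proposed truncation, generalized over the temporal compression ratio (which is 4 for CogVideoX); the helper name is illustrative:

```python
def clamp_temporal(num_frames, temporal_ratio=4):
    """Largest prefix length of the form 1 + ratio*k (k >= 0); always keeps >= 1 frame."""
    return max(1, 1 + ((num_frames - 1) // temporal_ratio) * temporal_ratio)

# Conformant lengths pass through unchanged; anything else truncates down.
assert clamp_temporal(1) == 1
assert clamp_temporal(9) == 9    # 1 + 4*2
assert clamp_temporal(12) == 9   # 10, 11, 12 all truncate to 9
assert clamp_temporal(49) == 49  # 1 + 4*12
```

Applying this to `x.shape[2]` before the frame-batching loop guarantees the encoded latent round-trips through `decode` without off-by-one temporal drift.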
```python
def calculate_input(self, sigma, noise):
    return noise

def calculate_denoised(self, sigma, model_output, model_input):
    sigma = reshape_sigma(sigma, model_output.ndim)
    return model_input / (sigma ** 2 + 1.0) ** 0.5 - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5

def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
    sigma = reshape_sigma(sigma, noise.ndim)
    if max_denoise:
        noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
    else:
        noise = noise * sigma
    noise += latent_image
    return noise
```
Construct DDPM x_t in normalized space, not K-diffusion space.
calculate_input()/calculate_denoised() assume model_input is raw DDPM x_t, but noise_scaling() still returns latent_image + sigma * noise. That injects an extra sqrt(1 + sigma^2) factor, so CogVideoX gets the wrong input distribution and high-sigma steps are badly mis-scaled.
Suggested fix

```diff
 def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
     sigma = reshape_sigma(sigma, noise.ndim)
-    if max_denoise:
-        noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
-    else:
-        noise = noise * sigma
-    noise += latent_image
-    return noise
+    return (latent_image + noise * sigma) / torch.sqrt(1.0 + sigma ** 2.0)
```

Because CogVideoX defaults to ModelType.V_PREDICTION_DDPM, this one propagates straight into the new model path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_sampling.py` around lines 62 - 76, calculate_input currently
returns raw noise but must construct DDPM x_t in normalized space; change
calculate_input to return latent_image + sigma * noise (using reshape_sigma as
needed). Fix noise_scaling so it only scales the noise by sigma (do not use
sqrt(1+sigma**2)) and only add latent_image when producing x_t; remove the
max_denoise branch applying torch.sqrt(1.0 + sigma**2). Leave calculate_denoised
as-is (it expects model_input to be DDPM x_t) and ensure all references use the
reshaped sigma variable.
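The sqrt(1 + sigma^2) factor comes from the standard change of variables between the k-diffusion and DDPM parameterizations: with alpha_bar related to sigma by sigma^2 = (1 - alpha_bar) / alpha_bar, dividing the k-diffusion sample latent + sigma * noise by sqrt(1 + sigma^2) recovers the DDPM forward-process sample x_t = sqrt(alpha_bar) * latent + sqrt(1 - alpha_bar) * noise. A quick numerical check of that identity:

```python
import math

def kdiff_sample(latent, noise, sigma):
    # k-diffusion / EDM-style noised sample
    return latent + sigma * noise

def ddpm_xt(latent, noise, alpha_bar):
    # DDPM forward-process sample
    return math.sqrt(alpha_bar) * latent + math.sqrt(1.0 - alpha_bar) * noise

for alpha_bar in (0.05, 0.3, 0.9):
    sigma = math.sqrt((1.0 - alpha_bar) / alpha_bar)
    latent, noise = 0.7, -1.2
    normalized = kdiff_sample(latent, noise, sigma) / math.sqrt(1.0 + sigma ** 2)
    assert abs(normalized - ddpm_xt(latent, noise, alpha_bar)) < 1e-12
```

This is the identity the suggested `noise_scaling` fix exploits: normalizing once at sample construction keeps `calculate_denoised`, which already divides by sqrt(sigma^2 + 1), consistent with the input the model actually sees.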
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/cogvideo/model.py`:
- Around line 531-542: The temporal RoPE currently always uses torch.arange(t)
causing sample_frames to be ignored in 1.5 "slice" mode; update the branch where
self.patch_size_t is not None to mirror the spatial logic: read the trained base
temporal length from self.patch_embed (e.g. self.patch_embed.sample_frames or
the appropriate attribute), compute max_t = max(base_t, t), build grid_t =
torch.arange(max_t, device=device, dtype=torch.float32) and then slice grid_t =
grid_t[:t] (or otherwise index down to runtime t), and apply the same change in
the second occurrence mentioned around the 556-570 block so temporal positions
follow the trained base timeline before slicing.
In `@comfy/model_detection.py`:
- Around line 531-537: The current logic treats a missing pos_embedding as
evidence to enable rotary embeddings; instead implement a three-way decision: if
pos_key ('{}patch_embed.pos_embedding'.format(key_prefix)) is present set
dit_config["use_learned_positional_embeddings"]=True and use_rotary=False; else
detect a temporal-patching / 1.5-style layout by scanning state_dict_keys for
known temporal/1.5 markers (e.g. any key under the same patch_embed prefix
containing 'temporal', 'time', or 'temporal_patch' or other
CogVideoXPatchEmbed-specific temporal names) and in that case set
use_rotary_positional_embeddings=True and
use_learned_positional_embeddings=False; otherwise set both
use_rotary_positional_embeddings=False and
use_learned_positional_embeddings=False. Reference pos_key, state_dict_keys,
dit_config and CogVideoXPatchEmbed when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2a1f5479-f3e5-409d-bda1-70b3eab086b7
📒 Files selected for processing (6)
- comfy/ldm/cogvideo/model.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/sd.py
- comfy/supported_models.py
- nodes.py
✅ Files skipped from review due to trivial changes (1)
- nodes.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy/sd.py
- comfy/model_base.py
```python
if self.patch_size_t is not None:
    # CogVideoX 1.5: "slice" mode — positions are simple integer indices
    # Compute at max(sample_size, actual_size) then slice to actual
    base_h = self.patch_embed.sample_height // self.patch_size
    base_w = self.patch_embed.sample_width // self.patch_size
    max_h = max(base_h, h)
    max_w = max(base_w, w)

    grid_h = torch.arange(max_h, device=device, dtype=torch.float32)
    grid_w = torch.arange(max_w, device=device, dtype=torch.float32)
    grid_t = torch.arange(t, device=device, dtype=torch.float32)
else:
```
sample_frames is ignored in 1.5 rotary mode.
Lines 531-542 preserve the trained base grid for height/width, but temporal positions always come from torch.arange(t). That means the sample_frames=81 defaults set for 1.5 checkpoints in comfy/supported_models.py never affect temporal RoPE, so shorter clips get different temporal indexing than the model was trained with.
Please make the temporal axis follow the same slice-mode rule as H/W: build the temporal grid from the configured 1.5 base timeline, then slice down to the runtime clip length. Right now only the spatial axes keep the trained base geometry.
Also applies to: 556-570
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/ldm/cogvideo/model.py` around lines 531 - 542, The temporal RoPE
currently always uses torch.arange(t) causing sample_frames to be ignored in 1.5
"slice" mode; update the branch where self.patch_size_t is not None to mirror
the spatial logic: read the trained base temporal length from self.patch_embed
(e.g. self.patch_embed.sample_frames or the appropriate attribute), compute
max_t = max(base_t, t), build grid_t = torch.arange(max_t, device=device,
dtype=torch.float32) and then slice grid_t = grid_t[:t] (or otherwise index down
to runtime t), and apply the same change in the second occurrence mentioned
around the 556-570 block so temporal positions follow the trained base timeline
before slicing.
```python
pos_key = '{}patch_embed.pos_embedding'.format(key_prefix)
if pos_key in state_dict_keys:
    dit_config["use_learned_positional_embeddings"] = True
    dit_config["use_rotary_positional_embeddings"] = False
else:
    dit_config["use_learned_positional_embeddings"] = False
    dit_config["use_rotary_positional_embeddings"] = True
```
Don’t infer RoPE from a missing patch_embed.pos_embedding key.
This fallback conflates two different cases. In comfy/ldm/cogvideo/model.py, CogVideoXPatchEmbed only persists pos_embedding when use_learned_positional_embeddings is true, so fixed sin/cos checkpoints also omit that key. With the current else, those 1.0-style checkpoints get loaded as use_rotary_positional_embeddings=True, which changes token positions at inference.
A safer split here is:
- `pos_embedding` present → learned positional embeddings
- temporal patching / 1.5-style layout detected → rotary positional embeddings
- otherwise → fixed sin/cos (`use_rotary_positional_embeddings=False`, `use_learned_positional_embeddings=False`)
Suggested fix

```diff
 pos_key = '{}patch_embed.pos_embedding'.format(key_prefix)
 if pos_key in state_dict_keys:
     dit_config["use_learned_positional_embeddings"] = True
     dit_config["use_rotary_positional_embeddings"] = False
-else:
+elif dit_config.get("patch_size_t") is not None:
     dit_config["use_learned_positional_embeddings"] = False
     dit_config["use_rotary_positional_embeddings"] = True
+else:
+    dit_config["use_learned_positional_embeddings"] = False
+    dit_config["use_rotary_positional_embeddings"] = False
```
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_detection.py` around lines 531 - 537, The current logic treats a
missing pos_embedding as evidence to enable rotary embeddings; instead implement
a three-way decision: if pos_key
('{}patch_embed.pos_embedding'.format(key_prefix)) is present set
dit_config["use_learned_positional_embeddings"]=True and use_rotary=False; else
detect a temporal-patching / 1.5-style layout by scanning state_dict_keys for
known temporal/1.5 markers (e.g. any key under the same patch_embed prefix
containing 'temporal', 'time', or 'temporal_patch' or other
CogVideoXPatchEmbed-specific temporal names) and in that case set
use_rotary_positional_embeddings=True and
use_learned_positional_embeddings=False; otherwise set both
use_rotary_positional_embeddings=False and
use_learned_positional_embeddings=False. Reference pos_key, state_dict_keys,
dit_config and CogVideoXPatchEmbed when making the change.
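The decision table in the suggested fix can be stated as a tiny pure function. This is a sketch of the detection logic only, not the actual `model_detection.py` code; the return labels are illustrative:

```python
def positional_mode(state_dict_keys, patch_size_t, key_prefix=""):
    """Three-way positional-embedding detection for CogVideoX-style checkpoints."""
    pos_key = "{}patch_embed.pos_embedding".format(key_prefix)
    if pos_key in state_dict_keys:
        return "learned"        # checkpoint persisted a learned position table
    if patch_size_t is not None:
        return "rotary"         # 1.5-style temporal patching implies RoPE
    return "fixed_sincos"       # 1.0-style: neither learned nor rotary

assert positional_mode({"patch_embed.pos_embedding"}, None) == "learned"
assert positional_mode(set(), 2) == "rotary"
assert positional_mode(set(), None) == "fixed_sincos"
```

The third branch is the one the original two-way `if`/`else` could not express, which is exactly why 1.0-style fixed sin/cos checkpoints were being misclassified as rotary.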
```python
out = model_base.CogVideoX(self, image_to_video=True, device=device)
return out


models = [LotusD, Stable_Zero123, SD15_instructpix2pix, SD15, SD20, SD21UnclipL, SD21UnclipH, SDXL_instructpix2pix, SDXLRefiner, SDXL, SSD1B, KOALA_700M, KOALA_1B, Segmind_Vega, SD_X4Upscaler, Stable_Cascade_C, Stable_Cascade_B, SV3D_u, SV3D_p, SD3, StableAudio, AuraFlow, PixArtAlpha, PixArtSigma, HunyuanDiT, HunyuanDiT1, FluxInpaint, Flux, LongCatImage, FluxSchnell, GenmoMochi, LTXV, LTXAV, HunyuanVideo15_SR_Distilled, HunyuanVideo15, HunyuanImage21Refiner, HunyuanImage21, HunyuanVideoSkyreelsI2V, HunyuanVideoI2V, HunyuanVideo, CosmosT2V, CosmosI2V, CosmosT2IPredict2, CosmosI2VPredict2, ZImagePixelSpace, ZImage, Lumina2, WAN22_T2V, WAN21_T2V, WAN21_I2V, WAN21_FunControl2V, WAN21_Vace, WAN21_Camera, WAN22_Camera, WAN22_S2V, WAN21_HuMo, WAN22_Animate, WAN21_FlowRVS, WAN21_SCAIL, Hunyuan3Dv2mini, Hunyuan3Dv2, Hunyuan3Dv2_1, HiDream, Chroma, ChromaRadiance, ACEStep, ACEStep15, Omnigen2, QwenImage, Flux2, Kandinsky5Image, Kandinsky5, Anima, RT_DETR_v4, CogVideoX_I2V, CogVideoX_T2V]
```
@comfyanonymous @Kosinkadink I think this list should be formatted with each element on a separate line; that way the diff stays clear. Right now, with every change, the whole array shows as changed.
Why do I mention this? When rebasing or merging with master, one could easily remove a model that was already added if not careful.
What do you think?
|
Closing for this one: #13402
This is a prerequisite PR for CORE-38, Netflix's VOID model.
It was written by @kijai and rebased by me. I've tested it only with the VOID model and it works well.