feat: Gemma4 text generation support (CORE-30) #13376
kijai wants to merge 15 commits into Comfy-Org:master from
Conversation
outputs can 100% match transformers with same sdpa flags, checkpoint this and then optimize
> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a Gemma4 multimodal text-encoder module with text/vision/audio tokenizers, model variants, and generation wrappers; extends SD text-encoder detection/loading to support Gemma4. Introduces TORCH_HAS_GQA and threads SDPA kwargs (…)

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
comfy_extras/nodes_textgen.py (1)
35-49: ⚠️ Potential issue | 🔴 Critical

Update `TextGenerateLTX2Prompt` for the new parameters.

This signature change breaks the subclass override below. `TextGenerateLTX2Prompt.execute()` still calls `super().execute(...)` positionally, so its `thinking` argument now lands in `video`. With the default `False`, Gemma4 tokenization will treat that as a provided video input and blow up on `False.movedim(...)`. The inherited `video`/`audio` schema also no longer matches the subclass signature.

Suggested fix
```diff
 class TextGenerateLTX2Prompt(TextGenerate):
 @@
-    def execute(cls, clip, prompt, max_length, sampling_mode, image=None, thinking=False) -> io.NodeOutput:
+    def execute(cls, clip, prompt, max_length, sampling_mode, image=None, video=None, audio=None, thinking=False) -> io.NodeOutput:
         if image is None:
             formatted_prompt = f"<start_of_turn>system\n{LTX2_T2V_SYSTEM_PROMPT.strip()}<end_of_turn>\n<start_of_turn>user\nUser Raw Input Prompt: {prompt}.<end_of_turn>\n<start_of_turn>model\n"
         else:
             formatted_prompt = f"<start_of_turn>system\n{LTX2_I2V_SYSTEM_PROMPT.strip()}<end_of_turn>\n<start_of_turn>user\n\n<image_soft_token>\n\nUser Raw Input Prompt: {prompt}.<end_of_turn>\n<start_of_turn>model\n"
-        return super().execute(clip, formatted_prompt, max_length, sampling_mode, image, thinking)
+        return super().execute(
+            clip,
+            formatted_prompt,
+            max_length,
+            sampling_mode,
+            image=image,
+            video=video,
+            audio=audio,
+            thinking=thinking,
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy_extras/nodes_textgen.py` around lines 35 - 49, The subclass TextGenerateLTX2Prompt.execute signature and its call to super() must be updated to match the base class ordering and new optional inputs so the thinking boolean doesn't end up in the video parameter; change TextGenerateLTX2Prompt.execute to accept (cls, clip, prompt, max_length, sampling_mode, image=None, video=None, audio=None, thinking=False) and call super().execute(...) using keyword arguments (e.g., super().execute(clip=clip, prompt=prompt, max_length=max_length, sampling_mode=sampling_mode, image=image, video=video, audio=audio, thinking=thinking)); also update any input/output schema in TextGenerateLTX2Prompt that referenced video/audio to match the inherited io schema so clip.tokenize is passed the correct types.
🧹 Nitpick comments (1)
comfy/ldm/modules/attention.py (1)
227-230: Scale pre-scaling is correct; consider adding `enable_gqa` support for consistency.

The math to compensate for the internal 1/sqrt(dim_head) is correct. However, unlike `attention_basic`, this function doesn't handle `enable_gqa`. If `enable_gqa=True` is passed here, it's silently ignored, which could lead to incorrect results if this code path is exercised by GQA models in the future.

♻️ Optional: Add GQA support for consistency
```diff
 def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
     attn_precision = get_attn_precision(attn_precision, query.dtype)

     if skip_reshape:
         b, _, _, dim_head = query.shape
     else:
         b, _, dim_head = query.shape
         dim_head //= heads

+    if kwargs.get("enable_gqa", False) and query.shape[-3] != key.shape[-3]:
+        n_rep = query.shape[-3] // key.shape[-3]
+        key = key.repeat_interleave(n_rep, dim=-3)
+        value = value.repeat_interleave(n_rep, dim=-3)
+
     if "scale" in kwargs:
         # Pre-scale query to match requested scale (cancels internal 1/sqrt(dim_head))
         query = query * (kwargs["scale"] * dim_head ** 0.5)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/modules/attention.py` around lines 227 - 230, The pre-scaling block should also respect GQA settings: mirror the logic used in attention_basic by checking kwargs.get("enable_gqa") and, when True, adjust the scale to account for GQA group size before applying it to query (use kwargs.get("gqa_group_size", 1) to compute the additional sqrt factor together with dim_head so the multiplication query = query * (...) properly cancels the internal 1/sqrt(dim_head) for grouped heads); update the block that currently uses kwargs["scale"] and dim_head to incorporate enable_gqa and gqa_group_size.
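The cancellation the comment above relies on can be checked with plain arithmetic. This is an illustrative sketch, not code from the PR; the values are arbitrary:

```python
# Illustrative check of the pre-scaling trick: multiplying the query by
# scale * sqrt(dim_head) cancels SDPA's internal 1/sqrt(dim_head) factor,
# leaving exactly the requested scale.
import math

dim_head = 64
requested_scale = 0.07

q = 1.0
q_prescaled = q * (requested_scale * dim_head ** 0.5)
effective = q_prescaled * (1.0 / math.sqrt(dim_head))  # what SDPA applies internally

print(math.isclose(effective, q * requested_scale))  # True
```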
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/modules/attention.py`:
- Around line 512-516: The code builds sdpa_extra from kwargs and may pass
enable_gqa to comfy.ops.scaled_dot_product_attention causing a TypeError on
PyTorch <2.5; modify the sdpa_extra construction so it only includes
"enable_gqa" when the runtime PyTorch version check (torch_version_numeric) is
>= (2,5) (otherwise only include "scale"), or alternatively always route through
the existing mismatched-shape fallback logic (the block that handles unequal q/k
shapes) so enable_gqa is never passed to scaled_dot_product_attention on older
PyTorch; update the sdpa_extra creation near the SDP_BATCH_LIMIT branch and
references to kwargs so enable_gqa is gated by torch_version_numeric.
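The version gating described above can be sketched as follows; `build_sdpa_extra` and the pretend version constant are illustrative names, not the actual comfy code:

```python
# Hypothetical sketch of gating the enable_gqa kwarg by PyTorch version.
torch_version_numeric = (2, 4)  # pretend we are on PyTorch 2.4

def build_sdpa_extra(kwargs, torch_version=torch_version_numeric):
    extra = {}
    if "scale" in kwargs:
        extra["scale"] = kwargs["scale"]
    # enable_gqa only exists in scaled_dot_product_attention on PyTorch >= 2.5,
    # so drop it on older runtimes instead of raising a TypeError
    if kwargs.get("enable_gqa") and torch_version >= (2, 5):
        extra["enable_gqa"] = True
    return extra

print(build_sdpa_extra({"scale": 0.1, "enable_gqa": True}))          # {'scale': 0.1}
print(build_sdpa_extra({"enable_gqa": True}, torch_version=(2, 5)))  # {'enable_gqa': True}
```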
In `@comfy/sd.py`:
- Around line 1297-1298: detect_te_model() now returns TEModel.QWEN35_31B for
weight.shape[0]==5120 but TEModel lacks that member and the Qwen35 loader only
supports up to QWEN35_27B; add a new enum member TEModel.QWEN35_31B and extend
the Qwen35 loader branch (the function handling Qwen35 checkpoints / any loader
switch that currently handles QWEN35_7B/QWEN35_13B/QWEN35_27B) to recognize and
correctly load the 31B layout (map the 5120 projection size to the new enum and
implement the corresponding weight slicing/reshaping/parameter assignment logic
consistent with the 27B case). Ensure references to TEModel and the loader
switch (e.g., detect_te_model(), the Qwen35 loading function/class) are updated
so a 31B checkpoint follows the same loading pattern as other Qwen35 sizes.
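A hedged sketch of the suggested enum extension; only the 5120 width comes from the review, the other widths and mappings are placeholders:

```python
# Illustrative sketch of mapping a detected projection width to a new
# TEModel member. Only 5120 -> QWEN35_31B is taken from the review;
# the remaining widths are placeholder values.
from enum import Enum, auto

class TEModel(Enum):
    QWEN35_7B = auto()
    QWEN35_13B = auto()
    QWEN35_27B = auto()
    QWEN35_31B = auto()  # new member for the 5120-wide checkpoint

def detect_te_model(proj_width):
    return {
        3584: TEModel.QWEN35_7B,   # placeholder width
        4096: TEModel.QWEN35_13B,  # placeholder width
        4608: TEModel.QWEN35_27B,  # placeholder width
        5120: TEModel.QWEN35_31B,  # width called out by the review
    }.get(proj_width)

print(detect_te_model(5120))  # TEModel.QWEN35_31B
```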
In `@comfy/text_encoders/gemma4.py`:
- Around line 1074-1080: The tokenizer currently always builds audio_features
from the incoming audio in the block that creates waveform/sample_rate and calls
_extract_mel_spectrogram, which leads to invalid audio embed dicts for variants
that don't support audio (e.g., Gemma4_31B_Config built without
Gemma4AudioMixin). Update the logic in the method that contains this snippet to
first check the model/variant capability (e.g., presence of Gemma4AudioMixin on
the variant or audio_config on the variant instance prepared by _make_variant)
and only process audio when that capability exists; otherwise either raise a
clear error rejecting audio input for audio-less variants or explicitly set
audio_features=[] and skip emitting audio embeds. Reference the audio processing
symbols audio_features, _extract_mel_spectrogram, _make_variant,
Gemma4_31B_Config and Gemma4AudioMixin when making the guard.
- Around line 1006-1008: The Gemma4_Tokenizer.state_dict currently returns an
empty dict so tokenizer_json is dropped; change state_dict in class
Gemma4_Tokenizer to return the tokenizer payload (e.g. include 'tokenizer_json':
self.tokenizer_json) and add a complementary load_state_dict(self, sd) that
restores self.tokenizer_json (and any derived caches) from
sd.get('tokenizer_json') so CLIP.get_sd()/CLIPSave will persist and later
rebuild the tokenizer from tokenizer_json.
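The round-trip this comment asks for can be sketched as below; the class and key names follow the review comment, but the constructor and bodies are illustrative:

```python
# Sketch of persisting tokenizer_json through state_dict round-trips.
class Gemma4_Tokenizer:
    def __init__(self, tokenizer_json=None):
        self.tokenizer_json = tokenizer_json

    def state_dict(self):
        # persist the tokenizer payload instead of returning {}
        return {"tokenizer_json": self.tokenizer_json}

    def load_state_dict(self, sd):
        # restore the payload (derived caches would be rebuilt here too)
        self.tokenizer_json = sd.get("tokenizer_json")

a = Gemma4_Tokenizer({"vocab": {"<pad>": 0}})
b = Gemma4_Tokenizer()
b.load_state_dict(a.state_dict())
print(b.tokenizer_json == a.tokenizer_json)  # True
```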
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 21b2cc8d-4d7a-48de-9afb-8ac74b566014
📒 Files selected for processing (10)
- comfy/ldm/modules/attention.py
- comfy/rmsnorm.py
- comfy/sd.py
- comfy/text_encoders/gemma4.py
- comfy/text_encoders/llama.py
- comfy/text_encoders/lt.py
- comfy/text_encoders/lumina2.py
- comfy/text_encoders/qwen35.py
- comfy/utils.py
- comfy_extras/nodes_textgen.py
💤 Files with no reviewable changes (2)
- comfy/text_encoders/qwen35.py
- comfy/utils.py
Actionable comments posted: 3
♻️ Duplicate comments (1)
comfy/text_encoders/gemma4.py (1)
1078-1084: ⚠️ Potential issue | 🟠 Major

Reject audio on audio-less Gemma4 variants.

`Gemma4_31B_Config` disables audio, and `_make_variant()` skips `Gemma4AudioMixin` for that model, but this tokenizer still emits audio embeds whenever `audio` is passed. On the 31B path those embeds have no consumer and will fail later in embed preprocessing.

Proposed fix
```diff
         # Process audio
         audio_features = []
         if audio is not None:
+            if not getattr(self, "supports_audio", True):
+                raise ValueError("This Gemma4 variant does not support audio inputs.")
             waveform = audio["waveform"].squeeze(0) if hasattr(audio, "__getitem__") else audio
             sample_rate = audio.get("sample_rate", 16000) if hasattr(audio, "get") else 16000
             mel, mel_mask = self._extract_mel_spectrogram(waveform, sample_rate)
             audio_features = [(mel.unsqueeze(0), mel_mask.unsqueeze(0))]  # ([1, T, 128], [1, T])
```

```python
# in _make_variant()
class Tokenizer(Gemma4Tokenizer):
    supports_audio = audio
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/text_encoders/gemma4.py` around lines 1078 - 1084, The tokenizer currently processes audio unconditionally, causing unsupported variants (e.g., Gemma4_31B_Config which omits Gemma4AudioMixin) to produce audio embeds that later fail; update _make_variant to set a per-variant flag and gate audio processing: in _make_variant() define the generated Tokenizer class to include supports_audio = audio (or similar boolean) based on whether Gemma4AudioMixin was included, then modify the audio handling branch in Gemma4Tokenizer._encode/_process (the block that creates audio_features from audio, waveform, sample_rate, _extract_mel_spectrogram) to early-return/ignore or raise if audio is provided but supports_audio is False so audio is rejected for audio-less variants.
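A runnable sketch of the proposed per-variant capability flag; the method bodies here are illustrative stand-ins, the real tokenizer does far more work:

```python
# Per-variant capability flag: the generated Tokenizer class stamps
# supports_audio, and tokenize rejects audio when the flag is False.
class Gemma4Tokenizer:
    supports_audio = True

    def tokenize(self, text, audio=None):
        if audio is not None and not getattr(self, "supports_audio", True):
            raise ValueError("This Gemma4 variant does not support audio inputs.")
        return {"text": text, "audio": audio}

def _make_variant(audio: bool):
    # mirror of the suggestion: stamp the flag onto the generated class
    class Tokenizer(Gemma4Tokenizer):
        supports_audio = audio
    return Tokenizer

Tok31B = _make_variant(audio=False)
print(Tok31B().tokenize("hi")["text"])  # hi
try:
    Tok31B().tokenize("hi", audio=object())
except ValueError as e:
    print("rejected:", e)
```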
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/text_encoders/gemma4.py`:
- Around line 1094-1102: The video subsampling code in gemma4.py uses a
hardcoded default fps=24 which drops most frames; change the logic in the
is_video branch (the block that computes fps = kwargs.get("fps", 24), step and
indices, used by clip.tokenize called from TextGenerate.execute()) so it does
not assume 24 FPS—either use kwargs.get("fps", 1) as the default or only perform
subsampling when an explicit fps is provided; update the step calculation
accordingly (keep step = max(1, round(fps)) if using fps as the target frame
interval) so callers that don’t pass fps (and the node tooltip that expects 1
FPS batches) do not lose frames by default.
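The subsampling behavior in question can be sketched as follows; `subsample_frames` is a hypothetical helper, not the actual code:

```python
# fps-based frame subsampling with a safe default of 1 FPS, per the
# review: callers that don't pass fps keep every frame.
def subsample_frames(num_frames, fps=1):
    # step = number of source frames to skip per kept frame
    step = max(1, round(fps))
    return list(range(0, num_frames, step))

print(subsample_frames(8))           # [0, 1, 2, 3, 4, 5, 6, 7] -- nothing dropped
print(subsample_frames(48, fps=24))  # [0, 24] -- one frame per second of 24 FPS video
```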
- Around line 1075-1076: The method tokenize_with_weights currently writes
request-scoped state to self by setting self.thinking, which can leak across
concurrent requests (see tokenize_with_weights and the later read that expects
this flag); make thinking a local variable instead of storing it on self: remove
the assignment to self.thinking, use a local thinking variable, and pass that
local variable into any internal helper calls or downstream functions that
previously relied on self.thinking (or change those helpers to accept a thinking
parameter) so no shared tokenizer/CLIP.clone() instances see or store mutable
request state on the object.
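A minimal sketch of keeping `thinking` request-local rather than writing it to `self`; the method bodies are illustrative stand-ins:

```python
# Thread the flag through as a parameter -- no shared mutable state, so
# tokenizers shared across CLIP.clone() instances cannot leak it between
# concurrent requests.
class Tokenizer:
    def tokenize_with_weights(self, text, thinking=False):
        return self._encode(text, thinking=thinking)

    def _encode(self, text, thinking=False):
        suffix = " [thinking]" if thinking else ""
        return text + suffix

t = Tokenizer()
print(t.tokenize_with_weights("hi", thinking=True))  # hi [thinking]
print(t.tokenize_with_weights("hi"))                 # hi
```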
- Around line 624-628: The current dense one-hot construction (creating one_hot
from clamped pixel_position_ids and matmul with position_embedding_table) causes
a huge temporary allocation; replace it by directly gathering the needed rows
from position_embedding_table using the clamped pixel_position_ids. Concretely,
compute clamped_positions = pixel_position_ids.clamp(min=0), move/reshape
pos_table = comfy.model_management.cast_to_device(self.position_embedding_table,
hidden_states.device, hidden_states.dtype) and then gather embeddings for those
indices (e.g., use advanced indexing or torch.gather/torch.take_along_dim to
select the two position rows per patch) and reduce/sum across the position
dimension to produce position_embeddings — avoiding the one_hot -> matmul path
in methods that use pixel_position_ids, position_embedding_table,
position_embedding_size, pos_table, and position_embeddings.
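The one-hot-matmul vs. direct-gather equivalence can be demonstrated with NumPy; the shapes and values below are illustrative stand-ins for the real tensors:

```python
# Gathering rows directly from the embedding table matches the dense
# one_hot @ table path without the large one-hot temporary.
import numpy as np

table = np.arange(12, dtype=np.float32).reshape(6, 2)  # stand-in position_embedding_table
pixel_position_ids = np.array([[0, 3], [2, 5]])        # two position rows per patch

clamped = np.clip(pixel_position_ids, 0, None)

# dense path: builds a one-hot temporary before the matmul
one_hot = np.eye(table.shape[0], dtype=np.float32)[clamped]
dense = (one_hot @ table).sum(axis=1)

# gather path: index the needed rows directly, then sum over positions
gathered = table[clamped].sum(axis=1)

print(np.allclose(dense, gathered))  # True
```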
📒 Files selected for processing (4)
- comfy/ldm/modules/attention.py
- comfy/sd.py
- comfy/text_encoders/gemma4.py
- comfy_extras/nodes_textgen.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy_extras/nodes_textgen.py
- comfy/ldm/modules/attention.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ops.py`:
- Around line 1228-1244: The code uses weight.params.orig_dtype when computing
target_dtype but the QuantizedTensor API uses the _params attribute; update the
access in the embedding branch to use weight._params.orig_dtype (keep the
surrounding logic in the block that checks isinstance(weight, QuantizedTensor)
and uses cast_bias_weight / uncast_bias_weight and out_dtype) so target_dtype is
derived from weight._params.orig_dtype instead of weight.params.orig_dtype.
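A toy sketch of the attribute fix; these classes only mimic the shape of the `QuantizedTensor` API the comment describes:

```python
# Metadata lives on _params, so orig_dtype must be read from
# weight._params.orig_dtype, not weight.params.orig_dtype.
class _Params:
    def __init__(self, orig_dtype):
        self.orig_dtype = orig_dtype

class QuantizedTensor:
    def __init__(self, orig_dtype):
        self._params = _Params(orig_dtype)

w = QuantizedTensor("float16")
target_dtype = w._params.orig_dtype  # w.params.orig_dtype would raise AttributeError
print(target_dtype)  # float16
```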
📒 Files selected for processing (1)
comfy/ops.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_textgen.py`:
- Around line 173-178: The prompt selection in execute currently tests only
image presence and uses LTX2_T2V_SYSTEM_PROMPT for text-only even when video is
provided; update the branch in execute so it treats video the same as image when
deciding between LTX2_T2V_SYSTEM_PROMPT and LTX2_I2V_SYSTEM_PROMPT (e.g., check
"if image is None and video is None" or "if not (image or video)" before
choosing LTX2_T2V_SYSTEM_PROMPT), ensuring formatted_prompt includes the visual
token path when either image or video is present and leaving the rest of the
call to super().execute unchanged.
📒 Files selected for processing (1)
comfy_extras/nodes_textgen.py
```diff
     def execute(cls, clip, prompt, max_length, sampling_mode, image=None, video=None, audio=None, thinking=False) -> io.NodeOutput:
         if image is None:
             formatted_prompt = f"<start_of_turn>system\n{LTX2_T2V_SYSTEM_PROMPT.strip()}<end_of_turn>\n<start_of_turn>user\nUser Raw Input Prompt: {prompt}.<end_of_turn>\n<start_of_turn>model\n"
         else:
             formatted_prompt = f"<start_of_turn>system\n{LTX2_I2V_SYSTEM_PROMPT.strip()}<end_of_turn>\n<start_of_turn>user\n\n<image_soft_token>\n\nUser Raw Input Prompt: {prompt}.<end_of_turn>\n<start_of_turn>model\n"
-        return super().execute(clip, formatted_prompt, max_length, sampling_mode, image, thinking)
+        return super().execute(clip, formatted_prompt, max_length, sampling_mode, image=image, video=video, audio=audio, thinking=thinking)
```
Treat video as visual conditioning when selecting the LTX2 system prompt.
Line 174 still chooses the text-only prompt whenever image is None, even if video is present. That means a video-conditioned request gets formatted as plain text-only prompt enhancement, despite Line 178 forwarding the frames downstream.
Minimal fix
```diff
-        if image is None:
+        if image is None and video is None:
```

At minimum, the branch needs to consider video too.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def execute(cls, clip, prompt, max_length, sampling_mode, image=None, video=None, audio=None, thinking=False) -> io.NodeOutput:
        if image is None and video is None:
            formatted_prompt = f"<start_of_turn>system\n{LTX2_T2V_SYSTEM_PROMPT.strip()}<end_of_turn>\n<start_of_turn>user\nUser Raw Input Prompt: {prompt}.<end_of_turn>\n<start_of_turn>model\n"
        else:
            formatted_prompt = f"<start_of_turn>system\n{LTX2_I2V_SYSTEM_PROMPT.strip()}<end_of_turn>\n<start_of_turn>user\n\n<image_soft_token>\n\nUser Raw Input Prompt: {prompt}.<end_of_turn>\n<start_of_turn>model\n"
        return super().execute(clip, formatted_prompt, max_length, sampling_mode, image=image, video=video, audio=audio, thinking=thinking)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy_extras/nodes_textgen.py` around lines 173 - 178, The prompt selection
in execute currently tests only image presence and uses LTX2_T2V_SYSTEM_PROMPT
for text-only even when video is provided; update the branch in execute so it
treats video the same as image when deciding between LTX2_T2V_SYSTEM_PROMPT and
LTX2_I2V_SYSTEM_PROMPT (e.g., check "if image is None and video is None" or "if
not (image or video)" before choosing LTX2_T2V_SYSTEM_PROMPT), ensuring
formatted_prompt includes the visual token path when either image or video is
present and leaving the rest of the call to super().execute unchanged.
Thanks for this! @kijai
Adds support for Gemma4 models: E2B, E4B and 31B
https://huggingface.co/Comfy-Org/Gemma4
This is mostly standalone as it includes new functionality:
This implementation was done by referencing the transformers version, and 100% output parity was reached before any optimizations and ComfyUI-specific changes. Those changes are inevitable and do not degrade quality; they just introduce slightly different randomness from very minor details. However, I left most of the debugging aids in place for easier comparison against the reference.
To that end, I also added ways to use torch-native SDPA GQA and scale, as I couldn't find a way to do those manually without tiny numerical differences, as well as an option in our RMS norm to easily disable the fused path for debugging/compatibility reasons.
In the process I also found some small mistakes in my Gemma3 text generation and included fixes for those as well, and unified the image embed scaling method with Gemma4.
I'm also including fp8 scaled embedding support in mixed-precision ops, as the embed_token weights are huge in this model and seem to work fine in fp8.