ComfyUI custom node that reuses the KJNodes Ideogram 4 visual box editor pattern for Z-Image-Turbo regional prompting.
This node is usable for Z-Image-Turbo (ZIT) regional prompting and image-to-image regional edits, but it cannot make Z-Image-Turbo behave exactly like Ideogram's native regional API. ZIT was not built for hard box-constrained regional generation, so final adherence still depends on the model, prompt, mask size, denoise, and workflow.
Z-Image-Turbo Region Builder KJ
Outputs:
- positive: the positive Z-Image conditioning. Connect this to the sampler positive input.
- negative: the negative Z-Image conditioning. Connect this to the sampler negative input.
- latent_with_noise_mask: latent output for the sampler. For text-to-image it is an empty latent. For image-to-image, if image and vae are connected, it is the encoded source image latent with combined_mask attached as noise_mask.
- combined_mask: one mask made from all drawn boxes. White/bright areas are editable, black areas should stay preserved. Usually only needed if your workflow has a separate mask/inpaint input.
- region_masks: separate masks for every drawn box, shape N,H,W. Mostly for debugging, previewing, or advanced workflows that process each region separately.
- source_image: the input image passed through. If no image is connected, this is a blank image. Optional; use it only if another node needs the original image.
- preview: visual preview image with boxes and mask tint. Optional; useful with Preview Image.
- regions_json: debug text showing parsed regions and the final composed positive prompt. Optional; useful for checking what the node actually sent to CLIP.
- bboxes: bounding boxes in pixel coordinates. Optional; only useful for nodes that accept ComfyUI BBOX data.
- width / height: passthrough image dimensions. Optional; useful if another node needs the same size.
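To make the relationship between region_masks (shape N,H,W) and combined_mask more concrete, here is a minimal sketch that builds one rectangle mask per box and merges them with a per-pixel maximum. The helper names and the max-merge rule are assumptions for illustration only; the node's actual implementation may differ, and nested lists stand in for tensors.

```python
# Illustrative sketch only: box_mask, combine_masks, and the max-merge rule
# are assumptions, not taken from the node's source.

def box_mask(width, height, box):
    """Return an H x W mask (list of rows): 1.0 inside the box, 0.0 outside.

    box is (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]

def combine_masks(region_masks):
    """Merge N masks of shape H x W into one by taking the per-pixel maximum."""
    height = len(region_masks[0])
    width = len(region_masks[0][0])
    return [
        [max(mask[y][x] for mask in region_masks) for x in range(width)]
        for y in range(height)
    ]

boxes = [(0, 0, 2, 2), (1, 1, 4, 3)]
region_masks = [box_mask(4, 3, b) for b in boxes]  # shape N,H,W = 2,3,4
combined_mask = combine_masks(region_masks)        # shape H,W = 3,4
```

With a maximum, overlapping boxes stay fully editable instead of adding up past 1.0.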
- Put this folder in ComfyUI/custom_nodes/ZIT-Ideogram.
- Restart ComfyUI.
- Add node: ZIT-Ideogram/Z-Image > Z-Image-Turbo Region Builder KJ.
No extra Python packages are required.
Use ComfyUI's built-in Z-Image-Turbo workflow as the base graph, then replace the normal text encoders with this node:
- Connect Z-Image/Qwen CLIP to clip.
- Connect positive and negative to the sampler.
- For text-to-image, connect latent_with_noise_mask to the sampler latent input, or connect width, height, and batch_size to your own latent/image size nodes.
- For image-to-image regional editing, connect the source image to image, connect the same VAE used by the workflow to vae, then connect latent_with_noise_mask to the sampler latent input.
- Draw boxes in the editor and set each region prompt, optional region negative prompt, strength, and feather.
The node does not call any Ideogram API and does not emit Ideogram caption JSON. It produces native ComfyUI conditioning and masks.
Minimum wiring:
- clip input <- your Z-Image/Qwen CLIP.
- positive output -> sampler positive.
- negative output -> sampler negative.
- latent_with_noise_mask output -> sampler latent input.
You can ignore region_masks, combined_mask, source_image, regions_json, bboxes, width, and height for a simple text-to-image workflow.
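If you drive ComfyUI through its API-format workflow, the minimum wiring might look roughly like the fragment below, written as a Python dict that mirrors the JSON. The node ids, the class_type string for this node, and the output slot indices (0, 1, 2, following the order in which the outputs are listed) are all hypothetical; check them against a workflow exported from your own install. Other KSampler inputs (seed, steps, cfg, sampler_name, scheduler) are omitted for brevity.

```python
# Sketch of a ComfyUI API-format graph fragment.
# ASSUMPTIONS: node ids, the class_type "ZImageTurboRegionBuilderKJ", and the
# output slot indices are hypothetical and used for illustration only.
REGION_NODE = "10"  # hypothetical id of this node
CLIP_NODE = "2"     # hypothetical id of the Z-Image/Qwen CLIP loader
MODEL_NODE = "1"    # hypothetical id of the model loader

graph = {
    REGION_NODE: {
        "class_type": "ZImageTurboRegionBuilderKJ",  # hypothetical name
        "inputs": {"clip": [CLIP_NODE, 0]},
    },
    "11": {
        "class_type": "KSampler",
        "inputs": {
            "model": [MODEL_NODE, 0],
            "positive": [REGION_NODE, 0],      # positive output
            "negative": [REGION_NODE, 1],      # negative output
            "latent_image": [REGION_NODE, 2],  # latent_with_noise_mask output
            "denoise": 1.0,                    # full denoise for text-to-image
        },
    },
}

def missing_links(graph, sampler_id, source_id):
    """List minimum-wiring sampler inputs that are not fed by source_id."""
    inputs = graph[sampler_id]["inputs"]
    required = ("positive", "negative", "latent_image")
    return [name for name in required if inputs.get(name, [None])[0] != source_id]

print(missing_links(graph, "11", REGION_NODE))
```

An empty list from missing_links means all three minimum connections come from the region node.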
Recommended wiring:
- Load or provide an image.
- Connect that image to this node's image input.
- Connect the workflow VAE to this node's vae input.
- Connect positive -> sampler positive.
- Connect negative -> sampler negative.
- Connect latent_with_noise_mask -> sampler latent input.
- Set sampler denoise around 0.6 to 0.8 as a starting point.
With this wiring, the node encodes the source image and attaches the drawn boxes as the latent noise mask. The sampler should mainly change the masked regions. You normally do not need to connect combined_mask separately in this setup.
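Conceptually, a latent noise mask tells the sampler where it may change the image. The sketch below shows that idea as a simple per-pixel blend. It is a simplified illustration, not the sampler's actual code: ComfyUI latents are dicts of tensors with a "samples" entry and an optional "noise_mask" entry, while this example uses nested lists and a hypothetical blend_with_mask helper.

```python
# Conceptual sketch: how a noise mask limits changes to the drawn regions.
# Nested lists stand in for tensors; blend_with_mask is a hypothetical helper.

def blend_with_mask(original, edited, mask):
    """Per-pixel blend: mask 1.0 takes the edited value, 0.0 keeps the original."""
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(orow, erow, mrow)]
        for orow, erow, mrow in zip(original, edited, mask)
    ]

original = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]  # encoded source image (stand-in)
edited = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.9]]    # what the sampler would produce
noise_mask = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]

# Same layout idea as a ComfyUI latent dict: samples plus an attached mask.
latent = {"samples": original, "noise_mask": noise_mask}
result = blend_with_mask(latent["samples"], edited, latent["noise_mask"])
```

Pixels where the mask is 0.0 keep the source value, which is why the area outside the boxes is mostly preserved.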
Use combined_mask separately only if your graph has a dedicated mask/inpaint input, for example an inpaint conditioning node, mask preview node, or a workflow that expects an external mask in addition to the latent.
Z-Image-Turbo does not natively consume Ideogram-style bounding-box caption JSON. The editor boxes are converted into ComfyUI masks and prompt text.
conditioning_mode:
- single_prompt_fast: recommended default. Region prompts are folded into one Z-Image prompt and masks are output separately. This keeps sampling speed close to a normal Z-Image workflow.
- regional_conditioning_slow: experimental for Z-Image-Turbo. It emits one masked conditioning per region, can multiply sampler work by the number of regions, and often does not improve adherence because ZIT is not designed like Ideogram's API regional editor.
For text-to-image, region text is treated as additional positive prompt text with a rough area hint. Z-Image-Turbo may still ignore the rectangle or move the concept because it does not natively support hard regional prompt boxes.
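As a rough idea of what folding region text into one prompt with an area hint can look like, here is a small sketch. The wording of the hint, the thirds-based grid, and the function names are assumptions for illustration; the node's real prompt template is whatever regions_json reports.

```python
# Hypothetical sketch: the hint wording and function names are assumptions,
# not the node's actual prompt template.

def area_hint(box, width, height):
    """Describe a box centre as a coarse position such as 'upper left'."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / width   # normalized centre, 0.0 to 1.0
    cy = (y0 + y1) / 2 / height
    col = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    row = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    return f"{row} {col}"

def compose_prompt(base_prompt, regions, width, height):
    """Append each region prompt, prefixed with its area hint, to the base prompt."""
    parts = [base_prompt]
    for region in regions:
        hint = area_hint(region["box"], width, height)
        parts.append(f"in the {hint} of the image: {region['prompt']}")
    return ", ".join(parts)

regions = [
    {"box": (0, 0, 256, 256), "prompt": "a red balloon"},
    {"box": (640, 640, 1024, 1024), "prompt": "a sleeping cat"},
]
prompt = compose_prompt("a sunny park", regions, 1024, 1024)
print(prompt)
```

Because the hint is only text, the model is free to place the concept elsewhere, which matches the behaviour described above.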
For image-to-image edits, connect image and vae, then use latent_with_noise_mask as the sampler latent. The node's batch_size repeats a single source latent/mask when you want multiple variations. In testing, KSampler denoise around 0.6 to 0.8 is usually the practical range for visible regional changes while preserving the rest of the image. For changes like clothing replacement, draw the region larger than the exact clothing item so the model has enough context to rebuild edges, folds, and nearby transitions. Preservation still depends on Z-Image-Turbo, denoise, mask feather, region size, and the workflow; it is not equivalent to Ideogram 4's API-level regional editor.
default_feather controls mask edge softness in pixels. 0 is a hard rectangle edge, 8-24 is a normal soft edge, and 32+ creates a very broad transition.
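The effect of the feather value can be pictured with a linear ramp from the box edge inward. This is a sketch under assumptions: the node may use a blur or a different falloff, and feathered_box_mask is a hypothetical helper.

```python
# Illustrative sketch: a linear inward ramp is an assumption; the node may
# feather its masks differently (for example with a blur).

def feathered_box_mask(width, height, box, feather):
    """Return an H x W mask whose value ramps up over `feather` pixels inside the box."""
    x0, y0, x1, y1 = box  # pixel coordinates, x1 and y1 exclusive
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = x0 <= x < x1 and y0 <= y < y1
            if not inside:
                row.append(0.0)
            elif feather <= 0:
                row.append(1.0)  # feather 0: hard rectangle edge
            else:
                # Pixels from the nearest box edge, counting the border pixel as 1.
                dist = min(x - x0, x1 - 1 - x, y - y0, y1 - 1 - y) + 1
                row.append(min(1.0, dist / feather))
        mask.append(row)
    return mask

hard = feathered_box_mask(9, 9, (0, 0, 9, 9), feather=0)
soft = feathered_box_mask(9, 9, (0, 0, 9, 9), feather=4)
print(soft[4])  # centre row: ramps up from the edge, flat in the middle
```

A larger feather widens the ramp, so small boxes may never reach full strength in their centre.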
default_region_strength controls mask opacity/weight for generated region masks and latent_with_noise_mask. In single_prompt_fast it does not make the text prompt stronger. For img2img edits, keep it near 1.0 unless you intentionally want a weaker/noisier mask edge effect; use sampler denoise for edit intensity.
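One way to picture region strength is as a multiplier on the mask values, as in the sketch below. The clamping to the 0.0 to 1.0 range and the apply_strength name are assumptions for illustration, not the node's confirmed behaviour.

```python
# Illustrative sketch: treating strength as a clamped multiplier is an assumption.

def apply_strength(mask, strength):
    """Scale every mask value by strength, clamped to the range 0.0 to 1.0."""
    strength = max(0.0, min(1.0, strength))
    return [[value * strength for value in row] for row in mask]

mask = [[0.0, 0.5, 1.0]]
full = apply_strength(mask, 1.0)  # mask unchanged
half = apply_strength(mask, 0.5)  # editable area only half-weighted
```

At 0.5 even the centre of a box is only partly editable, which is why edit intensity is better controlled with sampler denoise.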