362 changes: 362 additions & 0 deletions contrib/models/Cosmos3-Text2Image/README.md
# Contrib Model: Cosmos3-Text2Image

NeuronX Distributed Inference implementation of the NVIDIA Cosmos3 omnimodal
Mixture-of-Transformers (MoT) model for text-to-image generation.

## Model Information

- **Models:** Cosmos3-Nano (16B), Cosmos3-Super-Text2Image (65B)
- **HuggingFace ID:** `nvidia/Cosmos3-Nano`, `nvidia/Cosmos3-Super-Text2Image`
- **Model Type:** Diffusion Transformer (MoT architecture)
- **Task:** Text-to-Image/Video Generation (512x512, 1024x1024, video up to 61 frames)
- **License:** Check HuggingFace model card

## Architecture Details

Cosmos3 uses a **Mixture-of-Transformers (MoT)** architecture:
- Dual-stream processing: text (understanding) and vision (generation) pathways
- Joint MMDiT-style attention: text uses causal self-attention, vision attends bidirectionally to all tokens
- Separate SwiGLU MLPs per stream (text MLP + generation MLP in each layer)
- M-RoPE (Multimodal Rotary Position Embedding) with 3 axes (T, H, W)
- QK normalization (per-head RMSNorm)
- GQA (Grouped Query Attention)
- VAE: AutoencoderKLWan with 48 latent channels, patch_size=2, spatial_compression=16
- Scheduler: UniPCMultistepScheduler (35 steps, flow matching)
- CFG scale: 6.0
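
The joint attention pattern listed above can be pictured as a boolean mask. The sketch below is illustrative only: it assumes text tokens precede vision tokens in the sequence and that "causal" means a text token sees only itself and earlier text tokens. `build_joint_mask` is a hypothetical helper, not part of this repository.

```python
def build_joint_mask(num_text: int, num_vision: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumed layout: text tokens first, then vision tokens.
    """
    total = num_text + num_vision
    mask = []
    for q in range(total):
        if q < num_text:
            # Text query: causal, so only keys at or before its own position.
            row = [k <= q for k in range(total)]
        else:
            # Vision query: bidirectional over every text and vision token.
            row = [True] * total
        mask.append(row)
    return mask


mask = build_joint_mask(num_text=3, num_vision=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The text rows form a lower triangle and the vision rows are fully populated.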

| | Cosmos3-Nano | Cosmos3-Super |
|--|--|--|
| Parameters | 16B | 65B |
| hidden_size | 4096 | 5120 |
| intermediate_size | 12288 | 25600 |
| Layers | 36 | 64 |
| Q Heads | 32 | 64 |
| KV Heads | 8 | 8 |
| Instance | trn2.3xlarge | trn2.48xlarge |
| TP Degree | 4 | 8 |
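
As a quick sanity check on the table, the TP degree should divide both head counts. The arithmetic below is a sketch: `heads_per_rank` is a hypothetical helper, and the even split across ranks is an assumption rather than a description of how this implementation shards or replicates heads.

```python
def heads_per_rank(q_heads: int, kv_heads: int, tp_degree: int) -> tuple[int, int, int]:
    """Return (Q heads per rank, KV heads per rank, Q heads per KV head)."""
    if q_heads % tp_degree or kv_heads % tp_degree:
        raise ValueError("TP degree must divide both head counts evenly")
    if q_heads % kv_heads:
        raise ValueError("GQA needs Q heads to be a multiple of KV heads")
    return q_heads // tp_degree, kv_heads // tp_degree, q_heads // kv_heads


nano = heads_per_rank(q_heads=32, kv_heads=8, tp_degree=4)
super_ = heads_per_rank(q_heads=64, kv_heads=8, tp_degree=8)
print("Nano:", nano)
print("Super:", super_)
```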

## Validation Results

**Validated:** 2026-06-12
**SDK:** 2.30 (torch-neuronx 2.9.0.2.14.27725)

### Cosmos3-Nano (trn2.3xlarge, TP=4)

| Metric | 512x512 (35 steps) | 512x512 CFG-parallel | 1024x1024 (50 steps) | 1024x1024 CFG-parallel |
|--------|-------|-------|-------|-------|
| Backbone latency | 33.4 ms/call | 49.7 ms (batch=2) | 80.6 ms/call | 131.1 ms (batch=2) |
| E2E generation | **2.79s** | **2.23s** | **8.63s** | **6.79s** |
| Per-step latency | 78.2 ms | 62.2 ms | 167.9 ms | 131.1 ms |
| VAE decode | 50 ms | 50 ms | 231 ms | 231 ms |
| Speedup vs sequential | - | 20% | - | 21% |
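
The E2E figures are consistent with `steps * per-step latency + VAE decode`. That decomposition is inferred from the numbers in the table, not stated by the benchmark itself, and `e2e_seconds` and `speedup` are hypothetical helpers.

```python
def e2e_seconds(steps: int, per_step_ms: float, vae_decode_ms: float) -> float:
    """Estimate end-to-end time as denoising steps plus one VAE decode."""
    return (steps * per_step_ms + vae_decode_ms) / 1000.0


def speedup(sequential_s: float, parallel_s: float) -> float:
    """Fractional reduction in wall-clock time."""
    return 1.0 - parallel_s / sequential_s


seq_512 = e2e_seconds(35, 78.2, 50)      # sequential CFG, 512x512
par_512 = e2e_seconds(35, 62.2, 50)      # CFG-parallel, 512x512
seq_1024 = e2e_seconds(50, 167.9, 231)   # sequential CFG, 1024x1024
par_1024 = e2e_seconds(50, 131.1, 231)   # CFG-parallel, 1024x1024

print(f"512:  {seq_512:.2f}s vs {par_512:.2f}s ({speedup(seq_512, par_512):.0%})")
print(f"1024: {seq_1024:.2f}s vs {par_1024:.2f}s ({speedup(seq_1024, par_1024):.0%})")
```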

### Cosmos3-Super-Text2Image (trn2.48xlarge, TP=8)

| Metric | 512x512 (35 steps) |
|--------|-------|
| Backbone latency | 79.5 ms/call |
| E2E generation | **5.81s** |
| Per-step latency | 164.6 ms |
| VAE decode | 50 ms |
| Image quality | High fidelity |

**Status:** VALIDATED (both variants)

## Setup

### Prerequisites

- AWS Neuron SDK 2.30+ (DLAMI `Deep Learning AMI Neuron (Ubuntu 24.04) 20260522`)
- trn2.3xlarge (Nano) or trn2.48xlarge (Super)
- `diffusers >= 0.39.0.dev0` (for Cosmos3 VAE support)

### Environment

```bash
# Activate pre-installed environment
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate

# Install diffusers from source (needed for Cosmos3 VAE)
pip install git+https://github.com/huggingface/diffusers.git
```

### Download Model

```bash
# Nano (33 GB)
huggingface-cli download nvidia/Cosmos3-Nano --local-dir /home/ubuntu/Cosmos3-Nano

# Super (124 GB)
huggingface-cli download nvidia/Cosmos3-Super-Text2Image --local-dir /home/ubuntu/Cosmos3-Super-Text2Image
```

## Usage

### 1. Compile Backbone

```bash
# Nano at 512x512 (TP=4, ~2 min compile)
python examples/compile.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --tp 4 \
    --output /home/ubuntu/compiled_cosmos3_nano

# Nano at 1024x1024 (TP=4, ~2 min compile)
python examples/compile.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --tp 4 \
    --height 1024 --width 1024 \
    --output /home/ubuntu/compiled_cosmos3_nano_1024

# Super at 512x512 (TP=8, ~5 min compile)
python examples/compile.py \
    --model-path /home/ubuntu/Cosmos3-Super-Text2Image \
    --tp 8 \
    --output /home/ubuntu/compiled_cosmos3_super

# Super at 1024x1024 (TP=8, ~5 min compile)
python examples/compile.py \
    --model-path /home/ubuntu/Cosmos3-Super-Text2Image \
    --tp 8 \
    --height 1024 --width 1024 \
    --output /home/ubuntu/compiled_cosmos3_super_1024

# CFG-parallel (batch=2, ~20% faster generation):
python examples/compile.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --tp 4 --cfg-parallel \
    --output /home/ubuntu/compiled_cosmos3_nano_cfgp

python examples/compile.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --tp 4 --height 1024 --width 1024 --cfg-parallel \
    --output /home/ubuntu/compiled_cosmos3_nano_1024_cfgp
```

### 2. Compile VAE

```bash
# For 512x512 output
python examples/compile_vae.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --output /home/ubuntu/compiled_vae/vae_512.pt

# For 1024x1024 output (~10 min compile)
python examples/compile_vae.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --height 1024 --width 1024 \
    --output /home/ubuntu/compiled_vae/vae_1024.pt
```

Note: The same compiled VAE works for both Nano and Super at a given resolution, because both models use the same VAE architecture.

### 3. Generate Images

```bash
# Nano at 512x512
python examples/generate.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --compiled-path /home/ubuntu/compiled_cosmos3_nano \
    --vae-path /home/ubuntu/compiled_vae/vae_512.pt \
    --tp 4 \
    --prompt "A cat sitting on a windowsill watching birds" \
    --output cat_512.png

# Nano at 1024x1024
python examples/generate.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --compiled-path /home/ubuntu/compiled_cosmos3_nano_1024 \
    --vae-path /home/ubuntu/compiled_vae/vae_1024.pt \
    --tp 4 --height 1024 --width 1024 --steps 50 \
    --prompt "A majestic snow-covered mountain at sunrise" \
    --output mountain_1024.png

# Super at 512x512
python examples/generate.py \
    --model-path /home/ubuntu/Cosmos3-Super-Text2Image \
    --compiled-path /home/ubuntu/compiled_cosmos3_super \
    --vae-path /home/ubuntu/compiled_vae/vae_512.pt \
    --tp 8 \
    --prompt "A majestic snow-covered mountain at sunrise with golden light" \
    --output mountain.png

# CFG-parallel mode (requires backbone compiled with --cfg-parallel):
python examples/generate.py \
    --model-path /home/ubuntu/Cosmos3-Nano \
    --compiled-path /home/ubuntu/compiled_cosmos3_nano_cfgp \
    --vae-path /home/ubuntu/compiled_vae/vae_512.pt \
    --tp 4 --cfg-parallel \
    --prompt "A golden retriever in autumn leaves" \
    --output dog_cfgp.png
```
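
For context, CFG-parallel replaces two batch=1 backbone calls per step (conditional and unconditional) with a single batch=2 call. The sketch below uses the standard classifier-free guidance combination with a toy stand-in for the backbone. `toy_backbone`, `guide`, and both step functions are hypothetical, and whether `generate.py` combines the two outputs in exactly this form is an assumption.

```python
CFG_SCALE = 6.0  # guidance scale listed under Architecture Details


def toy_backbone(batch: list[list[float]]) -> list[list[float]]:
    """Stand-in for the compiled backbone: one output row per input row."""
    return [[2.0 * x + 1.0 for x in row] for row in batch]


def guide(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Standard CFG combination: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def step_sequential(cond_in, uncond_in, scale=CFG_SCALE):
    # Two backbone calls, each with batch size 1.
    cond_out = toy_backbone([cond_in])[0]
    uncond_out = toy_backbone([uncond_in])[0]
    return guide(cond_out, uncond_out, scale)


def step_cfg_parallel(cond_in, uncond_in, scale=CFG_SCALE):
    # One backbone call with batch size 2.
    cond_out, uncond_out = toy_backbone([cond_in, uncond_in])
    return guide(cond_out, uncond_out, scale)


cond, uncond = [0.5, -1.0], [0.0, 0.25]
# Both paths produce the same guided output; only the call pattern differs.
assert step_sequential(cond, uncond) == step_cfg_parallel(cond, uncond)
```

Because the two paths are numerically equivalent in this sketch, any speedup comes from batching, which is why the backbone has to be compiled separately with `--cfg-parallel`.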

### Python API

```python
import torch
import torch_neuronx
from src.modeling_cosmos3 import (
    Cosmos3BackboneInferenceConfig,
    NeuronCosmos3BackboneApplication,
)
from src.pipeline import (
    build_position_ids, denoise, denormalize_latents, tokenize_prompt,
)
from neuronx_distributed_inference.models.config import NeuronConfig
from transformers import AutoTokenizer
from diffusers import UniPCMultistepScheduler

# Configure (Nano example)
neuron_config = NeuronConfig(tp_degree=4, world_size=4, torch_dtype=torch.bfloat16)
config = Cosmos3BackboneInferenceConfig(neuron_config=neuron_config)
config.max_text_len = 256
config.num_vision_patches = 256

# Load
app = NeuronCosmos3BackboneApplication(
    model_path="/home/ubuntu/Cosmos3-Nano/transformer", config=config
)
app.load("/home/ubuntu/compiled_cosmos3_nano")

# Tokenize
tokenizer = AutoTokenizer.from_pretrained("/home/ubuntu/Cosmos3-Nano", trust_remote_code=True)
cond_ids, cond_len = tokenize_prompt(tokenizer, "A sunset over the ocean")
uncond_ids, uncond_len = tokenize_prompt(tokenizer, "", negative=True)

# Position IDs
cond_pos = build_position_ids(256, cond_len, T=1, pH=16, pW=16)
uncond_pos = build_position_ids(256, uncond_len, T=1, pH=16, pW=16)

# Generate
latents = torch.randn(1, 48, 1, 32, 32, dtype=torch.float32)
scheduler = UniPCMultistepScheduler.from_pretrained("/home/ubuntu/Cosmos3-Nano", subfolder="scheduler")

latents = denoise(app, cond_ids, uncond_ids, cond_pos, uncond_pos, scheduler, latents)
latents = denormalize_latents(latents, "/home/ubuntu/Cosmos3-Nano/vae/config.json")

# Decode with VAE
vae = torch.jit.load("/home/ubuntu/compiled_vae/vae_512.pt")  # produced by "2. Compile VAE"
pixels = vae(latents.float()) # [1, 3, 1, 512, 512]
```
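
When timing the backbone, note 4 under Key Implementation Notes applies: warm up with both conditional and unconditional inputs first. A minimal sketch of such a timing helper follows. `time_after_warmup` is hypothetical, and `call` stands for whatever function invokes the backbone in your code.

```python
import time
from typing import Any, Callable, Sequence


def time_after_warmup(
    call: Callable[[Any], Any],
    warmup_inputs: Sequence[Any],
    timed_input: Any,
    repeats: int = 10,
) -> float:
    """Return mean latency in milliseconds, excluding first-call overhead."""
    # Run every input variant once so no path pays its first-call cost while timed.
    for inputs in warmup_inputs:
        call(inputs)

    start = time.perf_counter()
    for _ in range(repeats):
        call(timed_input)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


# Toy usage: a list records every call so the warmup calls are visible.
calls = []
latency_ms = time_after_warmup(calls.append, ["cond", "uncond"], "cond", repeats=3)
print(len(calls), f"{latency_ms:.4f} ms")
```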

## Testing

```bash
# Run integration tests
export COSMOS3_MODEL_PATH=/home/ubuntu/Cosmos3-Nano
export COSMOS3_COMPILED_PATH=/home/ubuntu/compiled_cosmos3_nano
export COSMOS3_VAE_PATH=/home/ubuntu/compiled_vae/vae_512.pt
export COSMOS3_TP_DEGREE=4

pytest test/integration/test_model.py --capture=tee-sys -v

# Or run manually:
python test/integration/test_model.py
```

## Key Implementation Notes

1. **Channel ordering in patchify/unpatchify**: Uses spatial-first, channels-last
`(p_h, p_w, C)` matching the reference einsum `"cthpwq->thwpqc"`. Getting this
wrong produces a 16x16 repeating tile pattern.

2. **Temporal margin**: Vision position IDs use `actual_text_len + 15000` as temporal
offset, matching `unified_3d_mrope_temporal_modality_margin=15000` in the reference.

3. **Tokenization**: Must use the full chat template format with system prompt +
resolution template + special tokens (eos + vision_start).

4. **Warm up both CFG paths**: The backbone must be warmed up with both conditional and
unconditional inputs before timing. Without this, first-call overhead adds ~2.6s.

5. **Super model compilation**: Models with more than 36 layers require `--verify-hlo=false`
and `--modular-flow-mac-threshold=10` to avoid a pre-partition HBM verification failure.
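
To make note 1 concrete, here is a dependency-free sketch of the `(p_h, p_w, C)` patch ordering implied by `"cthpwq->thwpqc"`. It works on nested lists rather than tensors. This `patchify` is an illustrative stand-in written for this note, not the function in `src/pipeline`.

```python
def patchify(latent, p=2):
    """Split a [C][T][H][W] nested list into patches ordered (t, h, w).

    Each patch is flattened as (p_h, p_w, C): rows within the patch first,
    then columns, with channels varying fastest.
    """
    C, T = len(latent), len(latent[0])
    H, W = len(latent[0][0]), len(latent[0][0][0])
    patches = []
    for t in range(T):
        for h in range(H // p):
            for w in range(W // p):
                patches.append([
                    latent[c][t][h * p + i][w * p + j]
                    for i in range(p)      # p_h: row inside the patch
                    for j in range(p)      # p_w: column inside the patch
                    for c in range(C)      # channels last
                ])
    return patches


# Tiny example: 2 channels, 1 frame, 2x2 latent; each value encodes (c, y, x).
latent = [[[[c * 100 + y * 10 + x for x in range(2)] for y in range(2)]]
          for c in range(2)]
print(patchify(latent))
```

In the printed patch, the two channel values for each pixel sit next to each other. A channels-first flattening would instead group all of channel 0 before channel 1, which is the kind of mismatch the note warns about.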

## Compatibility Matrix

| Instance | Nano (TP=4) | Super (TP=8) |
|----------|-------------|--------------|
| trn2.3xlarge (LNC=2) | **Working** (512, 1024) | N/A (HBM limit) |
| trn2.48xlarge (LNC=2) | Working | **Working** (512, 1024) |

## Video Generation (Experimental)

The Cosmos3 backbone is **modality-agnostic** — the same compiled model that generates
images can generate video by providing temporal position IDs (T > 1 in the M-RoPE
encoding). No recompilation is needed if the total patch count matches an existing
compiled model.

### How It Works

The backbone processes a flat sequence of text + vision patches. For images, vision
patches come from a 2D spatial grid. For video, patches span a 3D grid (T × H × W):

| Modality | T_lat | pH × pW | Total Patches | Compiled Model to Use |
|----------|-------|---------|---------------|--------------------|
| Image 512×512 | 1 | 16×16 | 256 | compile at `--height 512 --width 512` |
| Image 1024×1024 | 1 | 32×32 | 1024 | compile at `--height 1024 --width 1024` |
| Video 13f@512 | 4 | 16×16 | 1024 | **Reuse 1024p image model!** |
| Video 29f@512 | 8 | 16×16 | 2048 | compile at 2048 patches |
| Video 61f@512 | 16 | 16×16 | 4096 | compile at 4096 patches |

The temporal latent count is: `T_lat = (raw_frames - 1) // 4 + 1` (VAE temporal
compression factor = 4).
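
The frame-count arithmetic behind the table can be checked with a few lines. The helper names are hypothetical, and the 16×16 patch grid assumes 512×512 frames as in the table.

```python
def temporal_latents(raw_frames: int, temporal_compression: int = 4) -> int:
    """T_lat = (raw_frames - 1) // temporal_compression + 1."""
    return (raw_frames - 1) // temporal_compression + 1


def video_patches(raw_frames: int, p_h: int = 16, p_w: int = 16) -> int:
    """Total vision patches for a clip: T_lat * pH * pW."""
    return temporal_latents(raw_frames) * p_h * p_w


for frames in (1, 13, 29, 61):
    print(frames, temporal_latents(frames), video_patches(frames))
```

A 13-frame clip at 512×512 lands on 1024 patches, the same count as a single 1024×1024 image, which is why the 1024p image model can be reused.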

### Example: Generate Video with Existing 1024p Model

```python
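import torch
from src.pipeline import build_position_ids, patchify, unpatchify, denoise

# Assumes `backbone`, `cond_ids`, `uncond_ids`, and `scheduler` are set up as in the
# Python API section (`backbone` is the loaded application), and that `actual_text_len`
# and `uncond_text_len` are the token counts returned by `tokenize_prompt`.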

# Use T_lat=4 (13 raw frames) at 512x512 → 4×16×16 = 1024 patches
# Same compiled model as 1024x1024 image generation!
T_lat = 4
pH, pW = 16, 16

# Build video position IDs (temporal axis spans T_lat values)
cond_pos = build_position_ids(256, actual_text_len, T=T_lat, pH=pH, pW=pW)
uncond_pos = build_position_ids(256, uncond_text_len, T=T_lat, pH=pH, pW=pW)

# Initial noise with temporal dimension
latents = torch.randn(1, 48, T_lat, 32, 32, dtype=torch.float32)

# Denoise (pipeline handles patchify/unpatchify with T>1 automatically)
latents = denoise(backbone, cond_ids, uncond_ids, cond_pos, uncond_pos,
                  scheduler, latents, num_steps=35)
```

### Measured Video Performance (Nano, TP=4, trn2.3xlarge)

| Video Config | Raw Frames | T_lat | Patches | Per-call Latency | Total (35 steps) |
|-------------|-----------|-------|---------|-----------------|-------------------|
| 13f @ 512×512 | 13 | 4 | 1024 | ~83 ms | 5.83s |
| 29f @ 512×512 | 29 | 8 | 2048 | ~121 ms | 8.45s |
| 61f @ 512×512 | 61 | 16 | 4096 | ~239 ms | 16.73s |

### Limitations

- **VAE decode**: The compiled image VAE only handles T=1. Per-frame decoding works as
an approximation. A proper 3D video VAE compilation is needed for production quality.
- **Maximum sequence length**: Tested up to 8192 patches (seq_len=8448) on trn2.3xlarge.
Longer videos (189 frames = 41k patches) require context parallelism.
- **Video quality**: Without the proper 3D VAE, temporal consistency between decoded
frames depends on the per-frame decode approximation.

## Supported Resolutions

The backbone can be compiled at any resolution divisible by 32. Compile time and
latency scale with sequence length (number of vision patches).

| Resolution | Vision Patches | Total Seq Len | Compile Time (Nano) | Backbone Latency/Call (Nano) |
|-----------|---------------|---------------|--------------------|--------------------|
| 512x512 | 256 | 512 | ~2 min | 33.4 ms |
| 768x768 | 576 | 832 | ~2 min | ~50 ms |
| 1024x1024 | 1024 | 1280 | ~2 min | 80.6 ms |

The VAE must be compiled separately for each target resolution.
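
The patch and sequence-length columns follow from the VAE spatial compression (16) and patch size (2). The sketch below assumes a fixed text length of 256 tokens, matching `config.max_text_len = 256` in the Python API example. `sequence_shape` is a hypothetical helper.

```python
SPATIAL_COMPRESSION = 16   # VAE spatial_compression
PATCH_SIZE = 2             # VAE patch_size
TEXT_LEN = 256             # assumed fixed text length (config.max_text_len)


def sequence_shape(height: int, width: int) -> tuple[int, int, int, int]:
    """Return (pH, pW, vision patches, total sequence length)."""
    stride = SPATIAL_COMPRESSION * PATCH_SIZE  # 32 pixels per patch side
    if height % stride or width % stride:
        raise ValueError(f"height and width must be divisible by {stride}")
    p_h, p_w = height // stride, width // stride
    patches = p_h * p_w
    return p_h, p_w, patches, patches + TEXT_LEN


for side in (512, 768, 1024):
    print(side, sequence_shape(side, side))
```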


## Maintainer

Annapurna Labs

**Last Updated:** 2026-06-12