NVIDIA · lfengad · May 12, 2026
diff --git a/.claude/skills/cosmos-utils-vlm-migration/SKILL.md b/.claude/skills/cosmos-utils-vlm-migration/SKILL.md
@@ -0,0 +1,132 @@
+---
+name: cosmos-utils-vlm-migration
+description: >
+  Redirect edits, patches, or PRs that target pre-2026-05-18 paths under
+  cosmos_training/cosmos/utils/vfm/vlm/, cosmos_training/cosmos/utils/vfm/fused_adam.py,
+  or cosmos_training/cosmos/utils/vlm/compute_flops_qwen3vl.py to their post-refactor
+  locations. Use this skill whenever a diff, cherry-pick, rebase, code-review suggestion,
+  blame trail, or external snippet references the old paths/imports, OR when applying any
+  upstream change that touches the vlm utils tree. Triggers on: "cherry-pick", "rebase",
+  "apply patch", "port this change", "merge upstream", "from cosmos.utils.vfm.vlm",
+  "from cosmos.utils.vfm.fused_adam", "compute_flops_qwen3vl", or any edit whose target
+  file path includes utils/vfm/vlm/ or utils/vfm/fused_adam.py. Use proactively before
+  applying any change to these areas of the repo.
+---
+
+# Cosmos utils/vlm consolidation — pre-2026-05-18 → post-refactor mapping
+
+On 2026-05-18 the duplicated `utils/vfm/vlm/` tree was merged into `utils/vlm/`, and
+`utils/vfm/fused_adam.py` was promoted to top-level `utils/fused_adam.py`. Any change
+that targets the old paths must be **rewritten** against the new layout before it can be
+applied — the old files have been deleted from HEAD.
+
+## Hard rules
+
+1. **Never** restore the deleted files (`utils/vfm/vlm/*`, `utils/vfm/fused_adam.py`,
+   `utils/vlm/compute_flops_qwen3vl.py`). If a patch tries to, it's stale — redirect.
+2. **Always** verify the new file already contains the equivalent feature before adding
+   it. The merged files are a *superset* of both forks' behavior; the change you're
+   porting may already be present.
+3. If you can't find an equivalent symbol in the new location, stop and ask — silent
+   feature loss is worse than the consolidation churn.
+
+## Path redirect table
+
+| Old import / file path | New import / file path | Notes |
+|---|---|---|
+| `cosmos.utils.vfm.vlm.constant` | `cosmos.utils.vlm.constant` | identical names exported |
+| `cosmos.utils.vfm.vlm.create_position_ids` | `cosmos.utils.vlm.create_position_ids` | `get_position_ids`, `get_rope_index_qwen3_vl` |
+| `cosmos.utils.vfm.vlm.flop_calculator` | `cosmos.utils.vlm.flop_calculator` | `FlopCalculator` |
+| `cosmos.utils.vfm.vlm.optimizer` | `cosmos.utils.vlm.optimizer` | `OptimizerConfig`, `build_optimizers`, `build_lr_schedulers` |
+| `cosmos.utils.vfm.vlm.pretrained_models_downloader` | `cosmos.utils.vlm.pretrained_models_downloader` | `maybe_download_hf_model_from_s3`, `parallel_download_s3_prefix_to_dir`, `s3_dir_exists`, `has_model_weights`, `_load_s3_credentials`, `_download_from_hf_hub` |
+| `cosmos.utils.vfm.fused_adam` | `cosmos.utils.fused_adam` | `FusedAdam` (DTensor-aware) |
+| `cosmos.utils.vlm.compute_flops_qwen3vl` | `cosmos.tools.flops.qwen3_vl` | `compute_qwen3vl_flops_from_config` now accepts `is_causal` (defaults to True; the only in-tree caller, `utils/vlm/flop_calculator.py`, passes `is_causal=False`). Numeric output verified bit-identical when `is_causal=False`. |
+
+## Feature mapping inside merged files (what came from where)
+
+If a backport touches a specific symbol/feature, this tells you whether the new file
+already has it.
+
+### `utils/vlm/optimizer.py` — `OptimizerConfig`
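+Contains the union of both forks. In the notes below, "vlm-only" means the feature came only from the old `utils/vlm/` fork, and "vfm/vlm-only" means it came only from the old `utils/vfm/vlm/` fork: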
+- Legacy named freeze flags: `freeze_vision_encoder`, `freeze_mm_projector`, `freeze_llm` (both forks)
+- `freeze_llm_moe_gates: bool = False` — was vlm-only (declared but not yet referenced in code as of merge)
+- `trainable_params: Optional[list[str]] = None` — was vfm/vlm-only; regex whitelist; enforced in `__attrs_post_init__`
+- `frozen_params: Optional[list[str]] = None` — was vfm/vlm-only; regex blacklist; mutually exclusive with `trainable_params`
+- `betas` now wrapped in `tuple(...)` inside `build_optimizers` — was vfm/vlm-only bugfix
+
+### `utils/vlm/pretrained_models_downloader.py`
+Contains the union:
+- `resolve_hf_model_store(credentials, bucket)` — was vlm-only; maps checkpoint-store creds to permanent HF model store
+- `_load_s3_credentials(credential_path)` — was vfm/vlm-only; env-var-aware via `cosmos.utils.easy_io.backends.auto_auth` (replaces raw `json.load(open(...))`)
+- `_download_from_hf_hub(model_name_or_path, include_model_weights)` — was vfm/vlm-only; HF Hub fallback when no S3 creds
+- `_stream_download` (inside `parallel_download_s3_prefix_to_dir`) — was vlm-only; bypasses ETag validation for GCS-compatible buckets
+- `maybe_download_hf_model_from_s3` body:
+  - Local-dir short-circuit (`if os.path.isdir(model_name_or_path)`) — was vfm/vlm-only
+  - No-credentials → `_download_from_hf_hub` branch — was vfm/vlm-only
+  - `not INTERNAL` → `CheckpointConfig.maybe_from_uri` + `download_checkpoint_v2` branch — was vfm/vlm-only
+  - Cache check accepts `vocab.json` OR `tokenizer.json` — was vlm-only (vfm/vlm checked only `vocab.json`)
+
+### `utils/vlm/flop_calculator.py`
+The vfm/vlm version was adopted wholesale:
+- Imports from `cosmos.tools.flops.qwen3_vl` (canonical), not from sibling `compute_flops_qwen3vl`
+- Adds `_IS_CAUSAL_FOR_CALIBRATION: bool = False` class constant
+- Calls `compute_qwen3vl_flops_from_config(..., is_causal=self._IS_CAUSAL_FOR_CALIBRATION)`
+- Numeric verification (2026-05-18): output dict bit-identical to old behavior across dense/MoE × text/image/video cases when `is_causal=False`
+
+### `utils/vlm/create_position_ids.py`, `utils/vlm/constant.py`
+The vfm/vlm version was adopted wholesale. Logic-identical to the prior `utils/vlm/`
+version — only docstrings, type annotations, and `Optional[T]` → `T | None` differ.
+
+### `utils/fused_adam.py` (was `utils/vfm/fused_adam.py`)
+DTensor-aware via `cosmos.utils.misc.get_local_tensor_if_DTensor`. For non-DTensor params
+(the only kind the old top-level `utils/fused_adam.py` handled), behavior is equivalent:
+`get_local_tensor_if_DTensor(x)` is a no-op for regular tensors. TE import path is
+`transformer_engine_torch as tex` (unchanged from top-level pre-refactor).
+
+## NOT consolidated — two `fused_adam.py` remain by design
+
+`utils/fused_adam.py` and `utils/vlm/fused_adam.py` both still exist. They are **not**
+duplicates:
+
+- `utils/fused_adam.py`: imports `transformer_engine_torch as tex`, uses
+  `cosmos.utils.misc.get_local_tensor_if_DTensor`.
+- `utils/vlm/fused_adam.py`: imports `transformer_engine as te`, uses
+  `te.pytorch.optimizers.multi_tensor_adam`, with an inlined `get_local_tensor_if_DTensor`.
+
+These differ in their TE module path. Unifying them requires verifying that
+`transformer_engine_torch.multi_tensor_adam*` and
+`te.pytorch.optimizers.multi_tensor_adam*` resolve to equivalent CUDA kernels at the
+runtime TE version. **Do not unify without that runtime verification.**
+
+## Import sites that were redirected in the 2026-05-18 PR
+
+These already point at the new paths in HEAD; if you see an external patch still using
+the old paths, redirect:
+
+- `cosmos/model/vfm/vlm_model.py` (3 import sites)
+- `cosmos/model/vfm/algorithm/loss/cross_entropy.py`
+- `cosmos/data/vfm/augmentors/vlm/tokenize_data.py`
+- `cosmos/data/vfm/processors/base.py`
+- `cosmos/data/vfm/processors/__init__.py`
+- `cosmos/utils/vfm/optimizer.py` (the `FusedAdam` lazy import)
+
+## Workflow when applying any change to the vlm utils area
+
+1. **Read the patch target path.** If it matches an entry in the redirect table, rewrite
+   the path before applying.
+2. **Check feature mapping above.** If the change adds/modifies a feature listed under
+   "Feature mapping inside merged files," confirm the merged file's current state — the
+   change may already be present (in which case it's a no-op), partially present (so you
+   need to merge carefully), or absent (port it).
+3. **For `flop_calculator.py` or anything calling `compute_qwen3vl_flops_from_config`:**
+   if the change touches the FLOP computation, re-run the equivalence check (see
+   `[[utils-vfm-vlm-forks]]` memory for context) before assuming the dynamic batcher
+   calibration still holds.
+4. **Never** create a new `utils/vfm/vlm/` directory or restore deleted files. If a
+   patch can't be cleanly applied to the new layout, stop and ask the user.
+
+## Related memory
+
+`[[utils-vfm-vlm-forks]]` in the project memory captures the consolidation history and
+the reasoning behind the leftover `utils/vlm/fused_adam.py`.
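
The redirect table in SKILL.md can be sketched as a small import rewriter. This is illustrative only and is not a file in this change: `REDIRECTS` is copied from the table, while `redirect_imports` and `_PATTERN` are hypothetical names introduced here. It rewrites module paths only; it does not perform the "feature already present?" check that the hard rules require.

```python
import re

# Old module path -> new module path, copied from the redirect table in SKILL.md.
REDIRECTS = {
    "cosmos.utils.vfm.vlm.constant": "cosmos.utils.vlm.constant",
    "cosmos.utils.vfm.vlm.create_position_ids": "cosmos.utils.vlm.create_position_ids",
    "cosmos.utils.vfm.vlm.flop_calculator": "cosmos.utils.vlm.flop_calculator",
    "cosmos.utils.vfm.vlm.optimizer": "cosmos.utils.vlm.optimizer",
    "cosmos.utils.vfm.vlm.pretrained_models_downloader": "cosmos.utils.vlm.pretrained_models_downloader",
    "cosmos.utils.vfm.fused_adam": "cosmos.utils.fused_adam",
    "cosmos.utils.vlm.compute_flops_qwen3vl": "cosmos.tools.flops.qwen3_vl",
}

# Match a full dotted module path only: not preceded by an identifier character
# or a dot, and not followed by an identifier character. Longest keys go first.
_PATTERN = re.compile(
    r"(?<![\w.])("
    + "|".join(re.escape(k) for k in sorted(REDIRECTS, key=len, reverse=True))
    + r")(?!\w)"
)


def redirect_imports(text: str) -> tuple[str, list[str]]:
    """Rewrite stale module paths in text; return (new_text, stale_paths_found)."""
    found: list[str] = []

    def _swap(match: re.Match) -> str:
        found.append(match.group(1))
        return REDIRECTS[match.group(1)]

    return _PATTERN.sub(_swap, text), found


if __name__ == "__main__":
    stale = "from cosmos.utils.vfm.fused_adam import FusedAdam\n"
    new_text, hits = redirect_imports(stale)
    print(new_text, hits)
```

A non-empty second return value signals that the patch is stale and should go through the workflow in SKILL.md before it is applied. Paths that are not in the table, such as `cosmos.utils.vfm.optimizer`, are left untouched.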
diff --git a/.config/rumdl.toml b/.config/rumdl.toml
@@ -0,0 +1,43 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# https://rumdl.dev/global-settings/
+[global]
+flavor = "standard"
+exclude = [
+    "ATTRIBUTIONS.md",
+    "_src",
+]
+disable = [
+    "MD013", # line-length
+    "MD033", # inline-html
+    "MD040", # fenced-code-language
+]
+
+# https://rumdl.dev/rules/
+
+[per-file-ignores]
+"README.md" = [
+    "MD041" # first-line-heading
+]
+
+# ul-style
+[MD004]
+style = "dash"
+
+# table-format
+[MD060]
+enabled = true
+style = "aligned"
diff --git a/.coveragerc b/.coveragerc
@@ -0,0 +1,32 @@
+# https://coverage.readthedocs.io/en/latest/subprocess.html
+
+[run]
+data_file = outputs/coverage/coverage
+disable_warnings =
+    module-not-imported
+    no-data-collected
+parallel = True
+patch = subprocess
+
+[report]
+exclude_lines =
+    @overload
+    def __repr__
+    if __name__ == .__main__.:
+    if TYPE_CHECKING:
+    pragma: no cover
+    raise AssertionError
+    raise NotImplementedError
+omit =
+    *_test.py
+skip_empty = True
+show_missing = True
+
+[html]
+directory = outputs/coverage/html
+
+[json]
+output = outputs/coverage/coverage.json
+
+[xml]
+output = outputs/coverage/coverage.xml
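
The `exclude_lines` entries in `.coveragerc` are regular expressions searched within each source line, which is why `if __name__ == .__main__.:` uses `.` in place of the quote characters. A minimal sketch of that matching, with the patterns copied from the config above; `is_excluded` is a hypothetical helper, not coverage.py's API, and it models only the per-line match (coverage.py also excludes the whole block that a matched line introduces).

```python
import re

# Patterns copied from [report] exclude_lines in .coveragerc.
EXCLUDE_LINES = [
    r"@overload",
    r"def __repr__",
    r"if __name__ == .__main__.:",
    r"if TYPE_CHECKING:",
    r"pragma: no cover",
    r"raise AssertionError",
    r"raise NotImplementedError",
]


def is_excluded(source_line: str) -> bool:
    """Return True if any exclusion regex is found anywhere in the line."""
    return any(re.search(pattern, source_line) for pattern in EXCLUDE_LINES)


if __name__ == "__main__":
    # The "." wildcards accept either quoting style.
    print(is_excluded('if __name__ == "__main__":'))
    print(is_excluded("if __name__ == '__main__':"))
    print(is_excluded("return compute(x)"))
```

Setting `exclude_lines` replaces coverage.py's built-in exclusion list rather than extending it, which is presumably why `pragma: no cover` is listed explicitly here.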
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,8 @@
+.venv
+.git
+/checkpoints
+/datasets
+/output
+/examples/**/checkpoints
+/examples/**/output
+/examples/**/datasets
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,30 @@
+*.lock linguist-generated=true
+tests/data/** linguist-generated=true
+ATTRIBUTIONS.md linguist-generated=true
+
+assets/** filter=lfs diff=lfs merge=lfs -text
+
+# Video files
+*.mp4 filter=lfs diff=lfs merge=lfs -text
+*.avi filter=lfs diff=lfs merge=lfs -text
+*.mov filter=lfs diff=lfs merge=lfs -text
+*.mkv filter=lfs diff=lfs merge=lfs -text
+*.webm filter=lfs diff=lfs merge=lfs -text
+
+# Audio files
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.mp3 filter=lfs diff=lfs merge=lfs -text
+*.flac filter=lfs diff=lfs merge=lfs -text
+*.aac filter=lfs diff=lfs merge=lfs -text
+
+# Image files
+*.jpg filter=lfs diff=lfs merge=lfs -text
+*.jpeg filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+*.tiff filter=lfs diff=lfs merge=lfs -text
+*.bmp filter=lfs diff=lfs merge=lfs -text
+
+# Logo thumbnail is small and was committed as a regular git blob before
+# LFS rules were introduced. Keep it out of LFS to preserve the existing
+# blob.
+cosmos-logo-thumbnail.png -filter -diff -merge text=auto
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,65 @@
+---
+name: Bug Report
+about: Report a reproducible bug or unexpected behavior
+title: "[BUG] <short description>"
+labels: 'bug'
+assignees:
+    - spectralflight
+    - jeanachoi
+
+---
+
+## Bug Description
+
+<!-- Clear and concise description of the bug. What did you expect? What happened? -->
+
+## Reproduction Steps
+
+```bash
+# Minimal command or script to reproduce
+```
+
+**Reproducibility:**
+
+- [ ] Always
+- [ ] Intermittently (~___% of the time)
+- [ ] Only once
+
+## Expected vs. Actual Behavior
+
+|              | Description                 |
+| ------------ | --------------------------- |
+| **Expected** | What you expected to happen |
+| **Actual**   | What actually happened      |
+
+## Outputs
+
+<details>
+<summary>Error / Stack Trace</summary>
+
+<!-- Attach or paste error / stack trace -->
+
+</details>
+
+<details>
+<summary>Log Files</summary>
+
+<!-- Attach or paste logs from the output directory -->
+
+</details>
+
+## System Information
+
+| Field                        | Value                                       |
+| ---------------------------- | ------------------------------------------- |
+| **Environment**              | <!-- e.g. UV, Docker -->                    |
+| **Hardware**                 | <!-- e.g. DGX H100 x8, single A100 80GB --> |
+| **OS**                       | <!-- e.g. Ubuntu 22.04 / 24.04 -->          |
+| **GPU Driver**               | <!-- e.g. 580.95.05 -->                     |
+| **CUDA Version**             | <!-- e.g. 12.8.1 -->                        |
+| **Python Version**           | <!-- e.g. 3.13.3 -->                        |
+| **Package Version / Commit** | <!-- e.g. v1.2.3 or git SHA -->             |
+
+## Additional Context
+
+<!-- Workarounds tried, related issues, etc. -->
diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml
@@ -0,0 +1,31 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: Pre-commit
+on:
+  pull_request:
+  push:
+    branches: [main]
+jobs:
+  pre-commit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          lfs: true
+      - uses: actions/setup-python@v6
+      - uses: astral-sh/setup-uv@v7
+      - run: uvx pre-commit@4.5.1 run -a -c ci/.pre-commit-config-base.yaml
+      - run: uvx pre-commit@4.5.1 run -a