Closed
26 commits
6c4bc1a
Import cosmos3 codebase from cosmos3-internal
lfengad May 12, 2026
4fc15a5
Merge Cosmos3 README content with deprecated-repo notice
lfengad May 12, 2026
0858cbc
Restructure: rename cosmos3 to cosmos-inference, add cosmos/ skeleton
lfengad May 12, 2026
619256f
Consolidate Cosmos3 content under cosmos-inference/, restore root layout
lfengad May 12, 2026
4e121d7
Add stub pyproject.toml and uv.lock for cosmos/ package at root
lfengad May 12, 2026
6dfdd92
Sync cosmos-inference/ with cosmos3-internal origin/main
lfengad May 14, 2026
c82be1c
Add minimal release
yy-code-nv May 14, 2026
0015a7f
Sync cosmos-inference/ with cosmos3-internal main (bf3b3ac)
lfengad May 15, 2026
8f24e33
Sync cosmos-inference/ with cosmos3-internal main (fc7f97d)
lfengad May 16, 2026
d26993d
Scaffold OSS documentation skeleton
lfengad May 16, 2026
e1c596b
Fill in docs/setup.md and sync root uv env with cosmos-inference
lfengad May 16, 2026
fd6b189
Frame root as framework, fill code_structure and inference docs
lfengad May 16, 2026
c7e8132
Add OSS hygiene files at root and mirror cosmos-inference with upstream
lfengad May 16, 2026
634c9a1
Port remaining dev configs and Docker assets from cosmos-inference
lfengad May 16, 2026
6b0d5d9
Apply pre-commit fixes; add rumdl config and .rumdl_cache gitignore
lfengad May 16, 2026
f64874f
AGENTS.md: add pointer to inference-side skills
lfengad May 16, 2026
5efd253
cosmos_training: re-release with 4 verified smokes + deterministic mode
yy-code-nv May 18, 2026
f831402
gitignore: untrack cosmos_training_meta/
yy-code-nv May 18, 2026
ec9841d
Merge branch 'main' into yangyangt/minimal_release
lfengad May 18, 2026
6479f77
Merge pull request #2 from nvidia-cosmos/yangyangt/minimal_release
lfengad May 18, 2026
af0872d
cosmos_training: relocate cosmos package source under cosmos_training/
lfengad May 18, 2026
96f95af
cosmos_training: consolidate utils/vfm/vlm and utils/vfm/fused_adam
lfengad May 18, 2026
bfc2e80
cosmos_training: mark viewer.py executable to match its shebang
lfengad May 18, 2026
01e6357
cosmos_training: clear executable bit on avae_utils library modules
lfengad May 18, 2026
d7174da
docs: fix broken relative links in READMEs
lfengad May 18, 2026
271a0ee
tests: add regression test for launch scripts loss/gradnorm
lfengad May 18, 2026
132 changes: 132 additions & 0 deletions .claude/skills/cosmos-utils-vlm-migration/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
---
name: cosmos-utils-vlm-migration
description: >
  Redirect edits, patches, or PRs that target pre-2026-05-18 paths under
  cosmos_training/cosmos/utils/vfm/vlm/, cosmos_training/cosmos/utils/vfm/fused_adam.py,
  or cosmos_training/cosmos/utils/vlm/compute_flops_qwen3vl.py to their post-refactor
  locations. Use this skill whenever a diff, cherry-pick, rebase, code-review suggestion,
  blame trail, or external snippet references the old paths/imports, OR when applying any
  upstream change that touches the vlm utils tree. Triggers on: "cherry-pick", "rebase",
  "apply patch", "port this change", "merge upstream", "from cosmos.utils.vfm.vlm",
  "from cosmos.utils.vfm.fused_adam", "compute_flops_qwen3vl", or any edit whose target
  file path includes utils/vfm/vlm/ or utils/vfm/fused_adam.py. Use proactively before
  applying any change to these areas of the repo.
---

# Cosmos utils/vlm consolidation — pre-2026-05-18 → post-refactor mapping

On 2026-05-18 the duplicated `utils/vfm/vlm/` tree was merged into `utils/vlm/`, and
`utils/vfm/fused_adam.py` was promoted to top-level `utils/fused_adam.py`. Any change
that targets the old paths must be **rewritten** against the new layout before it can be
applied — the old files have been deleted from HEAD.

## Hard rules

1. **Never** restore the deleted files (`utils/vfm/vlm/*`, `utils/vfm/fused_adam.py`,
`utils/vlm/compute_flops_qwen3vl.py`). If a patch tries to, it's stale — redirect.
2. **Always** verify the new file already contains the equivalent feature before adding
it. The merged files are a *superset* of both forks' behavior; the change you're
porting may already be present.
3. If you can't find an equivalent symbol in the new location, stop and ask — silent
feature loss is worse than the consolidation churn.

## Path redirect table

| Old import / file path | New import / file path | Notes |
|---|---|---|
| `cosmos.utils.vfm.vlm.constant` | `cosmos.utils.vlm.constant` | identical names exported |
| `cosmos.utils.vfm.vlm.create_position_ids` | `cosmos.utils.vlm.create_position_ids` | `get_position_ids`, `get_rope_index_qwen3_vl` |
| `cosmos.utils.vfm.vlm.flop_calculator` | `cosmos.utils.vlm.flop_calculator` | `FlopCalculator` |
| `cosmos.utils.vfm.vlm.optimizer` | `cosmos.utils.vlm.optimizer` | `OptimizerConfig`, `build_optimizers`, `build_lr_schedulers` |
| `cosmos.utils.vfm.vlm.pretrained_models_downloader` | `cosmos.utils.vlm.pretrained_models_downloader` | `maybe_download_hf_model_from_s3`, `parallel_download_s3_prefix_to_dir`, `s3_dir_exists`, `has_model_weights`, `_load_s3_credentials`, `_download_from_hf_hub` |
| `cosmos.utils.vfm.fused_adam` | `cosmos.utils.fused_adam` | `FusedAdam` (DTensor-aware) |
| `cosmos.utils.vlm.compute_flops_qwen3vl` | `cosmos.tools.flops.qwen3_vl` | `compute_qwen3vl_flops_from_config` now accepts `is_causal` (defaults to True; the only in-tree caller, `utils/vlm/flop_calculator.py`, passes `is_causal=False`). Numeric output verified bit-identical when `is_causal=False`. |
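
The redirect table can be applied mechanically to import statements. The following is a minimal sketch, not part of the repository: `REDIRECTS` is copied from the table above, while `redirect_imports` is a hypothetical helper name chosen for illustration. It rewrites only whole dotted module paths, so unrelated modules that merely share a prefix are left alone.

```python
import re

# Old dotted module path -> new dotted module path, copied from the redirect table.
REDIRECTS = {
    "cosmos.utils.vfm.vlm.constant": "cosmos.utils.vlm.constant",
    "cosmos.utils.vfm.vlm.create_position_ids": "cosmos.utils.vlm.create_position_ids",
    "cosmos.utils.vfm.vlm.flop_calculator": "cosmos.utils.vlm.flop_calculator",
    "cosmos.utils.vfm.vlm.optimizer": "cosmos.utils.vlm.optimizer",
    "cosmos.utils.vfm.vlm.pretrained_models_downloader": "cosmos.utils.vlm.pretrained_models_downloader",
    "cosmos.utils.vfm.fused_adam": "cosmos.utils.fused_adam",
    "cosmos.utils.vlm.compute_flops_qwen3vl": "cosmos.tools.flops.qwen3_vl",
}

# Longest paths first so a longer module path is never shadowed by a shorter one.
# The lookarounds reject matches embedded in a longer identifier or dotted path.
_PATTERN = re.compile(
    r"(?<![\w.])("
    + "|".join(re.escape(old) for old in sorted(REDIRECTS, key=len, reverse=True))
    + r")(?!\w)"
)


def redirect_imports(source: str) -> str:
    """Rewrite every old module path in `source` to its post-refactor path (hypothetical helper)."""
    return _PATTERN.sub(lambda m: REDIRECTS[m.group(1)], source)


print(redirect_imports("from cosmos.utils.vfm.vlm.optimizer import OptimizerConfig"))
```

Rewriting the import is not sufficient for `compute_qwen3vl_flops_from_config`: per the table, the relocated function defaults to `is_causal=True`, so a ported caller that relied on the old behavior would also need to pass `is_causal=False`.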

## Feature mapping inside merged files (what came from where)

If a backport touches a specific symbol/feature, this tells you whether the new file
already has it.

### `utils/vlm/optimizer.py` — `OptimizerConfig`
Contains the union of both forks:
- Legacy named freeze flags: `freeze_vision_encoder`, `freeze_mm_projector`, `freeze_llm` (both forks)
- `freeze_llm_moe_gates: bool = False` — was vlm-only (declared but not yet referenced in code as of merge)
- `trainable_params: Optional[list[str]] = None` — was vfm/vlm-only; regex whitelist; enforced in `__attrs_post_init__`
- `frozen_params: Optional[list[str]] = None` — was vfm/vlm-only; regex blacklist; mutually exclusive with `trainable_params`
- `betas` now wrapped in `tuple(...)` inside `build_optimizers` — was vfm/vlm-only bugfix
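
To make the whitelist/blacklist semantics concrete, here is a minimal sketch of the mutual-exclusion check. It is not the real `OptimizerConfig`: `FreezeSpec` and `is_trainable` are hypothetical names, a standard-library dataclass with `__post_init__` stands in for the attrs class and its `__attrs_post_init__`, and matching with `re.search` is an assumption, since the notes above only say the fields hold regexes.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreezeSpec:
    """Hypothetical stand-in for the two regex fields of OptimizerConfig."""

    trainable_params: Optional[list[str]] = None  # regex whitelist
    frozen_params: Optional[list[str]] = None  # regex blacklist

    def __post_init__(self) -> None:
        # The two lists are mutually exclusive, mirroring the rule stated above.
        if self.trainable_params is not None and self.frozen_params is not None:
            raise ValueError("trainable_params and frozen_params are mutually exclusive")

    def is_trainable(self, param_name: str) -> bool:
        # Assumption: a pattern matches if re.search finds it anywhere in the name.
        if self.trainable_params is not None:
            return any(re.search(p, param_name) for p in self.trainable_params)
        if self.frozen_params is not None:
            return not any(re.search(p, param_name) for p in self.frozen_params)
        return True  # no list given: nothing is frozen by these fields
```

With a whitelist, every parameter that matches no pattern is frozen; with a blacklist, every parameter that matches no pattern stays trainable. When porting a change, check which of the two modes the original fork assumed.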

### `utils/vlm/pretrained_models_downloader.py`
Contains the union:
- `resolve_hf_model_store(credentials, bucket)` — was vlm-only; maps checkpoint-store creds to permanent HF model store
- `_load_s3_credentials(credential_path)` — was vfm/vlm-only; env-var-aware via `cosmos.utils.easy_io.backends.auto_auth` (replaces raw `json.load(open(...))`)
- `_download_from_hf_hub(model_name_or_path, include_model_weights)` — was vfm/vlm-only; HF Hub fallback when no S3 creds
- `_stream_download` (inside `parallel_download_s3_prefix_to_dir`) — was vlm-only; bypasses ETag validation for GCS-compatible buckets
- `maybe_download_hf_model_from_s3` body:
  - Local-dir short-circuit (`if os.path.isdir(model_name_or_path)`) — was vfm/vlm-only
  - No-credentials → `_download_from_hf_hub` branch — was vfm/vlm-only
  - `not INTERNAL` → `CheckpointConfig.maybe_from_uri` + `download_checkpoint_v2` branch — was vfm/vlm-only
  - Cache check accepts `vocab.json` OR `tokenizer.json` — was vlm-only (vfm/vlm checked only `vocab.json`)
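
The relaxed cache check in the last bullet can be pictured as follows. This is a rough sketch only: `cache_has_tokenizer` is a hypothetical name, and the real function may inspect more than these two files.

```python
import os
import tempfile

# Either file counts as evidence of a cached tokenizer (per the bullet above).
TOKENIZER_MARKERS = ("vocab.json", "tokenizer.json")


def cache_has_tokenizer(model_dir: str) -> bool:
    """Return True if `model_dir` contains at least one tokenizer marker file (hypothetical helper)."""
    return any(os.path.isfile(os.path.join(model_dir, name)) for name in TOKENIZER_MARKERS)


# Demonstration on a throwaway directory: a cache holding only tokenizer.json is accepted.
with tempfile.TemporaryDirectory() as cache_dir:
    before = cache_has_tokenizer(cache_dir)
    open(os.path.join(cache_dir, "tokenizer.json"), "w").close()
    after = cache_has_tokenizer(cache_dir)
```

A backport written against the vfm/vlm fork may still test for `vocab.json` alone; applying it verbatim would reintroduce the narrower check.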

### `utils/vlm/flop_calculator.py`
The vfm/vlm version was adopted wholesale:
- Imports from `cosmos.tools.flops.qwen3_vl` (canonical), not from sibling `compute_flops_qwen3vl`
- Adds `_IS_CAUSAL_FOR_CALIBRATION: bool = False` class constant
- Calls `compute_qwen3vl_flops_from_config(..., is_causal=self._IS_CAUSAL_FOR_CALIBRATION)`
- Numeric verification (2026-05-18): output dict bit-identical to old behavior across dense/MoE × text/image/video cases when `is_causal=False`

### `utils/vlm/create_position_ids.py`, `utils/vlm/constant.py`
The vfm/vlm version was adopted wholesale. Logic-identical to the prior `utils/vlm/`
version — only docstrings, type annotations, and `Optional[T]` → `T | None` differ.

### `utils/fused_adam.py` (was `utils/vfm/fused_adam.py`)
DTensor-aware via `cosmos.utils.misc.get_local_tensor_if_DTensor`. For non-DTensor params
(the only kind the old top-level `utils/fused_adam.py` handled), behavior is equivalent:
`get_local_tensor_if_DTensor(x)` is a no-op for regular tensors. TE import path is
`transformer_engine_torch as tex` (unchanged from top-level pre-refactor).

## NOT consolidated — two `fused_adam.py` remain by design

`utils/fused_adam.py` and `utils/vlm/fused_adam.py` both still exist. They are **not**
duplicates:

- `utils/fused_adam.py`: imports `transformer_engine_torch as tex`, uses
`cosmos.utils.misc.get_local_tensor_if_DTensor`.
- `utils/vlm/fused_adam.py`: imports `transformer_engine as te`, uses
`te.pytorch.optimizers.multi_tensor_adam`, with an inlined `get_local_tensor_if_DTensor`.

These differ in their TE module path. Unifying them requires verifying that
`transformer_engine_torch.multi_tensor_adam*` and
`te.pytorch.optimizers.multi_tensor_adam*` resolve to equivalent CUDA kernels at the
runtime TE version. **Do not unify without that runtime verification.**

## Import sites that were redirected in the 2026-05-18 PR

These already point at the new paths in HEAD; if you see an external patch still using
the old paths, redirect:

- `cosmos/model/vfm/vlm_model.py` (3 import sites)
- `cosmos/model/vfm/algorithm/loss/cross_entropy.py`
- `cosmos/data/vfm/augmentors/vlm/tokenize_data.py`
- `cosmos/data/vfm/processors/base.py`
- `cosmos/data/vfm/processors/__init__.py`
- `cosmos/utils/vfm/optimizer.py` (the `FusedAdam` lazy import)

## Workflow when applying any change to the vlm utils area

1. **Read the patch target path.** If it matches an entry in the redirect table, rewrite
the path before applying.
2. **Check feature mapping above.** If the change adds/modifies a feature listed under
"Feature mapping inside merged files," confirm the merged file's current state — the
change may already be present (in which case it's a no-op), partially present (so you
need to merge carefully), or absent (port it).
3. **For `flop_calculator.py` or anything calling `compute_qwen3vl_flops_from_config`:**
if the change touches the FLOP computation, re-run the equivalence check (see
`[[utils-vfm-vlm-forks]]` memory for context) before assuming the dynamic batcher
calibration still holds.
4. **Never** create a new `utils/vfm/vlm/` directory or restore deleted files. If a
patch can't be cleanly applied to the new layout, stop and ask the user.
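
Step 1 of the workflow can be sketched as a small scanner over a unified diff. This is an illustration, not tooling that ships with the repository: `stale_targets` is a hypothetical name, and the patterns cover only the three deleted locations named under "Hard rules".

```python
import re

# Locations deleted by the consolidation; matched against each diff target path.
STALE_PATTERNS = (
    re.compile(r"utils/vfm/vlm/"),
    re.compile(r"utils/vfm/fused_adam\.py$"),
    re.compile(r"utils/vlm/compute_flops_qwen3vl\.py$"),
)

# Unified diffs name their targets on '--- a/<path>' and '+++ b/<path>' lines.
_TARGET_LINE = re.compile(r"^(?:---|\+\+\+) [ab]/(\S+)")


def stale_targets(patch_text: str) -> list[str]:
    """Return, in order and without duplicates, diff targets that point at deleted files."""
    hits: list[str] = []
    for line in patch_text.splitlines():
        match = _TARGET_LINE.match(line)
        if not match:
            continue
        path = match.group(1)
        if path not in hits and any(p.search(path) for p in STALE_PATTERNS):
            hits.append(path)
    return hits
```

A non-empty result means the patch must be rewritten against the redirect table before it is applied. An empty result does not prove the patch is safe, because old import paths can still appear inside hunks that target surviving files.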

## Related memory

`[[utils-vfm-vlm-forks]]` in the project memory captures the consolidation history and
the reasoning behind the leftover `utils/vlm/fused_adam.py`.
43 changes: 43 additions & 0 deletions .config/rumdl.toml
@@ -0,0 +1,43 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# https://rumdl.dev/global-settings/
[global]
flavor = "standard"
exclude = [
"ATTRIBUTIONS.md",
"_src",
]
disable = [
"MD013", # line-length
"MD033", # inline-html
"MD040", # fenced-code-language
]

# https://rumdl.dev/rules/

[per-file-ignores]
"README.md" = [
"MD041" # first-line-heading
]

# ul-style
[MD004]
style = "dash"

# table-format
[MD060]
enabled = true
style = "aligned"
32 changes: 32 additions & 0 deletions .coveragerc
@@ -0,0 +1,32 @@
# https://coverage.readthedocs.io/en/latest/subprocess.html

[run]
data_file = outputs/coverage/coverage
disable_warnings =
    module-not-imported
    no-data-collected
parallel = True
patch = subprocess

[report]
exclude_lines =
    @overload
    def __repr__
    if __name__ == .__main__.:
    if TYPE_CHECKING:
    pragma: no cover
    raise AssertionError
    raise NotImplementedError
omit =
    *_test.py
skip_empty = True
show_missing = True

[html]
directory = outputs/coverage/html

[json]
output = outputs/coverage/coverage.json

[xml]
output = outputs/coverage/coverage.xml
8 changes: 8 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
.venv
.git
/checkpoints
/datasets
/output
/examples/**/checkpoints
/examples/**/output
/examples/**/datasets
30 changes: 30 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
*.lock linguist-generated=true
tests/data/** linguist-generated=true
ATTRIBUTIONS.md linguist-generated=true

assets/** filter=lfs diff=lfs merge=lfs -text

# Video files
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.avi filter=lfs diff=lfs merge=lfs -text
*.mov filter=lfs diff=lfs merge=lfs -text
*.mkv filter=lfs diff=lfs merge=lfs -text
*.webm filter=lfs diff=lfs merge=lfs -text

# Audio files
*.wav filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
*.flac filter=lfs diff=lfs merge=lfs -text
*.aac filter=lfs diff=lfs merge=lfs -text

# Image files
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.tiff filter=lfs diff=lfs merge=lfs -text
*.bmp filter=lfs diff=lfs merge=lfs -text

# Logo thumbnail is small and was committed as a regular git blob before
# LFS rules were introduced. Keep it out of LFS to preserve the existing
# blob.
cosmos-logo-thumbnail.png -filter -diff -merge text=auto
65 changes: 65 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,65 @@
---
name: Bug Report
about: Report a reproducible bug or unexpected behavior
title: "[BUG] <short description>"
labels: 'bug'
assignees:
- spectralflight
- jeanachoi

---

## Bug Description

<!-- Clear and concise description of the bug. What did you expect? What happened? -->

## Reproduction Steps

```bash
# Minimal command or script to reproduce
```

**Reproducibility:**

- [ ] Always
- [ ] Intermittently (~___% of the time)
- [ ] Only once

## Expected vs. Actual Behavior

| | Description |
| ------------ | --------------------------- |
| **Expected** | What you expected to happen |
| **Actual** | What actually happened |

## Outputs

<details>
<summary>Error / Stack Trace</summary>

<!-- Attach or paste error / stack trace -->

</details>

<details>
<summary>Log Files</summary>

<!-- Attach or paste logs from the output directory -->

</details>

## System Information

| Field | Value |
| ---------------------------- | ------------------------------------------- |
| **Environment** | <!-- e.g. UV, Docker --> |
| **Hardware** | <!-- e.g. DGX H100 x8, single A100 80GB --> |
| **OS** | <!-- e.g. Ubuntu 22.04 / 24.04 --> |
| **GPU Driver** | <!-- e.g. 580.95.05 --> |
| **CUDA Version** | <!-- e.g. 12.8.1 --> |
| **Python Version** | <!-- e.g. 3.13.3 --> |
| **Package Version / Commit** | <!-- e.g. v1.2.3 or git SHA --> |

## Additional Context

<!-- Workarounds tried, related issues, etc. -->
31 changes: 31 additions & 0 deletions .github/workflows/pre-commit.yml
@@ -0,0 +1,31 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Pre-commit
on:
  pull_request:
  push:
    branches: [main]
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          lfs: true
      - uses: actions/setup-python@v6
      - uses: astral-sh/setup-uv@v7
      - run: uvx pre-commit@4.5.1 run -a -c ci/.pre-commit-config-base.yaml
      - run: uvx pre-commit@4.5.1 run -a