Closed
37 commits
6c4bc1a
Import cosmos3 codebase from cosmos3-internal
lfengad May 12, 2026
4fc15a5
Merge Cosmos3 README content with deprecated-repo notice
lfengad May 12, 2026
0858cbc
Restructure: rename cosmos3 to cosmos-inference, add cosmos/ skeleton
lfengad May 12, 2026
619256f
Consolidate Cosmos3 content under cosmos-inference/, restore root layout
lfengad May 12, 2026
4e121d7
Add stub pyproject.toml and uv.lock for cosmos/ package at root
lfengad May 12, 2026
6dfdd92
Sync cosmos-inference/ with cosmos3-internal origin/main
lfengad May 14, 2026
c82be1c
Add minimal release
yy-code-nv May 14, 2026
0015a7f
Sync cosmos-inference/ with cosmos3-internal main (bf3b3ac)
lfengad May 15, 2026
8f24e33
Sync cosmos-inference/ with cosmos3-internal main (fc7f97d)
lfengad May 16, 2026
d26993d
Scaffold OSS documentation skeleton
lfengad May 16, 2026
e1c596b
Fill in docs/setup.md and sync root uv env with cosmos-inference
lfengad May 16, 2026
fd6b189
Frame root as framework, fill code_structure and inference docs
lfengad May 16, 2026
c7e8132
Add OSS hygiene files at root and mirror cosmos-inference with upstream
lfengad May 16, 2026
634c9a1
Port remaining dev configs and Docker assets from cosmos-inference
lfengad May 16, 2026
6b0d5d9
Apply pre-commit fixes; add rumdl config and .rumdl_cache gitignore
lfengad May 16, 2026
f64874f
AGENTS.md: add pointer to inference-side skills
lfengad May 16, 2026
5efd253
cosmos_training: re-release with 4 verified smokes + deterministic mode
yy-code-nv May 18, 2026
f831402
gitignore: untrack cosmos_training_meta/
yy-code-nv May 18, 2026
ec9841d
Merge branch 'main' into yangyangt/minimal_release
lfengad May 18, 2026
6479f77
Merge pull request #2 from nvidia-cosmos/yangyangt/minimal_release
lfengad May 18, 2026
af0872d
cosmos_training: relocate cosmos package source under cosmos_training/
lfengad May 18, 2026
9303ba4
tests: add regression test for launch scripts loss/gradnorm
lfengad May 18, 2026
bfb043d
cosmos_training: consolidate utils/vfm/vlm and utils/vfm/fused_adam
lfengad May 18, 2026
0e7b015
cosmos_training: mark viewer.py executable to match its shebang
lfengad May 18, 2026
63cfbfb
cosmos_training: clear executable bit on avae_utils library modules
lfengad May 18, 2026
21639f7
docs: fix broken relative links in READMEs
lfengad May 18, 2026
b66d9f8
cosmos_training: drop dead utils/optim_instantiate.py and utils/configs/
lfengad May 18, 2026
a4f7ea1
cosmos_training: drop the dead utils/one_logger/ subsystem
lfengad May 18, 2026
1c55568
cosmos_training: drop orphan training_telemetry/context_managers.py
lfengad May 18, 2026
004b14d
cosmos_training: drop three more orphans (flop_calculator + env_parsers)
lfengad May 18, 2026
13dc960
skill: broaden cosmos-utils-vlm-migration to cover follow-up deletions
lfengad May 18, 2026
d3be55a
pre-commit: exclude .claude/ from project hooks
lfengad May 18, 2026
52ed386
lint
lfengad May 18, 2026
de51b60
docs: fix stale references in training_telemetry README
lfengad May 18, 2026
0900113
cosmos_training: add SPDX/Apache-2.0 headers to files missing them
lfengad May 18, 2026
e50d804
Merge pull request #5 from nvidia-cosmos/liangf/utils-cleanup
lfengad May 19, 2026
00c1394
Do not set ckpt_type to dummy for VFM.
foreverlms May 19, 2026
185 changes: 185 additions & 0 deletions .claude/skills/cosmos-utils-vlm-migration/SKILL.md
@@ -0,0 +1,185 @@
---
name: cosmos-utils-vlm-migration
description: >
  Redirect edits, patches, or PRs that target pre-2026-05-18 paths under
  cosmos_training/cosmos/utils/. Covers (a) the vfm/vlm consolidation — utils/vfm/vlm/,
  utils/vfm/fused_adam.py, utils/vlm/compute_flops_qwen3vl.py — and (b) the follow-up
  cleanup pass that removed dead subsystems: utils/one_logger/, utils/optim_instantiate.py,
  utils/configs/lr_scheduler.py, utils/training_telemetry/context_managers.py,
  utils/vlm/flop_calculator.py, utils/env_parsers/customization_env_parser.py,
  utils/env_parsers/inference_env_parser.py, and the load_config(enable_one_logger=...)
  parameter. Use this skill whenever a diff, cherry-pick, rebase, code-review suggestion,
  blame trail, or external snippet references the old paths/imports, OR when applying
  any upstream change that touches utils/. Triggers on: "cherry-pick", "rebase",
  "apply patch", "port this change", "merge upstream", "from cosmos.utils.vfm.vlm",
  "from cosmos.utils.vfm.fused_adam", "from cosmos.utils.one_logger",
  "compute_flops_qwen3vl", "FlopCalculator", "optim_instantiate", "enable_one_logger",
  "OneLoggerCallback", "CustomizationEnvParser", "InferenceEnvParser", or any edit whose
  target file path is under utils/vfm/vlm/, utils/one_logger/, utils/configs/, or any
  other path listed in the redirect table below. Use proactively before applying any
  change to these areas of the repo.
---

# Cosmos utils/ refactor + cleanup — pre-2026-05-18 → post-refactor mapping

On 2026-05-18 several related changes landed in `cosmos_training/cosmos/utils/`:

1. The duplicated `utils/vfm/vlm/` tree was **merged into `utils/vlm/`** (feature union).
2. `utils/vfm/fused_adam.py` was **promoted to top-level `utils/fused_adam.py`** (DTensor-aware version).
3. A follow-up cleanup pass **deleted ~2400 lines** of dead/orphan code across the utils
tree: the broken `utils/one_logger/` subsystem, `utils/optim_instantiate.py`,
`utils/configs/`, `utils/training_telemetry/context_managers.py`,
`utils/vlm/flop_calculator.py`, and two `utils/env_parsers/*` files.

Any change targeting these old paths must be **rewritten** against the new layout before
it can be applied — the old files have been deleted from HEAD.

## Hard rules

1. **Never** restore deleted files. If a patch tries to add back any of:
`utils/vfm/vlm/*`, `utils/vfm/fused_adam.py`, `utils/vlm/compute_flops_qwen3vl.py`,
`utils/vlm/flop_calculator.py`, `utils/one_logger/*`, `utils/optim_instantiate.py`,
`utils/configs/lr_scheduler.py`, `utils/training_telemetry/context_managers.py`,
`utils/env_parsers/customization_env_parser.py`,
`utils/env_parsers/inference_env_parser.py` — it's stale; redirect or drop.
2. **Always** verify the new file already contains the equivalent feature before adding
it. The merged files are a *superset* of both forks' behavior; the change you're
porting may already be present.
3. If you can't find an equivalent symbol in the new location, stop and ask — silent
feature loss is worse than the consolidation churn.
4. **For deletions that originated as dead code** (one_logger, optim_instantiate,
flop_calculator, env_parsers, etc.), do not re-introduce them just because a patch
asks. Verify the requested behavior is genuinely needed first; the original code was
broken or orphan when deleted.

## Path redirect table

### Consolidated (merged into new location)

| Old import / file path | New import / file path | Notes |
|---|---|---|
| `cosmos.utils.vfm.vlm.constant` | `cosmos.utils.vlm.constant` | identical names exported |
| `cosmos.utils.vfm.vlm.create_position_ids` | `cosmos.utils.vlm.create_position_ids` | `get_position_ids`, `get_rope_index_qwen3_vl` |
| `cosmos.utils.vfm.vlm.optimizer` | `cosmos.utils.vlm.optimizer` | `OptimizerConfig`, `build_optimizers`, `build_lr_schedulers` |
| `cosmos.utils.vfm.vlm.pretrained_models_downloader` | `cosmos.utils.vlm.pretrained_models_downloader` | `maybe_download_hf_model_from_s3`, `parallel_download_s3_prefix_to_dir`, `s3_dir_exists`, `has_model_weights`, `_load_s3_credentials`, `_download_from_hf_hub` |
| `cosmos.utils.vfm.fused_adam` | `cosmos.utils.fused_adam` | `FusedAdam` (DTensor-aware) |
| `cosmos.utils.vlm.compute_flops_qwen3vl` | `cosmos.tools.flops.qwen3_vl` | `compute_qwen3vl_flops_from_config` now accepts `is_causal` (defaults to True). Numeric output verified bit-identical when `is_causal=False` against the deleted local impl. The only caller (`utils/vlm/flop_calculator.py`) has since been deleted as well, so this redirect is mostly historical. |
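
The consolidated rows above can be applied mechanically to a module path. The following is a minimal sketch, not part of the repository: `REDIRECTS` and `redirect_module` are hypothetical names, and the mapping only copies the rows of this table. It rewrites paths only, so symbol-level changes (such as the new `is_causal` argument) and the deleted modules listed in the next table still need manual review.

```python
# Hypothetical helper: old module path -> new module path, copied from the
# "Consolidated" table. Nothing here is imported from the cosmos package.
REDIRECTS = {
    "cosmos.utils.vfm.vlm.constant": "cosmos.utils.vlm.constant",
    "cosmos.utils.vfm.vlm.create_position_ids": "cosmos.utils.vlm.create_position_ids",
    "cosmos.utils.vfm.vlm.optimizer": "cosmos.utils.vlm.optimizer",
    "cosmos.utils.vfm.vlm.pretrained_models_downloader": "cosmos.utils.vlm.pretrained_models_downloader",
    "cosmos.utils.vfm.fused_adam": "cosmos.utils.fused_adam",
    "cosmos.utils.vlm.compute_flops_qwen3vl": "cosmos.tools.flops.qwen3_vl",
}


def redirect_module(module: str) -> str:
    """Return the post-refactor module path, or the input unchanged if no row matches."""
    for old, new in REDIRECTS.items():
        # Match the module itself or any dotted name below it.
        if module == old or module.startswith(old + "."):
            return new + module[len(old):]
    return module


if __name__ == "__main__":
    for name in ("cosmos.utils.vfm.fused_adam", "cosmos.utils.vfm.optimizer"):
        print(name, "->", redirect_module(name))
```

Paths that are not in the table, such as `cosmos.utils.vfm.optimizer`, pass through untouched, which keeps the helper from rewriting live modules that merely share a prefix.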

### Deleted entirely (do NOT re-add)

| Old import / file path | Why removed |
|---|---|
| `cosmos.utils.vfm.vlm.flop_calculator` | Preserved during initial vfm/vlm→vlm merge; confirmed zero in-tree refs and removed. `FlopCalculator` class no longer exists anywhere. |
| `cosmos.utils.vlm.flop_calculator` | Same class, post-merge location. Also deleted. |
| `cosmos.utils.one_logger.*` (whole subdir, 5 files + README) | `one_logger_override_utils.py` imported `OneLoggerCallback` from `cosmos.utils.callback`, but that class never existed in the post-imaginaire4 tree (also not in the `one-logger` PyPI package). The only call site (`load_config(..., enable_one_logger=True)`) silently swallowed the `ImportError`, so OneLogger was effectively never enabled. The whole subdir and the `enable_one_logger` parameter on `load_config` were removed. |
| `cosmos.utils.optim_instantiate` (`get_regular_param_group`, `get_base_optimizer`, `get_base_scheduler`) | Superseded by per-pipeline optimizer builders in `utils/vfm/optimizer.py` and `utils/vlm/optimizer.py`. Zero importers when deleted. |
| `cosmos.utils.configs.lr_scheduler` (`LambdaLinearSchedulerConfig`) | LazyCall wrapper around `LambdaLinearScheduler`. Zero importers when deleted. The implementation it wrapped — `cosmos.utils.functional.lr_scheduler.LambdaLinearScheduler` — is still alive. Whole `utils/configs/` subdir gone (it only had this one file). |
| `cosmos.utils.training_telemetry.context_managers` | Orphan. The live entry point `utils.training_telemetry.telemetry` uses `utils.py` and lazy-imports `TelemetryCallback` from `callback.py`; neither path touches the context_managers file. Zero external **or** internal cross-refs at delete time. |
| `cosmos.utils.env_parsers.customization_env_parser` (`CustomizationEnvParser`) | Inference-side AWS Fleet/Lambda env vars (FT_AWS_*, LAMBDA_STAGE, FLEET_FUNCTION). Zero importers in the training tree. |
| `cosmos.utils.env_parsers.inference_env_parser` (`InferenceEnvParser`) | Inference deployment env vars (TRT_ENABLED, NIM_DEPLOYMENT, PORT, MODEL_MODULE, …). Zero importers in the training tree. |

### Signature changes

| Function | Change |
|---|---|
| `cosmos.utils.config.load_config(config_path, opts, enable_one_logger=...)` | The `enable_one_logger` keyword argument was removed when `utils/one_logger/` was deleted. New signature: `load_config(config_path: str, opts: list[str]) -> Config`. Drop the kwarg from any patch that adds it back. |
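
One way to spot stale call sites in a patch is to scan the Python source for `load_config` calls that still pass the removed keyword. The sketch below uses only the standard-library `ast` module; `find_enable_one_logger_calls` is a hypothetical helper name, and it matches on the function name alone, so an unrelated function that happens to be called `load_config` would also be flagged.

```python
import ast


def find_enable_one_logger_calls(source: str) -> list[int]:
    """Return line numbers of load_config(...) calls that pass enable_one_logger."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both load_config(...) and some_module.load_config(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "load_config" and any(
            kw.arg == "enable_one_logger" for kw in node.keywords
        ):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "cfg = load_config(path, opts, enable_one_logger=True)\n"
        "cfg2 = load_config(path, opts)\n"
    )
    print(find_enable_one_logger_calls(sample))
```

Each reported line is a call where the keyword argument should simply be dropped.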

## Feature mapping inside merged files (what came from where)

If a backport touches a specific symbol/feature, this tells you whether the new file
already has it.

### `utils/vlm/optimizer.py` — `OptimizerConfig`
Contains the union of both forks:
- Legacy named freeze flags: `freeze_vision_encoder`, `freeze_mm_projector`, `freeze_llm` (both forks)
- `freeze_llm_moe_gates: bool = False` — was vlm-only (declared but not yet referenced in code as of merge)
- `trainable_params: Optional[list[str]] = None` — was vfm/vlm-only; regex whitelist; enforced in `__attrs_post_init__`
- `frozen_params: Optional[list[str]] = None` — was vfm/vlm-only; regex blacklist; mutually exclusive with `trainable_params`
- `betas` now wrapped in `tuple(...)` inside `build_optimizers` — was vfm/vlm-only bugfix
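
The whitelist/blacklist semantics above can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the real `OptimizerConfig`: `FreezeSpec` and `is_trainable` are hypothetical names, a stdlib `dataclass` with `__post_init__` stands in for the attrs class and its `__attrs_post_init__`, and matching with `re.search` is an assumption about how the regex lists are applied.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreezeSpec:
    """Hypothetical stand-in for the two regex fields on OptimizerConfig."""

    trainable_params: Optional[list[str]] = None  # regex whitelist
    frozen_params: Optional[list[str]] = None  # regex blacklist

    def __post_init__(self) -> None:
        # The two lists are documented as mutually exclusive.
        if self.trainable_params is not None and self.frozen_params is not None:
            raise ValueError("trainable_params and frozen_params are mutually exclusive")

    def is_trainable(self, name: str) -> bool:
        """Decide whether a parameter name should receive gradients (assumed semantics)."""
        if self.trainable_params is not None:
            return any(re.search(p, name) for p in self.trainable_params)
        if self.frozen_params is not None:
            return not any(re.search(p, name) for p in self.frozen_params)
        return True


if __name__ == "__main__":
    spec = FreezeSpec(frozen_params=[r"^vision_encoder\."])
    print(spec.is_trainable("vision_encoder.blocks.0.weight"))
    print(spec.is_trainable("llm.layers.0.mlp.weight"))
```

When porting a patch that touches either field, check the merged `utils/vlm/optimizer.py` for the actual matching rule before relying on this illustration.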

### `utils/vlm/pretrained_models_downloader.py`
Contains the union:
- `resolve_hf_model_store(credentials, bucket)` — was vlm-only; maps checkpoint-store creds to permanent HF model store
- `_load_s3_credentials(credential_path)` — was vfm/vlm-only; env-var-aware via `cosmos.utils.easy_io.backends.auto_auth` (replaces raw `json.load(open(...))`)
- `_download_from_hf_hub(model_name_or_path, include_model_weights)` — was vfm/vlm-only; HF Hub fallback when no S3 creds
- `_stream_download` (inside `parallel_download_s3_prefix_to_dir`) — was vlm-only; bypasses ETag validation for GCS-compatible buckets
- `maybe_download_hf_model_from_s3` body:
  - Local-dir short-circuit (`if os.path.isdir(model_name_or_path)`) — was vfm/vlm-only
  - No-credentials → `_download_from_hf_hub` branch — was vfm/vlm-only
  - `not INTERNAL` → `CheckpointConfig.maybe_from_uri` + `download_checkpoint_v2` branch — was vfm/vlm-only
  - Cache check accepts `vocab.json` OR `tokenizer.json` — was vlm-only (vfm/vlm checked only `vocab.json`)

### `utils/vlm/flop_calculator.py` — DELETED
Initially merged from vfm/vlm/. Subsequently determined to have zero in-tree
references (the dynamic batcher this was built for never wired it up here) and
deleted on 2026-05-18. The bit-identical FLOP numeric verification still holds
for `cosmos.tools.flops.qwen3_vl.compute_qwen3vl_flops_from_config(..., is_causal=False)`
if you ever need to rebuild this calculator.

### `utils/vlm/create_position_ids.py`, `utils/vlm/constant.py`
The vfm/vlm version was adopted wholesale. Logic-identical to the prior `utils/vlm/`
version — only docstrings, type annotations, and `Optional[T]` → `T | None` differ.

### `utils/fused_adam.py` (was `utils/vfm/fused_adam.py`)
DTensor-aware via `cosmos.utils.misc.get_local_tensor_if_DTensor`. For non-DTensor params
(the only kind the old top-level `utils/fused_adam.py` handled), behavior is equivalent:
`get_local_tensor_if_DTensor(x)` is a no-op for regular tensors. TE import path is
`transformer_engine_torch as tex` (unchanged from top-level pre-refactor).

## NOT consolidated — two `fused_adam.py` remain by design

`utils/fused_adam.py` and `utils/vlm/fused_adam.py` both still exist. They are **not**
duplicates:

- `utils/fused_adam.py`: imports `transformer_engine_torch as tex`, uses
`cosmos.utils.misc.get_local_tensor_if_DTensor`.
- `utils/vlm/fused_adam.py`: imports `transformer_engine as te`, uses
`te.pytorch.optimizers.multi_tensor_adam`, with an inlined `get_local_tensor_if_DTensor`.

These differ in their TE module path. Unifying them requires verifying that
`transformer_engine_torch.multi_tensor_adam*` and
`te.pytorch.optimizers.multi_tensor_adam*` resolve to equivalent CUDA kernels at the
runtime TE version. **Do not unify without that runtime verification.**

## Import sites that were redirected on 2026-05-18

These already point at the new paths in HEAD; if you see an external patch still using
the old paths, redirect:

**vfm/vlm consolidation:**
- `cosmos/model/vfm/vlm_model.py` (3 import sites)
- `cosmos/model/vfm/algorithm/loss/cross_entropy.py`
- `cosmos/data/vfm/augmentors/vlm/tokenize_data.py`
- `cosmos/data/vfm/processors/base.py`
- `cosmos/data/vfm/processors/__init__.py`
- `cosmos/utils/vfm/optimizer.py` (the `FusedAdam` lazy import)

**one_logger removal:**
- `cosmos/utils/config.py` — `load_config` lost the `enable_one_logger` parameter and the gated lazy-import block
- `scripts/train.py` — dropped `enable_one_logger=True` kwarg
- `cosmos/data/vfm/action/compute_action_stats.py` — dropped `enable_one_logger=False` kwarg

## Workflow when applying any change to the utils tree

1. **Read the patch target path.** If it matches an entry in the redirect table
(consolidated), rewrite the path before applying. If it matches an entry in the
"Deleted entirely" table, the patch is targeting code that no longer exists — drop
it or escalate.
2. **Check feature mapping above.** If the change adds/modifies a feature listed under
"Feature mapping inside merged files," confirm the merged file's current state — the
change may already be present (in which case it's a no-op), partially present (so you
need to merge carefully), or absent (port it).
3. **For anything calling `compute_qwen3vl_flops_from_config`:** if the change touches
the FLOP computation, re-run the equivalence check (see `[[utils-vfm-vlm-forks]]`
memory for context) before assuming the dynamic batcher calibration still holds.
4. **For `load_config` call sites:** if a patch passes `enable_one_logger=...`, drop
that kwarg — the parameter was removed.
5. **Never** create a new `utils/vfm/vlm/`, `utils/one_logger/`, or `utils/configs/`
directory, and never restore the deleted files listed above. If a patch can't be
cleanly applied to the new layout, stop and ask the user.
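
Step 1 of the workflow can be sketched as a small triage function over a patch's target paths. This is an illustration only: `triage`, `DELETED`, and `MOVED` are hypothetical names, the patterns are copied from the tables above, and paths are assumed to be relative to `cosmos_training/cosmos/`. Deleted patterns are checked first so that `utils/vfm/vlm/flop_calculator.py` is not mistaken for a file that merely moved.

```python
from fnmatch import fnmatch

# Patterns from the "Deleted entirely" table (relative to cosmos_training/cosmos/).
DELETED = [
    "utils/vfm/vlm/flop_calculator.py",
    "utils/vlm/flop_calculator.py",
    "utils/one_logger/*",
    "utils/optim_instantiate.py",
    "utils/configs/*",
    "utils/training_telemetry/context_managers.py",
    "utils/env_parsers/customization_env_parser.py",
    "utils/env_parsers/inference_env_parser.py",
]

# Patterns from the "Consolidated" table: the file now lives elsewhere.
MOVED = [
    "utils/vfm/vlm/*",
    "utils/vfm/fused_adam.py",
    "utils/vlm/compute_flops_qwen3vl.py",
]


def triage(path: str) -> str:
    """Classify a patch target path as 'drop-or-escalate', 'rewrite-path', or 'apply'."""
    if any(fnmatch(path, pattern) for pattern in DELETED):
        return "drop-or-escalate"
    if any(fnmatch(path, pattern) for pattern in MOVED):
        return "rewrite-path"
    return "apply"


if __name__ == "__main__":
    for target in (
        "utils/vfm/vlm/optimizer.py",
        "utils/one_logger/one_logger_override_utils.py",
        "utils/vlm/fused_adam.py",
    ):
        print(target, "->", triage(target))
```

A result of `apply` only means the path itself is current; steps 2 to 4 still apply to the patch contents.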

## Related memory

`[[utils-vfm-vlm-forks]]` in the project memory captures the consolidation history,
the reasoning behind the leftover `utils/vlm/fused_adam.py`, and the follow-up
deletions in this same session.
43 changes: 43 additions & 0 deletions .config/rumdl.toml
@@ -0,0 +1,43 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# https://rumdl.dev/global-settings/
[global]
flavor = "standard"
exclude = [
  "ATTRIBUTIONS.md",
  "_src",
]
disable = [
  "MD013", # line-length
  "MD033", # inline-html
  "MD040", # fenced-code-language
]

# https://rumdl.dev/rules/

[per-file-ignores]
"README.md" = [
  "MD041" # first-line-heading
]

# ul-style
[MD004]
style = "dash"

# table-format
[MD060]
enabled = true
style = "aligned"
32 changes: 32 additions & 0 deletions .coveragerc
@@ -0,0 +1,32 @@
# https://coverage.readthedocs.io/en/latest/subprocess.html

[run]
data_file = outputs/coverage/coverage
disable_warnings =
    module-not-imported
    no-data-collected
parallel = True
patch = subprocess

[report]
exclude_lines =
    @overload
    def __repr__
    if __name__ == .__main__.:
    if TYPE_CHECKING:
    pragma: no cover
    raise AssertionError
    raise NotImplementedError
omit =
    *_test.py
skip_empty = True
show_missing = True

[html]
directory = outputs/coverage/html

[json]
output = outputs/coverage/coverage.json

[xml]
output = outputs/coverage/coverage.xml
8 changes: 8 additions & 0 deletions .dockerignore
@@ -0,0 +1,8 @@
.venv
.git
/checkpoints
/datasets
/output
/examples/**/checkpoints
/examples/**/output
/examples/**/datasets
30 changes: 30 additions & 0 deletions .gitattributes
@@ -0,0 +1,30 @@
*.lock linguist-generated=true
tests/data/** linguist-generated=true
ATTRIBUTIONS.md linguist-generated=true

assets/** filter=lfs diff=lfs merge=lfs -text

# Video files
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.avi filter=lfs diff=lfs merge=lfs -text
*.mov filter=lfs diff=lfs merge=lfs -text
*.mkv filter=lfs diff=lfs merge=lfs -text
*.webm filter=lfs diff=lfs merge=lfs -text

# Audio files
*.wav filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
*.flac filter=lfs diff=lfs merge=lfs -text
*.aac filter=lfs diff=lfs merge=lfs -text

# Image files
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.tiff filter=lfs diff=lfs merge=lfs -text
*.bmp filter=lfs diff=lfs merge=lfs -text

# Logo thumbnail is small and was committed as a regular git blob before
# LFS rules were introduced. Keep it out of LFS to preserve the existing
# blob.
cosmos-logo-thumbnail.png -filter -diff -merge text=auto
65 changes: 65 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,65 @@
---
name: Bug Report
about: Report a reproducible bug or unexpected behavior
title: "[BUG] <short description>"
labels: 'bug'
assignees:
- spectralflight
- jeanachoi

---

## Bug Description

<!-- Clear and concise description of the bug. What did you expect? What happened? -->

## Reproduction Steps

```bash
# Minimal command or script to reproduce
```

**Reproducibility:**

- [ ] Always
- [ ] Intermittently (~___% of the time)
- [ ] Only once

## Expected vs. Actual Behavior

| | Description |
| ------------ | --------------------------- |
| **Expected** | What you expected to happen |
| **Actual** | What actually happened |

## Outputs

<details>
<summary>Error / Stack Trace</summary>

<!-- Attach or paste error / stack trace -->

</details>

<details>
<summary>Log Files</summary>

<!-- Attach or paste logs from the output directory -->

</details>

## System Information

| Field | Value |
| ---------------------------- | ------------------------------------------- |
| **Environment** | <!-- e.g. UV, Docker --> |
| **Hardware** | <!-- e.g. DGX H100 x8, single A100 80GB --> |
| **OS** | <!-- e.g. Ubuntu 22.04 / 24.04 --> |
| **GPU Driver** | <!-- e.g. 580.95.05 --> |
| **CUDA Version** | <!-- e.g. 12.8.1 --> |
| **Python Version** | <!-- e.g. 3.13.3 --> |
| **Package Version / Commit** | <!-- e.g. v1.2.3 or git SHA --> |

## Additional Context

<!-- Workarounds tried, related issues, etc. -->