Yangyangt/try sync with internal by yy-code-nv · Pull Request #2 · yy-code-nv/cosmos-framework

yy-code-nv · 2026-06-09T14:54:23Z

No description provided.

### Summary

CI tests download input assets (e.g. action/video inputs) over the network, and these intermittently fail with transient gateway errors (502/503/504), flaking the run. This PR makes those downloads robust and avoids re-fetching the same assets every run.

### Changes

- **Backoff retry** (`inference/common/args.py`): wrap each input download in an outer retry with exponential backoff + jitter (6 attempts, env-overridable via `COSMOS_DOWNLOAD_*`). Permanent errors (400/401/403/404) fail fast.
- **Opt-in download cache**: when `COSMOS_DOWNLOAD_CACHE_DIR` is set, downloads are cached by URL and reused across runs; unset → unchanged behavior. Concurrent writers use an atomic move.
- **CI wiring** (`gpu-tests.yml`): the `unittest` and `inference-smoke` jobs point at a shared persistent cache dir (`$RUNNER_WORKSPACE/cosmos_input_cache`, outside the repo tree so cleanup keeps it), reused across runs and PRs on the same runner.

### Impact

- Production/local behavior unchanged: cache is off unless the env var is set; retry is transparent on success and only adds resilience on failure.
- Only new persisted artifact is the cache dir; replaces previously-leaked `/tmp` temp dirs in those jobs.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
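The backoff retry and opt-in cache described in this change can be sketched roughly as follows. This is a minimal illustration, not the code in `inference/common/args.py`: `DownloadError`, `download_with_retry`, `cached_download`, the `fetch` callable, and the delay parameters are all hypothetical names and values chosen for the example. Only the attempt count (6), the fail-fast status codes, and the atomic-move idea come from the description above.

```python
import hashlib
import os
import random
import tempfile
import time
from pathlib import Path
from typing import Callable

# Status codes the PR text treats as permanent (no retry).
PERMANENT_STATUSES = {400, 401, 403, 404}


class DownloadError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed download."""

    def __init__(self, status: int):
        super().__init__(f"download failed with HTTP {status}")
        self.status = status


def download_with_retry(
    fetch: Callable[[str, Path], None],
    url: str,
    dest: Path,
    attempts: int = 6,          # matches the "6 attempts" in the PR text
    base_delay: float = 1.0,    # assumed value
    max_delay: float = 30.0,    # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call fetch(url, dest), retrying transient failures with backoff + jitter."""
    for attempt in range(attempts):
        try:
            fetch(url, dest)
            return
        except DownloadError as err:
            is_last = attempt == attempts - 1
            if err.status in PERMANENT_STATUSES or is_last:
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


def cached_download(fetch, url: str, cache_dir: Path, **retry_kwargs) -> Path:
    """Return a cached copy of url, downloading it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry by URL (assumed scheme: SHA-256 of the URL).
    target = cache_dir / hashlib.sha256(url.encode()).hexdigest()
    if target.exists():
        return target
    # Write to a temp file in the same directory, then move it into place.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        download_with_retry(fetch, url, tmp, **retry_kwargs)
        os.replace(tmp, target)  # atomic when source and target share a filesystem
    finally:
        tmp.unlink(missing_ok=True)  # no-op after a successful move
    return target
```

In this sketch a caller would use `cached_download` only when `COSMOS_DOWNLOAD_CACHE_DIR` is set and fall back to a plain `download_with_retry` otherwise, which keeps the default behavior unchanged. Writing to a temp file next to the target means concurrent writers never expose a half-written cache entry; the last `os.replace` simply wins.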

- Remove unused code for config.py (used for old toml config system)
- Add vision_sft_nano golden for GB200

## Summary

Adds a **DROID action-policy SFT recipe** for `nvidia/Cosmos3-Nano`, mirroring the internal `droid_lerobot_8b` policy run, so users can post-train the action-generation + action heads on DROID (LeRobot v3.0) data.

## What's included

- **`data/vfm/action/datasets/droid_lerobot_dataset.py`**: DROID LeRobot dataset: compact columnar load + episode-aware windowing (replaces an eager full-table materialization), plus `joint_pos` (8D: 7 joints + gripper) and `use_state` support.
- **`data/vfm/action/datasets/action_sft_dataset.py`** (new): `get_action_droid_sft_dataset(...)` wrapping the dataset through `ActionTransformPipeline`.
- **`configs/.../action/posttrain_config/action_policy_droid_nano.py`** (new): registered `action_policy_droid_nano` experiment (Cosmos3-Nano / 8B MoT): optimizer trains gen+action heads (5× LR on action heads), `LambdaLinear` schedule, count-based batch, res480, `encode_exact_durations=[33]` (chunk 32 → 33 frames).
- **`checkpoint/dcp.py`**: EMA warm-start: when `keys_to_skip_loading` excludes `net_ema.`, initialize `net_ema = net` from the base weights so EMA starts from the init rather than zeros.
- **`examples/toml/sft_config/action_policy_droid_{nano,repro}.toml`**: 1-GPU smoke + scaled (res480) configs.
- **`examples/launch_sft_action_policy_droid.sh`** + **`docs/action_policy_droid_posttraining.md`**: runnable launcher and walkthrough.

## Validation

End-to-end on H200:

- **1 node / 8×H200**: dry-run + training at res480, `max_samples_per_batch=32` (64 OOMs at 139 GiB; internal used 128 on GB200).
- **2 nodes / 16 ranks**: HSDP `shard 8 × replicate 2`, `TRAIN_EXIT=0`.
- Recipe faithful to internal `droid_lerobot_8b`: lr 1e-4 / betas / wd, 5× action-head LR, `LambdaLinear`, shift `{256:3,480:5,720:10}`, `concat_view`, `chunk_length=32`.

## Notes

- Count-based batch (`max_samples_per_batch`, `max_sequence_length=None`) lives in the experiment Python: TOML cannot express `null`, and the loader only overrides keys present in the TOML.
- Base checkpoint: convert `nvidia/Cosmos3-Nano` → DCP and pass via `BASE_CHECKPOINT_PATH`; action heads init fresh (skipped on load).

---------

Signed-off-by: Hao Liang <haolia@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: lfengad <liangf@nvidia.com>
Co-authored-by: Yu-Wei Chao <82182961+ychao-nvidia@users.noreply.github.com>
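The EMA warm-start described for `checkpoint/dcp.py` can be illustrated with plain dictionaries standing in for state dicts. This is a sketch under assumptions, not the DCP loader: `load_with_ema_warm_start` is a hypothetical name, values are lists rather than tensors, and the `net.action_head.` prefix is invented here to show that freshly initialized (skipped) heads are mirrored into the EMA copy as well.

```python
import copy
from typing import Dict, Iterable, List

State = Dict[str, List[float]]


def load_with_ema_warm_start(
    state: State, ckpt: State, keys_to_skip_loading: Iterable[str]
) -> State:
    """Load ckpt into a copy of state, then warm-start EMA weights if they were skipped."""
    prefixes = tuple(keys_to_skip_loading)
    out = copy.deepcopy(state)

    # Load every checkpoint entry whose key does not match a skipped prefix.
    for key, value in ckpt.items():
        if key in out and not key.startswith(prefixes):
            out[key] = copy.deepcopy(value)

    # If the EMA weights were skipped, they still hold their initial values
    # (zeros here). Copy the just-loaded net weights so EMA starts from them.
    if "net_ema." in prefixes:
        for key in list(out):
            if key.startswith("net."):
                ema_key = "net_ema." + key[len("net."):]
                if ema_key in out:
                    out[ema_key] = copy.deepcopy(out[key])
    return out


# Usage: base weights load into net, action heads stay fresh, EMA mirrors net.
state = {
    "net.w": [0.0],
    "net.action_head.w": [9.0],      # hypothetical fresh-init action head
    "net_ema.w": [0.0],
    "net_ema.action_head.w": [0.0],
}
ckpt = {"net.w": [1.5], "net.action_head.w": [7.0], "net_ema.w": [4.0]}
loaded = load_with_ema_warm_start(state, ckpt, ["net_ema.", "net.action_head."])
```

Without the warm-start step, an EMA copy that begins at zeros would need many updates before it tracks the loaded base weights, which is the problem the change addresses.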

…ovided (NVIDIA#33)

## Summary

`LocalBackend.join_path` accepted `Union[str, Path]` inputs but always returned `str` (via `os.path.join`), even when `Path` objects were passed. This violated the type contract and could cause `AttributeError` downstream.

## Changes

- **local_backend.py**: Now checks if any input is a `Path` and returns `Path(result)` accordingly. Removed the stale TODO that acknowledged this issue.
- **base_backend.py, easy_io.py, file_client.py**: Updated return type from `str` to `Union[str, Path]`.
- **boto3_backend.py, msc_backend.py, http_backend.py**: Updated return type signature for consistency with the abstract base class.

## Related Issue

Closes NVIDIA#32

Co-authored-by: Maosheng Liao <maoshengl@nvidia.com>
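The behavior described for `LocalBackend.join_path` can be sketched as a standalone function. This is an illustration of the stated rule (return a `Path` when any input is a `Path`, otherwise a `str`), not the actual method body; the free-function form and parameter names are assumptions.

```python
import os
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def join_path(filepath: PathLike, *filepaths: PathLike) -> PathLike:
    """Join path components, preserving Path-ness of the inputs."""
    # os.path.join accepts str and os.PathLike components and returns a str.
    result = os.path.join(filepath, *filepaths)
    # Return a Path only when the caller passed at least one Path.
    if isinstance(filepath, Path) or any(isinstance(p, Path) for p in filepaths):
        return Path(result)
    return result
```

With this rule, a caller that passes `Path` objects can keep using attributes such as `.parent` or `.suffix` on the result, while callers that pass only strings still get a string back.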

…line

Ran packages/cosmos-framework-release/release.sh against current i4 source:

- 451 files in mapping → 152 modified + 15 new in target
- Includes COSMOS-RELEASE-* directive application, license/license-header rewrite, redactions, and import rewrites (cosmos.* → cosmos_framework.*).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
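As an illustration of the import rewrite mentioned above (`cosmos.*` → `cosmos_framework.*`), a regex-based pass might look like the following. This is a hypothetical sketch, not what `release.sh` does; `rewrite_imports` is an invented name, and the pattern only handles statements where `cosmos` is the first module named after `import` or `from`.

```python
import re

# Match "import cosmos..." or "from cosmos... import ..." at the start of a line,
# but not longer top-level names such as "cosmos_other".
_IMPORT_RE = re.compile(
    r"^([ \t]*(?:from|import)[ \t]+)cosmos(?=[.\s]|$)", re.MULTILINE
)


def rewrite_imports(source: str) -> str:
    """Rename the top-level package cosmos to cosmos_framework in import lines."""
    return _IMPORT_RE.sub(r"\1cosmos_framework", source)
```

A line-anchored pattern leaves string literals and comments that merely mention `cosmos` untouched. It would miss forms like `import a, cosmos.b`, so a syntax-tree-based rewrite is the more robust choice if such imports occur.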

lfengad and others added 5 commits June 9, 2026 14:54

Remove unused code; Add golden for GB200 (NVIDIA#28)

55c6276


yy-code-nv force-pushed the yangyangt/try_sync_with_internal branch from 90e7ca9 to 21230f9 Compare June 11, 2026 17:23

yy-code-nv wants to merge 5 commits into main from yangyangt/try_sync_with_internal


4 participants
