Yangyangt/try sync with internal#2
Draft
yy-code-nv wants to merge 5 commits into
### Summary
CI tests download input assets (e.g. action/video inputs) over the network, and these intermittently fail with transient gateway errors (502/503/504), flaking the run. This PR makes those downloads robust and avoids re-fetching the same assets every run.
### Changes
- **Backoff retry** (`inference/common/args.py`): wrap each input download in an outer retry with exponential backoff + jitter (6 attempts, env-overridable via `COSMOS_DOWNLOAD_*`). Permanent errors (400/401/403/404) fail fast.
- **Opt-in download cache**: when `COSMOS_DOWNLOAD_CACHE_DIR` is set, downloads are cached by URL and reused across runs; unset → unchanged behavior. Concurrent writers use an atomic move.
- **CI wiring** (`gpu-tests.yml`): the `unittest` and `inference-smoke` jobs point at a shared persistent cache dir (`$RUNNER_WORKSPACE/cosmos_input_cache`, outside the repo tree so cleanup keeps it), reused across runs and PRs on the same runner.
### Impact
- Production/local behavior unchanged: cache is off unless the env var is set; retry is transparent on success and only adds resilience on failure.
- Only new persisted artifact is the cache dir; replaces previously-leaked `/tmp` temp dirs in those jobs.
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
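The retry and cache behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `inference/common/args.py`: `DownloadError`, `retry_with_backoff`, `cached_fetch`, the delay values, and the SHA-256 cache key are all assumptions made for the example. Only the attempt count (6), the fail-fast status codes, and the `COSMOS_DOWNLOAD_CACHE_DIR` variable come from the description above.

```python
import hashlib
import os
import random
import tempfile
import time

# Status codes treated as permanent (fail fast), per the commit description.
PERMANENT_STATUS = {400, 401, 403, 404}


class DownloadError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed fetch."""

    def __init__(self, status):
        super().__init__(f"download failed with HTTP {status}")
        self.status = status


def retry_with_backoff(fetch, attempts=6, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping with exponential backoff + jitter.

    base_delay and max_delay are illustrative values, not the PR's defaults.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except DownloadError as err:
            # Permanent errors and the final attempt are re-raised unchanged.
            if err.status in PERMANENT_STATUS or attempt == attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # "full jitter" variant
    raise ValueError("attempts must be >= 1")


def cached_fetch(url, fetch, cache_dir=None):
    """Return the bytes for url, reusing a per-URL cache file when cache_dir is set."""
    if not cache_dir:
        return fetch(url)  # cache disabled: behavior is unchanged
    os.makedirs(cache_dir, exist_ok=True)
    # Assumed cache key: hex SHA-256 of the URL.
    target = os.path.join(cache_dir, hashlib.sha256(url.encode("utf-8")).hexdigest())
    if os.path.exists(target):
        with open(target, "rb") as handle:
            return handle.read()
    data = fetch(url)
    # Write to a temp file in the same directory, then move it into place
    # atomically so concurrent writers never expose a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return data
```

A caller would typically compose the two, passing `os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR")` as `cache_dir` and wrapping the real network call in `retry_with_backoff`, so that a cache hit skips the network entirely.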
- Remove unused code for config.py (used for old toml config system)
- Add vision_sft_nano golden for GB200
## Summary
Adds a **DROID action-policy SFT recipe** for `nvidia/Cosmos3-Nano`,
mirroring the internal `droid_lerobot_8b` policy run, so users can
post-train the action-generation + action heads on DROID (LeRobot v3.0)
data.
## What's included
- **`data/vfm/action/datasets/droid_lerobot_dataset.py`** — DROID
LeRobot dataset: compact columnar load + episode-aware windowing
(replaces an eager full-table materialization), plus `joint_pos` (8D: 7
joints + gripper) and `use_state` support.
- **`data/vfm/action/datasets/action_sft_dataset.py`** (new) —
`get_action_droid_sft_dataset(...)` wrapping the dataset through
`ActionTransformPipeline`.
- **`configs/.../action/posttrain_config/action_policy_droid_nano.py`**
(new) — registered `action_policy_droid_nano` experiment (Cosmos3-Nano /
8B MoT): optimizer trains gen+action heads (5× LR on action heads),
`LambdaLinear` schedule, count-based batch, res480,
`encode_exact_durations=[33]` (chunk 32 → 33 frames).
- **`checkpoint/dcp.py`** — EMA warm-start: when `keys_to_skip_loading`
excludes `net_ema.`, initialize `net_ema = net` from the base weights so
EMA starts from the init rather than zeros.
- **`examples/toml/sft_config/action_policy_droid_{nano,repro}.toml`** —
1-GPU smoke + scaled (res480) configs.
- **`examples/launch_sft_action_policy_droid.sh`** +
**`docs/action_policy_droid_posttraining.md`** — runnable launcher and
walkthrough.
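The EMA warm-start described for `checkpoint/dcp.py` can be illustrated with a small sketch over a flat state dict. This is an assumption-laden illustration rather than the actual DCP code: the function name `warm_start_ema`, the `net.` prefix, and the use of plain lists in place of tensors are all hypothetical; only the `net_ema.` prefix and the `keys_to_skip_loading` name come from the description.

```python
import copy


def warm_start_ema(state, keys_to_skip_loading,
                   net_prefix="net.", ema_prefix="net_ema."):
    """Return a copy of state where net_ema.* mirrors net.* if EMA was skipped on load.

    net_prefix is an assumed naming convention; values stand in for tensors.
    """
    merged = dict(state)
    if ema_prefix not in keys_to_skip_loading:
        return merged  # EMA weights were loaded normally; nothing to do
    for key, value in state.items():
        if key.startswith(net_prefix):
            ema_key = ema_prefix + key[len(net_prefix):]
            # Deep-copy so later updates to net do not alias the EMA weights.
            merged[ema_key] = copy.deepcopy(value)
    return merged
```

The point of the copy is that the EMA then starts from the same base weights as the live network instead of from zero-initialized buffers.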
## Validation
End-to-end on H200:
- **1 node / 8×H200** — dry-run + training at res480,
`max_samples_per_batch=32` (64 OOMs at 139 GiB; internal used 128 on
GB200).
- **2 nodes / 16 ranks** — HSDP `shard 8 × replicate 2`, `TRAIN_EXIT=0`.
- Recipe faithful to internal `droid_lerobot_8b`: lr 1e-4 / betas / wd,
5× action-head LR, `LambdaLinear`, shift `{256:3,480:5,720:10}`,
`concat_view`, `chunk_length=32`.
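As an illustration of the 5× action-head learning rate, the sketch below splits named parameters into two optimizer groups. It is only a sketch under stated assumptions: `build_param_groups` and the `action_head` name marker are hypothetical, and the real experiment may select parameters differently. The base rate `1e-4` and the 5× multiplier are taken from the recipe above. The returned list uses the per-group dict layout that `torch.optim` optimizers accept.

```python
def build_param_groups(named_params, base_lr=1e-4, action_lr_multiplier=5.0,
                       action_marker="action_head"):
    """Split (name, param) pairs into a base group and a higher-LR action-head group.

    action_marker is an assumed substring used to recognize action-head parameters.
    """
    base_params, action_params = [], []
    for name, param in named_params:
        if action_marker in name:
            action_params.append(param)
        else:
            base_params.append(param)
    return [
        {"params": base_params, "lr": base_lr},
        {"params": action_params, "lr": base_lr * action_lr_multiplier},
    ]
```

With a real model, `named_params` would come from `model.named_parameters()` and the result would be passed as the first argument of the optimizer constructor.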
## Notes
- Count-based batch (`max_samples_per_batch`,
`max_sequence_length=None`) lives in the experiment Python — TOML cannot
express `null`, and the loader only overrides keys present in the TOML.
- Base checkpoint: convert `nvidia/Cosmos3-Nano` → DCP and pass via
`BASE_CHECKPOINT_PATH`; action heads init fresh (skipped on load).
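The first note can be made concrete with a small sketch of an override-only merge. The function name `apply_overrides` and the flat key layout are assumptions for illustration, and the `overrides` dict stands in for an already-parsed TOML file. Because TOML has no `null` value, a key such as `max_sequence_length` can only be `None` if the Python defaults set it and the TOML leaves it out.

```python
def apply_overrides(defaults, overrides):
    """Return defaults with only the keys present in overrides replaced.

    Nested dicts are merged recursively; keys absent from overrides keep
    their default value, including None.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Defaults as they might be declared in the experiment Python (illustrative).
experiment_defaults = {"max_samples_per_batch": 32, "max_sequence_length": None}
# Stand-in for a parsed TOML file that overrides a single key.
toml_overrides = {"max_samples_per_batch": 16}

config = apply_overrides(experiment_defaults, toml_overrides)
```

Here `config["max_sequence_length"]` stays `None` because the TOML never mentions it, which is why the count-based batch settings have to live in the experiment Python.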
---------
Signed-off-by: Hao Liang <haolia@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: lfengad <liangf@nvidia.com>
Co-authored-by: Yu-Wei Chao <82182961+ychao-nvidia@users.noreply.github.com>
…ovided (NVIDIA#33)
## Summary
`LocalBackend.join_path` accepted `Union[str, Path]` inputs but always returned `str` (via `os.path.join`), even when `Path` objects were passed. This violated the type contract and could cause `AttributeError` downstream.
## Changes
- **local_backend.py**: Now checks if any input is a `Path` and returns `Path(result)` accordingly. Removed the stale TODO that acknowledged this issue.
- **base_backend.py, easy_io.py, file_client.py**: Updated return type from `str` to `Union[str, Path]`.
- **boto3_backend.py, msc_backend.py, http_backend.py**: Updated return type signature for consistency with the abstract base class.
## Related Issue
Closes NVIDIA#32
Co-authored-by: Maosheng Liao <maoshengl@nvidia.com>
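The `join_path` fix can be sketched as a standalone function. This is an approximation of the behavior described, not the actual `LocalBackend` method; the parameter names are assumptions.

```python
import os
from pathlib import Path
from typing import Union


def join_path(filepath: Union[str, Path],
              *filepaths: Union[str, Path]) -> Union[str, Path]:
    """Join path components, returning a Path if any input was a Path."""
    # os.path.join accepts path-like objects but always returns str here.
    result = os.path.join(filepath, *filepaths)
    if any(isinstance(part, Path) for part in (filepath, *filepaths)):
        return Path(result)
    return result
```

With this shape, callers that pass a `Path` can keep using `Path` attributes such as `.parent` or `.suffix` on the result, while all-string callers still get a `str` back.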
…line
Ran packages/cosmos-framework-release/release.sh against current i4 source:
- 451 files in mapping → 152 modified + 15 new in target
- Includes COSMOS-RELEASE-* directive application, license/license-header rewrite, redactions, and import rewrites (cosmos.* → cosmos_framework.*).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
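The import rewrite mentioned above (`cosmos.*` → `cosmos_framework.*`) could look roughly like the following regex pass. This is a simplified sketch and not what `release.sh` actually runs: `rewrite_imports` is a hypothetical helper, and it only handles `import`/`from` statements where `cosmos` is the first module named on the line.

```python
import re

# Match "import cosmos" / "from cosmos" at the start of a line, but only when
# "cosmos" is the whole top-level package name (followed by ".", whitespace,
# or end of line), so names like "cosmos_framework" are left alone.
_IMPORT_RE = re.compile(r"^(\s*(?:from|import)\s+)cosmos(?=[.\s]|$)", re.MULTILINE)


def rewrite_imports(source: str) -> str:
    """Rewrite top-level cosmos imports to cosmos_framework in Python source text."""
    return _IMPORT_RE.sub(r"\1cosmos_framework", source)
```

A production rewrite would more likely operate on the parsed syntax tree to cover comma-separated imports and string references, which this pattern deliberately ignores.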
90e7ca9 to
21230f9
No description provided.