Yangyangt/try sync with internal#2

Draft
yy-code-nv wants to merge 5 commits into main from yangyangt/try_sync_with_internal

@yy-code-nv

Owner

No description provided.

lfengad and others added 5 commits June 9, 2026 14:54
### Summary
CI tests download input assets (e.g. action/video inputs) over the
network, and these downloads intermittently fail with transient gateway
errors (502/503/504), making the run flaky. This PR makes those
downloads robust and avoids re-fetching the same assets on every run.
### Changes
- **Backoff retry** (`inference/common/args.py`): wrap each input
download in an outer retry with exponential backoff + jitter (6
attempts, env-overridable via `COSMOS_DOWNLOAD_*`). Permanent errors
(400/401/403/404) fail fast.
- **Opt-in download cache**: when `COSMOS_DOWNLOAD_CACHE_DIR` is set,
downloads are cached by URL and reused across runs; unset → unchanged
behavior. Concurrent writers use an atomic move.
- **CI wiring** (`gpu-tests.yml`): the `unittest` and `inference-smoke`
jobs point at a shared persistent cache dir
(`$RUNNER_WORKSPACE/cosmos_input_cache`, outside the repo tree so
cleanup keeps it), reused across runs and PRs on the same runner.
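
The backoff retry described above can be sketched roughly as follows. This is a minimal illustration, not the code in `inference/common/args.py`: `DownloadError`, `download_with_retry`, the `fetch` callable, and the delay defaults are hypothetical names and values chosen for the example, and the `COSMOS_DOWNLOAD_*` environment overrides are left out because their exact names are not given here.

```python
import random
import time


class DownloadError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed download."""

    def __init__(self, status):
        super().__init__(f"download failed with HTTP {status}")
        self.status = status


# Statuses treated as permanent, per the description above.
PERMANENT_STATUSES = {400, 401, 403, 404}


def download_with_retry(fetch, url, max_attempts=6, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with backoff + jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except DownloadError as err:
            # Permanent errors and the final attempt are re-raised unchanged.
            if err.status in PERMANENT_STATUSES or attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter
            # (assumed jitter strategy; the source only says "jitter").
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays.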
### Impact
- Production/local behavior unchanged: cache is off unless the env var
is set; retry is transparent on success and only adds resilience on
failure.
- Only new persisted artifact is the cache dir; replaces
previously-leaked `/tmp` temp dirs in those jobs.
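
A URL-keyed cache with an atomic move, as described for `COSMOS_DOWNLOAD_CACHE_DIR`, might look like the sketch below. The function name `cached_download`, the `fetch` callable, and the SHA-256 key scheme are assumptions made for illustration; the PR does not say how cache keys are derived.

```python
import hashlib
import os
import tempfile


def cached_download(url, fetch, cache_dir):
    """Return a local path for url, reusing a cached copy when one exists."""
    os.makedirs(cache_dir, exist_ok=True)
    # Assumed key scheme: hash of the URL, so any URL maps to a safe filename.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    final_path = os.path.join(cache_dir, key)
    if os.path.exists(final_path):
        return final_path  # cache hit, no network access

    # Write to a temp file in the same directory, then move it into place.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch(url))
        # os.replace is atomic when source and target share a filesystem,
        # so concurrent writers never expose a partially written file.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A caller would take this path only when `os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR")` returns a value, and keep its existing download path otherwise.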

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Remove unused code for config.py (used for old toml config system)
- Add vision_sft_nano golden for GB200
## Summary

Adds a **DROID action-policy SFT recipe** for `nvidia/Cosmos3-Nano`,
mirroring the internal `droid_lerobot_8b` policy run, so users can
post-train the action-generation + action heads on DROID (LeRobot v3.0)
data.

## What's included

- **`data/vfm/action/datasets/droid_lerobot_dataset.py`** — DROID
LeRobot dataset: compact columnar load + episode-aware windowing
(replaces an eager full-table materialization), plus `joint_pos` (8D: 7
joints + gripper) and `use_state` support.
- **`data/vfm/action/datasets/action_sft_dataset.py`** (new) —
`get_action_droid_sft_dataset(...)` wrapping the dataset through
`ActionTransformPipeline`.
- **`configs/.../action/posttrain_config/action_policy_droid_nano.py`**
(new) — registered `action_policy_droid_nano` experiment (Cosmos3-Nano /
8B MoT): optimizer trains gen+action heads (5× LR on action heads),
`LambdaLinear` schedule, count-based batch, res480,
`encode_exact_durations=[33]` (chunk 32 → 33 frames).
- **`checkpoint/dcp.py`** — EMA warm-start: when `keys_to_skip_loading`
excludes `net_ema.`, initialize `net_ema = net` from the base weights so
EMA starts from the init rather than zeros.
- **`examples/toml/sft_config/action_policy_droid_{nano,repro}.toml`** —
1-GPU smoke + scaled (res480) configs.
- **`examples/launch_sft_action_policy_droid.sh`** +
**`docs/action_policy_droid_posttraining.md`** — runnable launcher and
walkthrough.
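
The EMA warm-start in the `checkpoint/dcp.py` item can be illustrated with plain dictionaries. This sketch assumes that `keys_to_skip_loading` holds key prefixes and that the warm-start applies when the `net_ema.` keys are among those skipped; `warm_start_ema` and the prefix-matching rule are hypothetical, and real code would copy tensors rather than Python lists.

```python
import copy


def warm_start_ema(loaded_state, keys_to_skip_loading,
                   net_prefix="net.", ema_prefix="net_ema."):
    """Copy net.* entries to net_ema.* when EMA weights were skipped on load."""
    # Assumption: an entry in keys_to_skip_loading skips every key it prefixes.
    ema_skipped = any(ema_prefix.startswith(skip) for skip in keys_to_skip_loading)
    state = dict(loaded_state)
    if ema_skipped:
        for key, value in loaded_state.items():
            if key.startswith(net_prefix):
                ema_key = ema_prefix + key[len(net_prefix):]
                # Deep copy so later updates to net do not alias the EMA copy.
                state[ema_key] = copy.deepcopy(value)
    return state
```

With this, the EMA weights begin as a copy of the base weights rather than an uninitialized or zero state.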

## Validation

End-to-end on H200:
- **1 node / 8×H200** — dry-run + training at res480,
`max_samples_per_batch=32` (64 OOMs at 139 GiB; internal used 128 on
GB200).
- **2 nodes / 16 ranks** — HSDP `shard 8 × replicate 2`, `TRAIN_EXIT=0`.
- Recipe faithful to internal `droid_lerobot_8b`: lr 1e-4 / betas / wd,
5× action-head LR, `LambdaLinear`, shift `{256:3,480:5,720:10}`,
`concat_view`, `chunk_length=32`.
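
The 5× action-head learning rate can be expressed as two optimizer parameter groups. The sketch below is an assumption about how such a split could be built, not the recipe's code: `build_param_groups` and the `"action_head"` name marker are hypothetical, and the returned list uses the per-group `{"params": ..., "lr": ...}` dictionary format that PyTorch optimizers accept.

```python
def build_param_groups(named_params, base_lr=1e-4, action_head_lr_scale=5.0,
                       action_head_marker="action_head"):
    """Split parameters into a base group and a higher-LR action-head group."""
    base, action = [], []
    for name, param in named_params:
        # Hypothetical rule: parameters whose name contains the marker
        # belong to the action heads.
        (action if action_head_marker in name else base).append(param)
    return [
        {"params": base, "lr": base_lr},
        {"params": action, "lr": base_lr * action_head_lr_scale},
    ]
```

In a PyTorch setting the input would typically come from `model.named_parameters()`, filtered to the parameters being trained.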

## Notes

- Count-based batch (`max_samples_per_batch`,
`max_sequence_length=None`) lives in the experiment Python — TOML cannot
express `null`, and the loader only overrides keys present in the TOML.
- Base checkpoint: convert `nvidia/Cosmos3-Nano` → DCP and pass via
`BASE_CHECKPOINT_PATH`; action heads init fresh (skipped on load).

---------

Signed-off-by: Hao Liang <haolia@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: lfengad <liangf@nvidia.com>
Co-authored-by: Yu-Wei Chao <82182961+ychao-nvidia@users.noreply.github.com>
…ovided (NVIDIA#33)

## Summary

`LocalBackend.join_path` accepted `Union[str, Path]` inputs but always
returned `str` (via `os.path.join`), even when `Path` objects were
passed. This violated the type contract and could cause `AttributeError`
downstream.

## Changes

- **local_backend.py**: Now checks if any input is a `Path` and returns
`Path(result)` accordingly. Removed the stale TODO that acknowledged
this issue.
- **base_backend.py, easy_io.py, file_client.py**: Updated return type
from `str` to `Union[str, Path]`.
- **boto3_backend.py, msc_backend.py, http_backend.py**: Updated return
type signature for consistency with the abstract base class.
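
The `join_path` behavior described above could be sketched as a standalone function. This is an illustration of the stated contract, not the actual method body in `local_backend.py`, and the parameter names are assumptions.

```python
import os
from pathlib import Path
from typing import Union


def join_path(filepath: Union[str, Path],
              *filepaths: Union[str, Path]) -> Union[str, Path]:
    """Join path components, returning a Path if any input was a Path."""
    # os.path.join accepts path-like objects and returns a str here.
    result = os.path.join(filepath, *filepaths)
    if any(isinstance(p, Path) for p in (filepath, *filepaths)):
        return Path(result)
    return result
```

Callers that pass only strings keep getting a `str`, so existing string-based code paths are unaffected.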

## Related Issue

Closes NVIDIA#32

Co-authored-by: Maosheng Liao <maoshengl@nvidia.com>
…line

Ran packages/cosmos-framework-release/release.sh against current i4 source:
- 451 files in mapping → 152 modified + 15 new in target
- Includes COSMOS-RELEASE-* directive application, license/license-header
  rewrite, redactions, and import rewrites (cosmos.* → cosmos_framework.*).
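
An import rewrite of the `cosmos.*` → `cosmos_framework.*` kind can be approximated with a regular expression. How `release.sh` actually performs the rewrite is not shown here, so the sketch below is only an assumed approach; `rewrite_imports` is a hypothetical helper and it handles just the simple case where `cosmos` is the first module named on an import line.

```python
import re

# Match "import cosmos" or "from cosmos" at the start of a line, where the
# package name is followed by a dot, whitespace, or end of line. Names such as
# "cosmos_framework" or "cosmosis" are left untouched by the lookahead.
_IMPORT_RE = re.compile(r"^(\s*(?:from|import)\s+)cosmos(?=[.\s]|$)", re.MULTILINE)


def rewrite_imports(source: str) -> str:
    """Rewrite top-level cosmos imports to the cosmos_framework package."""
    return _IMPORT_RE.sub(r"\1cosmos_framework", source)
```

Forms such as `import a, cosmos.b` or string-based dynamic imports are not covered by this pattern and would need a parser-based pass.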

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yy-code-nv yy-code-nv force-pushed the yangyangt/try_sync_with_internal branch from 90e7ca9 to 21230f9 Compare June 11, 2026 17:23

4 participants