Skip to content

fix(gpu): select single CDI GPU defaults#1675

Open
elezar wants to merge 1 commit into
mainfrom
fix/1477-single-cdi-gpu-defaults-elezar
Open

fix(gpu): select single CDI GPU defaults#1675
elezar wants to merge 1 commit into
mainfrom
fix/1477-single-cdi-gpu-defaults-elezar

Conversation

@elezar

@elezar elezar commented Jun 2, 2026

Copy link
Copy Markdown
Member

Summary

Updates Docker and Podman GPU handling so a bare GPU request selects one concrete NVIDIA CDI device when possible.

Default nvidia.com/gpu=all fallback is allowed only for WSL2 all-only compatibility. Explicit GPU device requests pass through unchanged, including nvidia.com/gpu=all.

Related Issue

Closes #1477

Changes

  • Adds shared CDI inventory and round-robin default selection helpers.
  • Docker uses daemon CDI inventory from DiscoveredDevices and Docker /info to allow WSL2 all-only fallback.
  • Podman builds local CDI inventory from /dev/nvidiaN device nodes and uses /dev/dxg for WSL2 all-only fallback.
  • Keeps explicit CDI GPU device requests unchanged.
  • Updates GPU docs and optional GPU e2e expectations.

Testing

  • mise exec -- cargo fmt --check
  • mise exec -- cargo check -p openshell-core -p openshell-driver-docker -p openshell-driver-podman --tests
  • mise exec -- cargo test -p openshell-core -p openshell-driver-docker -p openshell-driver-podman gpu --tests
  • mise exec -- cargo test -p openshell-driver-docker -p openshell-driver-podman wsl2 --tests
  • mise exec -- cargo test -p openshell-core -p openshell-driver-podman all_only --tests
  • mise run pre-commit

Checklist

  • Follows Conventional Commits
  • Documentation updated

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

@elezar elezar marked this pull request as draft June 3, 2026 13:55
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@elezar elezar changed the base branch from main to feat/gpu-request-spec June 3, 2026 13:58
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from de229a4 to 7fa3cc6 Compare June 3, 2026 14:03
@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 4c18100 to cec0e21 Compare June 4, 2026 07:54
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 7fa3cc6 to 16fe7a2 Compare June 4, 2026 14:47
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar changed the base branch from feat/gpu-request-spec to main June 4, 2026 14:47
@elezar elezar marked this pull request as ready for review June 4, 2026 14:51
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 16fe7a2 to 4cedfec Compare June 4, 2026 15:49
Kubernetes mirrors each limit into the matching request. VM accepts the fields
but currently ignores them.

GPU requests enter the driver layer through `SandboxSpec.gpu` and

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will change in #1156

@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 4cedfec to bbdc514 Compare June 5, 2026 07:11
@elezar elezar added gator:in-review Gator is reviewing or awaiting PR review feedback test:e2e-gpu Requires GPU end-to-end coverage labels Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored, project-valid GPU runtime work that addresses #1477 by making default Docker/Podman GPU requests select a concrete CDI device when possible.
Head SHA: bbdc51489c70995147bcae4298d82fe522c70d0e

Review findings:

  • Blocking: crates/openshell-driver-podman/src/container.rs appears to leave build_container_spec() calling build_container_spec_with_token() while the callee is gated behind #[cfg(test)], which would break non-test Podman driver compilation. Please either ungate build_container_spec_with_token() or gate the caller if it is test-only.
  • Warning: Docker WSL2 detection for permitting default nvidia.com/gpu=all is broad because it considers daemon name and arbitrary labels containing wsl. Please restrict the fallback to stronger OS/kernel indicators so a bare GPU request does not unexpectedly grant all GPUs.
  • Suggested coverage: add a driver-level test that two default GPU creates consume different devices from a multi-GPU inventory.

Docs: Fern docs were updated on the existing sandbox management page; no navigation change appears needed.
E2E: test:e2e-gpu is being applied because this changes GPU runtime/CDI behavior.

Next state: gator:in-review

@github-actions

Copy link
Copy Markdown

Label test:e2e-gpu applied for bbdc514. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from bbdc514 to 2d73188 Compare June 10, 2026 18:06
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 2d731882a278cb3f399dccde5b0dd54bcbd55f76 after the new commit fix(gpu): prefer single CDI devices for local runtimes.

Disposition: partially resolved.

Remaining items:

  • Resolved: the prior Podman #[cfg(test)] compile concern is addressed. Non-test creation now calls build_container_spec_with_token_and_gpu_default, while the old wrappers are test-only.
  • Resolved: production Docker WSL2 fallback detection is narrowed and no longer trusts daemon name or arbitrary labels.
  • Remaining warning: e2e/rust/tests/gpu_device_selection.rs still mirrors the older broad WSL2 heuristic in test expectations, including OSType, daemon Name, or Labels. Please mirror the production predicate so E2E expectations cannot disagree with the runtime behavior.
  • Remaining suggested coverage: add equivalent Podman driver-level two-create round-robin coverage using for_tests_with_gpu_inventory, matching the Docker coverage added in this head.

Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.

Next state: gator:in-review

@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Blocked

Gator is blocked by merge conflicts: GitHub reports mergeStateStatus=DIRTY for head 2d731882a278cb3f399dccde5b0dd54bcbd55f76, so the PR cannot be merged even though the required CI gates are green.

Next action: @elezar update the branch against current main and resolve the conflicts. The existing review-loop items from the latest gator re-check also remain open after the conflict is resolved:

  • Mirror the production Docker WSL2 fallback predicate in e2e/rust/tests/gpu_device_selection.rs.
  • Add equivalent Podman driver-level two-create round-robin coverage using for_tests_with_gpu_inventory.

Next state: gator:blocked

@elezar elezar added gator:blocked Gator is blocked by process or repository gates and removed gator:in-review Gator is reviewing or awaiting PR review feedback labels Jun 10, 2026
Prefer a single CDI-qualified device when Docker or Podman resolves the default GPU request to one GPU.

Allow nvidia.com/gpu=all only as a WSL2 all-only compatibility fallback, using Docker daemon info and Podman's /dev/dxg probe to identify that case.

Update driver docs, architecture notes, and GPU e2e coverage for the default selection behavior.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the fix/1477-single-cdi-gpu-defaults-elezar branch from 2d73188 to d1f5410 Compare June 10, 2026 21:41
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head d1f54107564e2b4ada77929f0332bfc8c9a722d3 after the branch update that followed the prior merge-conflict blocker.

Disposition: resolved for the blocker and prior review-loop items.

Remaining items:

  • GitHub no longer reports the prior mergeStateStatus=DIRTY conflict blocker; this head is blocked only by normal required gates/review status.
  • The previous E2E WSL2 predicate concern is resolved: the E2E expectation now mirrors production by using kernel/operating-system fields and ignoring daemon name, labels, and OSType.
  • The previous Podman coverage concern is resolved with driver-level two-create round-robin coverage using for_tests_with_gpu_inventory.
  • Nonblocking residual note: the implementation fails bare --gpu when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected. The linked issue mentions a possible nvidia.com/gpu=0 fallback for unavailable inventory; if maintainers still want that behavior, it should be handled explicitly before or after merge.

Checks: OpenShell / Branch Checks and OpenShell / GPU E2E are pending for the current head; test:e2e-gpu remains applied.

Next state: gator:watch-pipeline

@elezar elezar added gator:watch-pipeline Gator is monitoring PR CI/CD status gator:approval-needed Gator completed review; maintainer approval needed and removed gator:blocked Gator is blocked by process or repository gates gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete for latest head d1f54107564e2b4ada77929f0332bfc8c9a722d3.

Validation: maintainer-authored GPU runtime work that addresses #1477 by making default Docker/Podman GPU requests select a concrete CDI device when possible.
Review: I re-checked after the branch update; the prior merge-conflict blocker and review-loop items are resolved. No blocking review items remain. Nonblocking residual note: the implementation fails bare --gpu when Docker CDI inventory is empty or missing unless WSL2 all-only fallback is detected; if maintainers still want a nvidia.com/gpu=0 fallback for unavailable inventory, that should be handled explicitly before or after merge.
Docs: Fern docs were updated on the existing sandbox management page; no navigation change appears needed.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.
E2E: test:e2e-gpu is applied and the GPU E2E required gate is green.

Human maintainer approval or merge decision is now required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:approval-needed Gator completed review; maintainer approval needed test:e2e-gpu Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(gpu): select one CDI GPU by default for Docker and Podman

1 participant