Conversation
I'm marking this as ready, but it depends on the mentioned PR.
| "/dev/nvidia-uvm", | ||
| "/dev/nvidia-uvm-tools", | ||
| "/dev/nvidia-modeset", | ||
| "/dev/dxg", // WSL2: DXG device (GPU via DirectX kernel driver, injected by CDI) |
@pimlock when considering Tegra-based systems as in #625, the list of device nodes (and other paths) is much longer and is also system dependent. As such, I don't think that hardcoding this list is feasible. Would it be possible to process the container config instead to get the list of device nodes that we expect to access?
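A minimal sketch of the reviewer's suggestion, deriving the device-node allow-list from the container config instead of a hardcoded constant. The struct shape and names here are hypothetical simplifications (the real OCI runtime spec nests device entries under `linux.devices` in `config.json`):

```rust
// Hypothetical, simplified config shape; the OCI runtime spec nests
// device entries under `linux.devices` in config.json.
struct DeviceEntry {
    path: String,
}

struct ContainerConfig {
    devices: Vec<DeviceEntry>,
}

/// Derive the set of device nodes to permit from the container config
/// rather than a hardcoded list, so Tegra-style systems with many
/// system-dependent nodes are covered automatically.
fn expected_device_nodes(cfg: &ContainerConfig) -> Vec<&str> {
    cfg.devices.iter().map(|d| d.path.as_str()).collect()
}

fn main() {
    let cfg = ContainerConfig {
        devices: vec![
            DeviceEntry { path: "/dev/dxg".to_string() },
            DeviceEntry { path: "/dev/nvhost-ctrl".to_string() },
        ],
    };
    assert_eq!(
        expected_device_nodes(&cfg),
        vec!["/dev/dxg", "/dev/nvhost-ctrl"]
    );
    println!("{:?}", expected_device_nodes(&cfg));
}
```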
@pimlock I have updated this PR to use the v0.19.1 release of the Device Plugin instead of a SHA. The e2e test failures seem unrelated to the WSL2 changes (although they may be due to the device plugin version bump).
v0.19.1 includes WSL2 CDI spec compatibility fixes. See NVIDIA/k8s-device-plugin#1671.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
On WSL2, NVIDIA GPUs are exposed through the DXG kernel driver (/dev/dxg) rather than the native nvidia* devices. CDI injects /dev/dxg as the sole GPU device node, plus GPU libraries under /usr/lib/wsl/.

has_gpu_devices() previously only checked for /dev/nvidiactl, which does not exist on WSL2, so GPU enrichment never ran. This meant /dev/dxg was never permitted by Landlock and /proc write access (required by CUDA for thread naming) was never granted.

Fix by:
- Extending has_gpu_devices() to also detect /dev/dxg
- Adding /dev/dxg to GPU_BASELINE_READ_WRITE (device nodes need O_RDWR)
- Adding /usr/lib/wsl to GPU_BASELINE_READ_ONLY for CDI-injected GPU library bind-mounts that may not be covered by the /usr parent rule across filesystem boundaries

The existing path existence check in enrich_proto_baseline_paths() ensures all new entries are silently skipped on native Linux where these paths do not exist.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
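The detection change described in the commit message can be sketched as follows. The constant and the helper are illustrative, not the PR's actual implementation; only the candidate paths and the `has_gpu_devices()` name come from the source:

```rust
use std::path::Path;

/// Candidate GPU control nodes: /dev/nvidiactl on native Linux,
/// /dev/dxg on WSL2, where the GPU is exposed via the DXG kernel
/// driver instead of the nvidia* device nodes.
const GPU_CONTROL_NODES: &[&str] = &["/dev/nvidiactl", "/dev/dxg"];

/// True if any of the candidate paths exists on this host.
fn any_existing(candidates: &[&str]) -> bool {
    candidates.iter().any(|p| Path::new(p).exists())
}

/// Illustrative detection: GPU enrichment runs if any known
/// control node is present.
fn has_gpu_devices() -> bool {
    any_existing(GPU_CONTROL_NODES)
}

fn main() {
    assert!(any_existing(&["/"])); // "/" always exists
    assert!(!any_existing(&["/no/such/device/node"])); // missing node
    println!("gpu devices detected: {}", has_gpu_devices());
}
```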
… checks

Add ClusterRole and ClusterRoleBinding so the openshell service account can list nodes at the cluster scope, which is required by the GPU node capacity check in the Kubernetes driver.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Summary
Adds GPU sandbox support for WSL2-based systems. On WSL2, NVIDIA GPUs are exposed through the DXG kernel driver (`/dev/dxg`) rather than the native `nvidia*` devices, and GPU libraries are injected by CDI into `/usr/lib/wsl/` rather than standard Linux paths. Two changes are required:
1. **Device plugin version bump**: bumps `ghcr.io/nvidia/k8s-device-plugin` to `v0.19.1`, which includes upstream fixes for WSL2 CDI spec compatibility. See "wsl: report a single \"all\" device to kubelet" (k8s-device-plugin#1671).
2. **Landlock baseline**: `has_gpu_devices()` previously only checked for `/dev/nvidiactl`, which does not exist on WSL2, so GPU enrichment never ran. This left `/dev/dxg` (the WSL2 GPU device node) and `/proc` write access (required by CUDA for thread naming) unpermitted by Landlock. Fixed by extending GPU detection to also check `/dev/dxg`, adding it to the read-write baseline, and adding `/usr/lib/wsl` to the read-only baseline for CDI-injected GPU libraries.

The existing path existence checks in the enrichment logic ensure all new baseline entries are silently skipped on native Linux where these paths do not exist.
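The "silently skipped" behaviour can be sketched as an existence filter over the baseline lists. The constant names mirror the summary, but their contents and the filtering helper are assumptions for illustration:

```rust
use std::path::Path;

// Illustrative baseline contents; the real lists live in
// crates/openshell-sandbox/src/lib.rs.
const GPU_BASELINE_READ_WRITE: &[&str] =
    &["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/dxg"];
const GPU_BASELINE_READ_ONLY: &[&str] = &["/usr/lib/wsl"];

/// Keep only baseline entries that exist on this host, so WSL2-only
/// paths are dropped on native Linux and native-only paths are
/// dropped on WSL2.
fn existing_paths<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|p| Path::new(p).exists())
        .collect()
}

fn main() {
    // A path that always exists survives; a bogus one is skipped.
    assert_eq!(existing_paths(&["/", "/no/such/path"]), vec!["/"]);
    let rw = existing_paths(GPU_BASELINE_READ_WRITE);
    let ro = existing_paths(GPU_BASELINE_READ_ONLY);
    println!("rw baseline: {:?}, ro baseline: {:?}", rw, ro);
}
```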
Related Issue
Closes #404
Depends on #495 and #503.
Changes
- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml`: bump device plugin Helm chart to `v0.19.1`
- `crates/openshell-sandbox/src/lib.rs`: extend `has_gpu_devices()` to detect `/dev/dxg`; add `/dev/dxg` to `GPU_BASELINE_READ_WRITE` and `/usr/lib/wsl` to `GPU_BASELINE_READ_ONLY`

Testing

- `mise run pre-commit` passes
- `test_gpu_sandbox_reports_available_gpu` passes