
feat(gpu): pre-bake NVIDIA CUDA kernel module into the VHD #8661

Open
ganeshkumarashok wants to merge 1 commit into main from gpu-prebake-cuda-driver


@ganeshkumarashok


What

Move the ~100s in-CSE NVIDIA DKMS kernel-module compile off the node-provisioning critical path by compiling it at VHD build time.

  • VHD build (install-dependencies.sh), opt-in via FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE: after pre-pulling the aks-gpu-cuda image, run it in build-only mode. aks-gpu compiles + DKMS-registers the kernel module and stages userspace libs against the VHD's kernel with no device access (safe on the GPU-less Packer builders, which are Standard_D16ds_v5-class), and writes /opt/azure/aks-gpu/dkms-marker.
  • Node boot (cse_config.sh): configGPUDrivers passes install-skip-build when the marker is present, so aks-gpu skips the recompile and runs only device-init. aks-gpu re-validates the marker (kernel+version+kind) and falls back to a full build on mismatch, so behaviour is unchanged on non-prebaked VHDs (no marker → install).
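
The boot-time selection described in the second bullet can be sketched roughly as follows. This is a minimal illustration and not the code in cse_config.sh: the helper name select_gpu_install_action is hypothetical, GPU_DKMS_MARKER_FILE is the variable name used in the PR's ShellSpec setup, and the default path is the marker path given in the description.

```shell
#!/bin/sh
# Marker written at VHD build time (path taken from the PR description).
GPU_DKMS_MARKER_FILE="${GPU_DKMS_MARKER_FILE:-/opt/azure/aks-gpu/dkms-marker}"

# Hypothetical helper: print the action to pass to the aks-gpu entrypoint.
# Marker present -> skip the DKMS compile; no marker -> regular install.
select_gpu_install_action() {
    if [ -f "$GPU_DKMS_MARKER_FILE" ]; then
        echo "install-skip-build"
    else
        echo "install"
    fi
}
```

Because the check only looks for the file, the final decision still rests with aks-gpu, which re-validates the marker contents before trusting the prebuilt module.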

Why

The DKMS compile dominates GPU driver install time (~100s) on every GPU node boot. Pre-baking removes it from the critical path. The change works transparently on GPU pools, which default to ConfigGPUDriverIfNeeded=true.

Notes

  • Depends on the aks-gpu image that supports build-only/install-skip-build (see "feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module", aks-gpu#162).
  • The driver image is intentionally kept in the VHD — boot device-init still sources the container toolkit, fabric manager, containerd runtime config and udev rules from it. Dropping it for VHD-size savings is a separate, deferred follow-up.
  • A pipeline SKU job that sets FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE is needed to actually produce the prebaked VHD (not in this PR).

Tests

  • make generate testdata: no drift.
  • verify_shell: baseline unchanged.
  • 2 new ShellSpec examples for the marker → action selection (full suite: +2 examples, +0 new failures).

Move the ~100s in-CSE NVIDIA DKMS kernel-module compile off the node
provisioning critical path by compiling it at VHD build time.

VHD build (install-dependencies.sh), opt-in via FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE:
after pre-pulling the aks-gpu-cuda image, run the container in build-only mode
(/entrypoint.sh build-only). aks-gpu compiles + DKMS-registers the kernel module
and stages userspace libs against the VHD's kernel with NO device access (safe on
the GPU-less Packer builder) and writes /opt/azure/aks-gpu/dkms-marker.
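
The build-time gating and marker check can be sketched as below. This is a simplified illustration under stated assumptions, not the code added to install-dependencies.sh: the helper names prebake_enabled and require_dkms_marker are hypothetical, and a portable case match stands in for the grep here-string used in the PR.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the opt-in flag appears in FEATURE_FLAGS.
prebake_enabled() {
    case "${FEATURE_FLAGS:-}" in
        *NVIDIA_CUDA_PREBAKE*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: fail the build if build-only mode left no marker behind.
require_dkms_marker() {
    marker="${1:-/opt/azure/aks-gpu/dkms-marker}"
    if [ ! -f "$marker" ]; then
        echo "prebake did not produce $marker" >&2
        return 1
    fi
}
```

In a build script, the container run in build-only mode would sit between the two calls, so a missing marker stops the VHD build instead of shipping an image that silently falls back to a full compile at boot.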

Node boot (cse_config.sh): configGPUDrivers passes install-skip-build when the
marker is present, so aks-gpu skips the recompile and runs only the device-init
steps. aks-gpu re-validates the marker (kernel+version+kind) and falls back to a
full build on mismatch (kernel upgrade / shared VHD with a different driver kind),
so behaviour is unchanged on non-prebaked VHDs (no marker -> install).
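
The re-validation and fallback can be pictured with the sketch below. The marker format is an assumption made purely for illustration (key=value lines for kernel, version and kind); the real format and logic live in aks-gpu and are not shown in this PR. The function names marker_matches and decide_build are likewise hypothetical.

```shell
#!/bin/sh
# ASSUMPTION: the marker holds lines such as kernel=..., version=..., kind=...
# Succeeds only when all three recorded values match the running node.
marker_matches() {
    marker="$1"; want_kernel="$2"; want_version="$3"; want_kind="$4"
    [ -f "$marker" ] || return 1
    # -F: fixed strings (kernel versions contain dots), -x: whole-line match.
    grep -qxF "kernel=$want_kernel" "$marker" &&
        grep -qxF "version=$want_version" "$marker" &&
        grep -qxF "kind=$want_kind" "$marker"
}

# Print which path would be taken: reuse the prebuilt module or rebuild it.
decide_build() {
    if marker_matches "$@"; then
        echo "skip-build"
    else
        echo "full-build"
    fi
}
```

A call such as decide_build /opt/azure/aks-gpu/dkms-marker "$(uname -r)" 580.0.0 cuda would therefore choose the full build after a kernel upgrade, since the recorded kernel no longer matches.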

The driver image is intentionally kept in the VHD: boot-time device init still
sources the container toolkit, fabric manager, containerd runtime config and udev
rules from it. Dropping the image is a separate, deferred size optimization.

Requires the aks-gpu image that supports build-only/install-skip-build (PR #159).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR moves NVIDIA CUDA DKMS kernel-module compilation out of the node-provisioning (CSE) critical path by optionally pre-building the module during VHD bake, and then selecting a skip-build install path at node boot when a prebake marker is present.

Changes:

  • Add an opt-in (FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE) VHD-build step to run the aks-gpu-cuda container in build-only mode and require a generated DKMS marker.
  • Update configGPUDrivers (Ubuntu path) to choose install-skip-build when the marker exists, otherwise default to install.
  • Add ShellSpec coverage validating the marker → action selection logic.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
vhdbuilder/packer/install-dependencies.sh | Adds an opt-in prebake flow that runs the GPU container at image build time and verifies a DKMS marker.
parts/linux/cloud-init/artifacts/cse_config.sh | Selects install vs install-skip-build action for the GPU install container based on presence of a marker file.
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec tests to validate the action selection behavior.

Comment on lines +731 to +734
if grep -q "NVIDIA_CUDA_PREBAKE" <<< "$FEATURE_FLAGS"; then
echo "Pre-building NVIDIA CUDA kernel module into the VHD (build-only) for kernel $(uname -r)"
CTR_GPU_PREBUILD_CMD="ctr -n k8s.io run --privileged --rm --net-host --with-ns pid:/proc/1/ns/pid --mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind --mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind"
retrycmd_if_failure 3 10 600 bash -c "$CTR_GPU_PREBUILD_CMD $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG gpuprebuild /entrypoint.sh build-only" || exit 1
# Capture the action passed to the install container.
retrycmd_if_failure() { shift 3; echo "INSTALL_CMD: $*"; return 0; }

BeforeEach 'OS="$UBUNTU_OS_NAME"; NVIDIA_DRIVER_IMAGE="mcr.microsoft.com/aks/aks-gpu-cuda"; NVIDIA_DRIVER_IMAGE_TAG="580.0.0"; CTR_GPU_INSTALL_CMD="ctr-run"; GPU_DKMS_MARKER_FILE="$(mktemp -u)"'
@aks-node-assistant


AgentBaker Linux PR gate — E2E failure (shared test-fixture issue, NOT this PR)

  • Run: 167191538 (failed)
  • Failed task: Run AgentBaker E2E → AzureCLI exit 1 (DONE 457 tests, 95 skipped, 3 failures in 1598.65s)
  • All 3 failures on the same parent test:
    • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/default (3.41s) — test_helpers.go:227 🔴 empty error
    • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/scriptless_nbc (2.58s) — same
    • Parent container

Confirmed systemic — not caused by this PR. The identical sub-7s empty-error shape on the same ImagePullIdentityBinding_NetworkIsolated scenario has now hit 6+ unrelated PRs in the last 72h: #8330 (trivy), #8600 (kubelet-kubectl), #8659 (nvidia-dcgm), #8654 / #8653 (STLS), and now this PR.

PR #8661 only changes cse_config.sh (+10/-1), cse_config_spec.sh (+33), and vhdbuilder/packer/install-dependencies.sh (+22) to pre-bake the NVIDIA CUDA kernel module. None of these touch image-pull identity binding, ACR private endpoints, or NetworkIsolated test wiring.

Confidence: HIGH that this PR is not the cause. Recommended: rerun once; do not block merge on this signal. Owner: NodeSIG-dev / E2E infra to triage the ImagePullIdentityBinding_NetworkIsolated fixture (sub-7s empty failures = pre-VMSS / private-cluster precondition error — likely ACR-with-private-endpoint, identity RBAC, or subnet setup).

Strongest alternative (less likely): transient ACR-private-endpoint outage — refuted by the consistent sub-7s timing across 72h on multiple PRs.

Side-channel (not the cause, FYI): build2404gen2containerd SSH service configuration warning is a recurring non-fatal Packer log artifact, ignored.

Posted by Clawpilot AgentBaker gate detective.
