feat(gpu): pre-bake NVIDIA CUDA kernel module into the VHD #8661
ganeshkumarashok wants to merge 1 commit
Move the ~100s in-CSE NVIDIA DKMS kernel-module compile off the node provisioning critical path by compiling it at VHD build time.

VHD build (install-dependencies.sh), opt-in via FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE: after pre-pulling the aks-gpu-cuda image, run the container in build-only mode (/entrypoint.sh build-only). aks-gpu compiles + DKMS-registers the kernel module and stages userspace libs against the VHD's kernel with NO device access (safe on the GPU-less Packer builder) and writes /opt/azure/aks-gpu/dkms-marker.

Node boot (cse_config.sh): configGPUDrivers passes install-skip-build when the marker is present, so aks-gpu skips the recompile and runs only the device-init steps. aks-gpu re-validates the marker (kernel+version+kind) and falls back to a full build on mismatch (kernel upgrade / shared VHD with a different driver kind), so behaviour is unchanged on non-prebaked VHDs (no marker -> install).

The driver image is intentionally kept in the VHD: boot-time device init still sources the container toolkit, fabric manager, containerd runtime config and udev rules from it. Dropping the image is a separate, deferred size optimization.

Requires the aks-gpu image that supports build-only/install-skip-build (PR #159).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR moves NVIDIA CUDA DKMS kernel-module compilation out of the node-provisioning (CSE) critical path by optionally pre-building the module during VHD bake, and then selecting a skip-build install path at node boot when a prebake marker is present.
Changes:
- Add an opt-in (`FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE`) VHD-build step to run the `aks-gpu-cuda` container in `build-only` mode and require a generated DKMS marker.
- Update `configGPUDrivers` (Ubuntu path) to choose `install-skip-build` when the marker exists, otherwise default to `install`.
- Add ShellSpec coverage validating the marker → action selection logic.
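The marker → action selection described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name `select_gpu_install_action` is hypothetical, `GPU_DKMS_MARKER_FILE` mirrors the variable that appears in the spec, and the default path is the marker location named in the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the real logic lives inside configGPUDrivers and may differ.
# Default marker path taken from the PR description.
GPU_DKMS_MARKER_FILE="${GPU_DKMS_MARKER_FILE:-/opt/azure/aks-gpu/dkms-marker}"

select_gpu_install_action() {
    if [ -f "${GPU_DKMS_MARKER_FILE}" ]; then
        # Marker present: the module was compiled at VHD build time, so skip the rebuild.
        echo "install-skip-build"
    else
        # No marker (non-prebaked VHD): keep the existing full-install behaviour.
        echo "install"
    fi
}
```

Keeping the decision in a function that only inspects the filesystem makes it easy to exercise in a unit test without a GPU or a container runtime.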
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vhdbuilder/packer/install-dependencies.sh | Adds an opt-in prebake flow that runs the GPU container at image build time and verifies a DKMS marker. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Selects install vs install-skip-build action for the GPU install container based on presence of a marker file. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec tests to validate the action selection behavior. |
if grep -q "NVIDIA_CUDA_PREBAKE" <<< "$FEATURE_FLAGS"; then
  echo "Pre-building NVIDIA CUDA kernel module into the VHD (build-only) for kernel $(uname -r)"
  CTR_GPU_PREBUILD_CMD="ctr -n k8s.io run --privileged --rm --net-host --with-ns pid:/proc/1/ns/pid --mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind --mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind"
  retrycmd_if_failure 3 10 600 bash -c "$CTR_GPU_PREBUILD_CMD $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG gpuprebuild /entrypoint.sh build-only" || exit 1
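The review summary says the prebake step also requires a generated DKMS marker after the build-only run. A check of that kind might look like the sketch below. The function name `verify_gpu_prebake_marker` and the error message are assumptions for illustration; only the marker path comes from the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical check; the PR's real verification code is not shown in this excerpt.
verify_gpu_prebake_marker() {
    # Marker path as named in the PR description; overridable for testing.
    local marker="${1:-/opt/azure/aks-gpu/dkms-marker}"
    if [ ! -f "${marker}" ]; then
        echo "NVIDIA prebake did not produce marker ${marker}" >&2
        return 1
    fi
    echo "Found NVIDIA DKMS marker at ${marker}"
}
```

In a VHD build script, a caller would typically follow this with `|| exit 1` so that a missing marker fails the image build instead of silently shipping a non-prebaked VHD.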
# Capture the action passed to the install container.
retrycmd_if_failure() { shift 3; echo "INSTALL_CMD: $*"; return 0; }

BeforeEach 'OS="$UBUNTU_OS_NAME"; NVIDIA_DRIVER_IMAGE="mcr.microsoft.com/aks/aks-gpu-cuda"; NVIDIA_DRIVER_IMAGE_TAG="580.0.0"; CTR_GPU_INSTALL_CMD="ctr-run"; GPU_DKMS_MARKER_FILE="$(mktemp -u)"'
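The stubbing technique in the spec fragment above (replace `retrycmd_if_failure` with a function that drops its three leading retry arguments and echoes the rest) can be tried outside ShellSpec in plain bash. In this sketch `run_gpu_install` is a hypothetical stand-in for the function under test, and the container name `gpuinstall` is an assumption; the variable values are the ones set in the spec's `BeforeEach`.

```shell
#!/usr/bin/env bash
# Stub: discard the three leading retry arguments and print the command that would run.
retrycmd_if_failure() { shift 3; echo "INSTALL_CMD: $*"; return 0; }

# Values copied from the spec's BeforeEach.
NVIDIA_DRIVER_IMAGE="mcr.microsoft.com/aks/aks-gpu-cuda"
NVIDIA_DRIVER_IMAGE_TAG="580.0.0"
CTR_GPU_INSTALL_CMD="ctr-run"

# Hypothetical stand-in for the code under test; the real configGPUDrivers does more.
run_gpu_install() {
    local action="install"
    if [ -f "${GPU_DKMS_MARKER_FILE}" ]; then
        action="install-skip-build"
    fi
    # "gpuinstall" is an assumed container name, used here only for illustration.
    retrycmd_if_failure 3 10 600 bash -c \
        "${CTR_GPU_INSTALL_CMD} ${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG} gpuinstall /entrypoint.sh ${action}"
}
```

Because the stub only echoes, the test can assert on the trailing action in the captured `INSTALL_CMD:` line without starting a container.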
AgentBaker Linux PR gate — E2E failure (shared test-fixture issue, NOT this PR)
Confirmed systemic, not caused by this PR. The identical sub-7s empty-error shape on the same PR #8661 only changes
Confidence: HIGH that this PR is not the cause.
Recommended: rerun once; do not block merge on this signal.
Owner: NodeSIG-dev / E2E infra to triage the
Strongest alternative (less likely): transient ACR-private-endpoint outage, refuted by the consistent sub-7s timing across 72h on multiple PRs.
Side-channel (not the cause, FYI):
Posted by Clawpilot AgentBaker gate detective.
What
Move the ~100s in-CSE NVIDIA DKMS kernel-module compile off the node-provisioning critical path by compiling it at VHD build time.
- VHD build (`install-dependencies.sh`), opt-in via `FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE`: after pre-pulling the `aks-gpu-cuda` image, run it in build-only mode. aks-gpu compiles + DKMS-registers the kernel module and stages userspace libs against the VHD's kernel with no device access (safe on the GPU-less Packer builders, which are `Standard_D16ds_v5`-class), and writes `/opt/azure/aks-gpu/dkms-marker`.
- Node boot (`cse_config.sh`): `configGPUDrivers` passes `install-skip-build` when the marker is present, so aks-gpu skips the recompile and runs only device-init. aks-gpu re-validates the marker (kernel+version+kind) and falls back to a full build on mismatch, so behaviour is unchanged on non-prebaked VHDs (no marker → `install`).

Why
The DKMS compile dominates GPU driver install (~100s) on every GPU node boot. Pre-baking removes it from the critical path. Works transparently on GPU pools (which default to `ConfigGPUDriverIfNeeded=true`).

Notes
- `FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE` is needed to actually produce the prebaked VHD (not in this PR).

Tests
- `make generatetestdata`: no drift.
- `verify_shell`: baseline unchanged.
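The description says aks-gpu re-validates the marker against kernel, version and kind and falls back to a full build on mismatch. That logic lives in aks-gpu and is not shown here, so the following is only a sketch under stated assumptions: the `key=value` marker format and the function names `marker_field` and `marker_matches` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical marker format (an assumption, not aks-gpu's real format):
#   kernel=<output of uname -r at build time>
#   version=<driver version>
#   kind=<driver kind>

# Print the value stored for key $2 in marker file $1 (empty if the key is absent).
marker_field() {
    sed -n "s/^$2=//p" "$1" | head -n 1
}

# Succeed only if the marker exists and all three recorded fields match the arguments.
marker_matches() {
    local marker="$1" kernel="$2" version="$3" kind="$4"
    [ -f "${marker}" ] || return 1
    [ "$(marker_field "${marker}" kernel)" = "${kernel}" ] &&
        [ "$(marker_field "${marker}" version)" = "${version}" ] &&
        [ "$(marker_field "${marker}" kind)" = "${kind}" ]
}
```

A caller would skip the DKMS build only when `marker_matches` succeeds and run the full build otherwise, which is what keeps a kernel upgrade or a different driver kind from reusing a stale module.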