feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module #162
ganeshkumarashok wants to merge 3 commits into
…uilt kernel module

Split the host-side driver install into two phases so the NVIDIA kernel module can be DKMS-compiled into the VHD at image build time and the boot-time install can skip straight to device init:

- install.sh: refactor into build_kernel_module() (compile + stage userspace libs, no device access) and device_init() (modprobe, nvidia-smi, fabric manager, containerd config, udev). Add AKSGPU_BUILD_ONLY and AKSGPU_SKIP_KERNEL_BUILD modes, an overlay cleanup trap, and a dkms-marker (/opt/azure/aks-gpu/dkms-marker) recording kernel, driver_version, driver_kind and arch so the consumer (AgentBaker CSE) can validate an exact match before taking the skip-build fast path.
- entrypoint.sh: add build-only and install-skip-build actions and pass the mode through to the host via nsenter. The default install action is unchanged.

This is the aks-gpu half of the AgentBaker change that prebuilds the GPU kernel module into the VHD to reduce node provisioning time. Secure Boot module signing and GPU e2e validation are still required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
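The phase split and the two mode variables described above can be sketched roughly as follows. This is a minimal illustration, not the actual install.sh: the two phase functions are echo-only stubs, `run_install` is a hypothetical name, and the marker validation is left out entirely.

```shell
#!/usr/bin/env bash
# Stubs standing in for the real phases; here they only print their name.
build_kernel_module() { echo "build_kernel_module"; }
device_init()         { echo "device_init"; }

# Hypothetical dispatcher: pick the phases to run from the mode variables.
run_install() {
    if [ "${AKSGPU_BUILD_ONLY:-}" = "1" ]; then
        build_kernel_module        # VHD build time: compile and stage, no device access
    elif [ "${AKSGPU_SKIP_KERNEL_BUILD:-}" = "1" ]; then
        device_init                # node boot: module already built into the VHD
    else
        build_kernel_module        # default: legacy full path, both phases
        device_init
    fi
}
```

With neither variable set, both phases run in order, which is the sense in which the default install action stays unchanged.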
The skip-kernel-build fast path now requires a matching dkms-marker (kernel + driver_version + driver_kind); on mismatch it falls back to a full build.

Before any full (re)build, remove_stale_baked_driver unloads loaded nvidia modules and removes a stale registered DKMS tree and its relocated libs/loader config, so a CUDA-baked VHD booting on a GRID node (or a version skew) cannot collide with the boot-time nvidia-installer. No-op on today's VHDs (nothing baked / nothing registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
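A rough sketch of the match-or-fall-back decision is shown below. The on-disk marker format is not given in this PR text, so the `key=value` layout is an assumption, and `marker_field`, `marker_matches` and `choose_path` are hypothetical helper names used only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the marker is a plain file of key=value lines.

marker_field() {
    # Print the value stored for key $2 in marker file $1 (nothing if absent).
    sed -n "s/^$2=//p" "$1" 2>/dev/null | head -n 1
}

marker_matches() {
    # $1 = marker path, $2 = kernel, $3 = driver version, $4 = driver kind.
    # Succeeds only if the marker exists and all three fields match exactly.
    [ -f "$1" ] || return 1
    [ "$(marker_field "$1" kernel)" = "$2" ] &&
        [ "$(marker_field "$1" driver_version)" = "$3" ] &&
        [ "$(marker_field "$1" driver_kind)" = "$4" ]
}

choose_path() {
    # Same arguments as marker_matches; prints which path would be taken.
    if marker_matches "$@"; then
        echo "skip-build"
    else
        echo "full-build"   # any mismatch or missing marker falls back
    fi
}
```

A missing marker and a mismatched marker both end on the full-build path here, so the fast path is only ever taken on an exact match.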
…nt; fix EXIT trap

- entrypoint.sh: collapse the per-action cp/echo duplication into a single staging block; the case now only maps action -> install-mode env var.
- install.sh: the second `trap ... EXIT` (cleanup_overlay) was silently replacing the earlier `trap 'PS4="+ "' exit`; fold both into one EXIT trap.
- install.sh: only run remove_stale_baked_driver when a baked marker is present, so the default `install` path (non-prebaked VHD, no marker) is byte-for-byte the legacy behaviour and never touches DKMS cleanup. The cleanup still runs for the reachable mismatch case (marker present but kind/version/kernel differs).
- Correct comments that framed the mismatch path as defensive / "no-op on today's VHDs"; it is reachable because AgentBaker gates skip-build on marker presence alone.

No functional change to build-only or the matching skip-build fast path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
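The EXIT trap issue is general shell behaviour: a shell keeps one handler per trap condition, so a later `trap ... EXIT` replaces an earlier one. The sketch below demonstrates this in subshells with placeholder handlers; `overwritten_traps`, `combined_trap` and `on_exit` are illustrative names, not code from install.sh.

```shell
#!/usr/bin/env bash

# Two separate EXIT traps: only the last one registered runs.
overwritten_traps() {
    (
        trap 'echo "first handler"' EXIT
        trap 'echo "second handler"' EXIT   # silently replaces the first
    )
}

# One combined handler: both actions run when the subshell exits.
combined_trap() {
    (
        on_exit() {
            echo "first handler"
            echo "second handler"
        }
        trap on_exit EXIT
    )
}
```

Calling `overwritten_traps` prints only `second handler`, while `combined_trap` prints both lines, which is the effect of folding two handlers into a single EXIT trap.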
Simplification pass (commit ffb671b)

I went looking for "more complex than necessary" and made three targeted changes:

Why I did not remove the stale-driver cleanup

My first instinct was that
No functional change to
What

Split `install.sh` so the expensive NVIDIA DKMS kernel-module compile can be hoisted to VHD build time, removing it from the node-provisioning critical path.

- `build_kernel_module` (device-independent: nouveau blacklist, overlayfs + `nvidia-installer --dkms`, lib staging, ld config) is separated from `device_init` (device-dependent: `nvidia-modprobe` / `nvidia-smi`, fabric manager, containerd runtime config, dev-char symlinks).
- `entrypoint.sh`:
  - `build-only` (`AKSGPU_BUILD_ONLY=1`): compile + DKMS-register + stage libs, no device access (safe on a GPU-less Packer builder), write `/opt/azure/aks-gpu/dkms-marker`.
  - `install-skip-build` (`AKSGPU_SKIP_KERNEL_BUILD=1`): skip recompile, run only device-init at node boot.
  - `install`: unchanged legacy full path.
- `remove_stale_baked_driver` unloads loaded nvidia modules and removes a stale registered DKMS tree + relocated libs before a full rebuild, so a CUDA-baked VHD booting on a GRID node can't collide with the boot-time installer. No-op on today's VHDs.

Why

The `nvidia-installer --dkms` compile is ~100s on the GPU provisioning critical path. Pre-baking it into the VHD removes it from every node boot.

Consumer

AgentBaker uses these modes (build-only at VHD bake, install-skip-build at CSE): Azure/AgentBaker PR for `gpu-prebake-cuda-driver`.

Safety

Default behaviour is unchanged when neither env var is set and no marker exists.
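For reference, the action-to-mode mapping in `entrypoint.sh` described under What might look something like the sketch below. Only the action names and the two variable names come from this PR; `mode_env_for_action` is a hypothetical helper, and the staging and nsenter steps are deliberately not reproduced.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map an entrypoint action to the environment
# assignment that selects the install mode (empty for the legacy path).
mode_env_for_action() {
    case "$1" in
        build-only)         echo "AKSGPU_BUILD_ONLY=1" ;;
        install-skip-build) echo "AKSGPU_SKIP_KERNEL_BUILD=1" ;;
        install)            echo "" ;;   # legacy full path: no mode variable
        *)                  echo "unknown action: $1" >&2; return 1 ;;
    esac
}
```

Keeping the `case` limited to this mapping means the `install` action sets neither variable, which matches the Safety note above.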