feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module#162

Open
ganeshkumarashok wants to merge 3 commits into main from gpu-prebuild-kernel-module

@ganeshkumarashok

Collaborator

What

Split install.sh so the expensive NVIDIA DKMS kernel-module compile can be hoisted to VHD build time, removing it from the node-provisioning critical path.

  • build_kernel_module (device-independent: nouveau blacklist, overlayfs + nvidia-installer --dkms, lib staging, ld config) is separated from device_init (device-dependent: nvidia-modprobe/nvidia-smi, fabric manager, containerd runtime config, dev-char symlinks).
  • New modes via entrypoint.sh:
    • build-only (AKSGPU_BUILD_ONLY=1): compile + DKMS-register + stage libs, no device access (safe on a GPU-less Packer builder), write /opt/azure/aks-gpu/dkms-marker.
    • install-skip-build (AKSGPU_SKIP_KERNEL_BUILD=1): skip recompile, run only device-init at node boot.
    • default install: unchanged legacy full path.
  • Marker re-validation (kernel + driver_version + driver_kind) gates the skip-build fast path; any mismatch (kernel upgrade, or a shared VHD whose baked driver kind/version differs) falls back to a full build.
  • remove_stale_baked_driver unloads loaded nvidia modules and removes a stale registered DKMS tree + relocated libs before a full rebuild, so a CUDA-baked VHD booting on a GRID node can't collide with the boot-time installer. No-op on today's VHDs.
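
The marker re-validation described above could look roughly like the sketch below. It is not the PR's actual code: the key=value marker layout, the helper names `marker_get` and `marker_matches`, and the argument order are assumptions made for illustration. The PR only states that the marker records kernel, driver_version, driver_kind and arch, and that all of kernel, driver_version and driver_kind must match.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: assumes the marker is a plain key=value text file.
# The real install.sh may store and parse the marker differently.

# marker_get FILE KEY: print the value stored for KEY, or nothing if absent.
marker_get() {
  sed -n "s/^$2=//p" "$1" | head -n 1
}

# marker_matches FILE KERNEL DRIVER_VERSION DRIVER_KIND
# Returns 0 only if FILE exists and all three recorded fields match exactly.
# Any other outcome (missing file, missing key, different value) returns 1,
# which the caller would treat as "fall back to a full build".
marker_matches() {
  local file=$1
  [ -f "$file" ] || return 1
  [ "$(marker_get "$file" kernel)" = "$2" ] || return 1
  [ "$(marker_get "$file" driver_version)" = "$3" ] || return 1
  [ "$(marker_get "$file" driver_kind)" = "$4" ] || return 1
}
```

A caller might invoke it as `marker_matches /opt/azure/aks-gpu/dkms-marker "$(uname -r)" "$DRIVER_VERSION" "$DRIVER_KIND"`, where the two variable names are placeholders. Comparing exact strings rather than version ranges keeps the check conservative: anything unexpected takes the slow but safe path.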

Why

The nvidia-installer --dkms compile takes ~100s and sits on the GPU provisioning critical path. Pre-baking it into the VHD removes it from every node boot.

Consumer

AgentBaker uses these modes (build-only at VHD bake, install-skip-build at CSE): Azure/AgentBaker PR for gpu-prebake-cuda-driver.

Safety

Default behaviour is unchanged when neither env var is set and no marker exists.
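
As a rough illustration of how the three actions relate to the two env vars, the mapping could be sketched as below. The helper name `mode_env_for_action` is hypothetical; the action names and variable names come from the description above. The real entrypoint.sh also stages files and passes the mode to the host via nsenter, which is omitted here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the action -> install-mode env var mapping.

# mode_env_for_action ACTION: print the env assignment for ACTION.
# The default "install" action sets neither variable, so it prints nothing.
mode_env_for_action() {
  case "$1" in
    build-only)         echo "AKSGPU_BUILD_ONLY=1" ;;
    install-skip-build) echo "AKSGPU_SKIP_KERNEL_BUILD=1" ;;
    install)            echo "" ;;
    *)                  echo "unknown action: $1" >&2; return 1 ;;
  esac
}
```

Keeping the default action free of any mode variable is what lets the legacy path stay unchanged: install.sh sees the same environment it saw before this PR.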

ganeshkumarashok and others added 2 commits May 31, 2026 11:55
…uilt kernel module

Split the host-side driver install into two phases so the NVIDIA kernel module
can be DKMS-compiled into the VHD at image build time and the boot-time install
can skip straight to device init:

- install.sh: refactor into build_kernel_module() (compile + stage userspace
  libs, no device access) and device_init() (modprobe, nvidia-smi, fabric
  manager, containerd config, udev). Add AKSGPU_BUILD_ONLY and
  AKSGPU_SKIP_KERNEL_BUILD modes, an overlay cleanup trap, and a dkms-marker
  (/opt/azure/aks-gpu/dkms-marker) recording kernel, driver_version,
  driver_kind and arch so the consumer (AgentBaker CSE) can validate an exact
  match before taking the skip-build fast path.
- entrypoint.sh: add build-only and install-skip-build actions and pass the
  mode through to the host via nsenter. The default install action is
  unchanged.

This is the aks-gpu half of the AgentBaker change that prebuilds the GPU kernel
module into the VHD to reduce node provisioning time. Secure Boot module
signing and GPU e2e validation are still required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The skip-kernel-build fast path now requires a matching dkms-marker
(kernel + driver_version + driver_kind); on mismatch it falls back to a
full build. Before any full (re)build, remove_stale_baked_driver unloads
loaded nvidia modules and removes a stale registered DKMS tree and its
relocated libs/loader config, so a CUDA-baked VHD booting on a GRID node
(or a version skew) cannot collide with the boot-time nvidia-installer.

No-op on today's VHDs (nothing baked / nothing registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nt; fix EXIT trap

- entrypoint.sh: collapse the per-action cp/echo duplication into a single
  staging block; the case now only maps action -> install-mode env var.
- install.sh: the second `trap ... EXIT` (cleanup_overlay) was silently
  replacing the earlier `trap 'PS4="+ "' exit`; fold both into one EXIT trap.
- install.sh: only run remove_stale_baked_driver when a baked marker is present,
  so the default `install` path (non-prebaked VHD, no marker) is byte-for-byte
  the legacy behaviour and never touches DKMS cleanup. The cleanup still runs
  for the reachable mismatch case (marker present but kind/version/kernel differs).
- Correct comments that framed the mismatch path as defensive / "no-op on
  today's VHDs"; it is reachable because AgentBaker gates skip-build on marker
  presence alone.

No functional change to build-only or the matching skip-build fast path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ganeshkumarashok

Collaborator Author

Simplification pass (commit ffb671b)

I went looking for "more complex than necessary" and made three targeted changes:

  1. entrypoint.sh — dedup. The three case arms each repeated the same cp -a /opt/gpu/. /mnt/gpu/ + echo. Collapsed into one staging block; the case now only maps action → install-mode env var. (−10 lines, same behaviour.)

  2. install.sh — fixed a real (if cosmetic) bug. trap cleanup_overlay EXIT was silently replacing the earlier trap 'PS4="+ "' exit — a second trap … EXIT overwrites the first, it doesn't chain. Folded both into a single trap 'cleanup_overlay; PS4="+ "' EXIT.

  3. install.sh — scoped the stale-driver cleanup to prebaked VHDs. remove_stale_baked_driver now only runs when a baked marker is actually present. So the default install path (non-prebaked VHD, no marker) is byte-for-byte the legacy behaviour and never touches DKMS cleanup — smaller blast radius for the new logic.
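
The trap behaviour from item 2 is easy to reproduce in isolation. The snippet below is a standalone demonstration, not code from install.sh; the handlers just echo a word so the effect is visible.

```bash
#!/usr/bin/env bash
# Standalone demonstration: a second `trap ... EXIT` replaces the first
# instead of chaining. Each case runs in its own bash process so the
# traps cannot interfere with each other or with the calling shell.

# Case 1: two separate EXIT traps. Only the second handler runs.
overwritten=$(bash -c "trap 'echo first' EXIT; trap 'echo second' EXIT")

# Case 2: one combined EXIT trap. Both handlers run, in order.
combined=$(bash -c "trap 'echo first; echo second' EXIT")

printf 'separate traps -> %s\n' "$overwritten"   # prints: second
printf 'combined trap  -> %s\n' "$combined"      # prints: first, then second on a new line
```

Folding both handlers into one command string, as the fix does, is the simplest way to get chaining without tracking previously installed traps.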

Why I did not remove the stale-driver cleanup

My first instinct was that remove_stale_baked_driver / unload_nvidia_modules (~60 lines) were defending a hypothetical shared-VHD scenario and could be dropped. Checking the consumer (Azure/AgentBaker#8661) changed my mind: AgentBaker selects install-skip-build purely on marker file presence and delegates the actual kind/version/kernel match to baked_marker_matches here. That makes the mismatch → full-build path genuinely reachable (a CUDA-baked VHD on a GRID SKU, or a driver-version bump since bake), and since nvidia-installer --dkms collides with an already-registered DKMS tree, the cleanup is load-bearing for that fallback to work. So I kept it, scoped it, and rewrote the comments that called it a "no-op on today's VHDs" (which is what nearly led me to delete it).
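
The resulting boot-time decision in install-skip-build mode can be summarised with a small sketch. This is an illustration under assumptions, not the PR's code: `plan_install` is a hypothetical helper that only prints which steps would run, and it takes the marker state as plain "yes"/"no" arguments rather than inspecting the system. The step names echo the functions named in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the step ordering for install-skip-build.

# plan_install MARKER_PRESENT MARKER_MATCHES   (each "yes" or "no")
# Prints the ordered list of steps that would run.
plan_install() {
  local present=$1 matches=$2
  if [ "$present" = yes ] && [ "$matches" = yes ]; then
    # Fast path: the baked module is valid for this kernel and driver.
    echo "device_init"
  elif [ "$present" = yes ]; then
    # Mismatch: clear the stale DKMS tree first so nvidia-installer --dkms
    # does not collide with it, then do a full build.
    echo "remove_stale_baked_driver build_kernel_module device_init"
  else
    # Nothing baked: plain full build, no DKMS cleanup.
    echo "build_kernel_module device_init"
  fi
}
```

For example, `plan_install yes no` corresponds to a CUDA-baked VHD booting on a GRID SKU: the cleanup runs before the rebuild, which is the case that makes the cleanup load-bearing.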

No functional change to build-only or the matching install-skip-build fast path. bash -n passes on both files; shellcheck reports no new findings (the remaining hits are pre-existing info-level SC2086 on lines this PR doesn't touch).
