From be01e29082f66a3d05fcba64c83636700cefff29 Mon Sep 17 00:00:00 2001 From: Roshni Malani Date: Mon, 8 Jun 2026 11:27:46 -0700 Subject: [PATCH 1/3] docs(rfc): add RFC 0005 for warm-pooled sandboxes Propose adopting the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim, extensions.agents.x-k8s.io/v1alpha1) on the Kubernetes driver to hand out pre-warmed sandbox pods in ~milliseconds instead of cold-starting a Sandbox CR per request. Documents the claim-based create flow, what bakes into the shared template vs. late-binds over the supervisor relay, the one security-sensitive change (re-anchoring sandbox identity to the gateway-created SandboxClaim in auth/k8s_sa.rs), risks, alternatives, and a phased rollout. Drafted from a local spike validated against agent-sandbox v0.4.6. Signed-off-by: Roshni Malani --- rfc/0005-warm-pooled-sandboxes/README.md | 175 +++++++++++++++++++++++ 1 file changed, 175 insertions(+) create mode 100644 rfc/0005-warm-pooled-sandboxes/README.md diff --git a/rfc/0005-warm-pooled-sandboxes/README.md b/rfc/0005-warm-pooled-sandboxes/README.md new file mode 100644 index 000000000..f3ebb4ab5 --- /dev/null +++ b/rfc/0005-warm-pooled-sandboxes/README.md @@ -0,0 +1,175 @@ +--- +authors: + - "@rmalani-nv" +state: review +links: + - https://github.com/NVIDIA/OpenShell/pull/1813 + - https://github.com/kubernetes-sigs/agent-sandbox/releases/tag/v0.4.6 + - https://github.com/kubernetes-sigs/agent-sandbox + - https://agent-sandbox.sigs.k8s.io/docs/ +--- + +# RFC 0005 - Warm-Pooled Sandboxes + +## Summary + +Add support for **warm-pooled sandboxes** on the Kubernetes compute driver by +adopting the upstream [agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) +warm-pool extension CRDs — `SandboxTemplate`, `SandboxWarmPool`, and +`SandboxClaim` (`extensions.agents.x-k8s.io/v1alpha1`). 
Instead of cold-starting +a `Sandbox` CR + Pod per request, the gateway claims a pre-provisioned, ready Pod +from a pool, cutting time-to-ready from seconds to milliseconds. The extensions +ship in the same `v0.4.6` release OpenShell already pins for the core `Sandbox` +CRD; OpenShell simply does not install or use them today. + +## Motivation + +Creating a Kubernetes sandbox today is a cold start: the gateway creates a +`Sandbox` CR, the agent-sandbox controller creates a Pod, the image is pulled (or +read from cache), the supervisor boots, and only then does the sandbox become +`Ready`. Measured locally this is ~4s+ even with the image preloaded. For +interactive agent workloads and high-churn "fresh sandbox per task" usage, that +latency dominates. A warm pool keeps N ready Pods standing by so a claim binds in +**~0.1s** (measured on a local spike). + +## Non-goals + +- Changing the default (cold) sandbox-create path. Warm pooling is additive and + opt-in; sandboxes that don't match a pool fall back to a cold create. +- GPU warm pools in the initial rollout (idle accelerators are expensive; opt-in + later, per pool). +- Migrating OpenShell's core `Sandbox` usage from `v1alpha1` to `v1beta1`. The + pinned `v0.4.6` release serves `v1alpha1` for both core and extensions; + upstream `main` (`v1beta1`, mutually exclusive claim fields) is out of scope + until OpenShell bumps the pinned version. +- Non-Kubernetes drivers (Docker, Podman, VM). Warm pooling is a + Kubernetes-driver capability in this RFC. 
+ +## Proposal + +### Extension CRDs (verified against v0.4.6) + +| CRD (`extensions.agents.x-k8s.io/v1alpha1`) | Role | +|---|---| +| `SandboxTemplate` | Reusable blueprint: `spec.podTemplate`, `spec.volumeClaimTemplates`, `spec.networkPolicy` | +| `SandboxWarmPool` | Keeps N Pods warm: `spec.replicas`, `spec.sandboxTemplateRef`; `status.{readyReplicas,replicas,selector}` (HPA-scalable) | +| `SandboxClaim` | Binds a warm Pod: `spec.sandboxTemplateRef` (required), `spec.warmpool`, `spec.additionalPodMetadata.{annotations,labels}`, `spec.env[]`, `spec.lifecycle`; `status.sandbox.{name,podIPs}` | + +A `SandboxWarmPool` pre-creates real `Sandbox` CRs from a `SandboxTemplate`; each +warm Pod is owned by a *controlling* `Sandbox` ownerReference. A `SandboxClaim` +binds one of those warm `Sandbox`/Pods and reports the bound `Sandbox` in +`status.sandbox.name`. The claimed Pod's owning `Sandbox` CR is in turn owned by +the `SandboxClaim` (controlling ownerReference) and labeled +`agents.x-k8s.io/claim-uid`. + +### Claim-based create flow + +The gateway pre-declares one or more `SandboxWarmPool`s (+ their +`SandboxTemplate`s), each carrying the **shared** OpenShell Pod configuration +(image, mTLS secret mount, projected SA-token volume, supervisor sideload, Linux +capabilities, host aliases, runtimeClass, resources, workspace +`volumeClaimTemplates`). On `CreateSandbox`, when the requested shape matches a +pool, the Kubernetes driver creates a `SandboxClaim` (instead of a `Sandbox`) +that injects the per-sandbox identity via +`additionalPodMetadata.annotations[openshell.io/sandbox-id]`, then watches the +claim and maps `status.sandbox.{name,podIPs}` + conditions to `SandboxPhase`. + +What bakes vs. late-binds: + +- **Baked into the shared `SandboxTemplate`:** everything generic across pooled + Pods (TLS, SA token, supervisor, caps, workspace VCT). +- **Injected per-claim (annotation only):** `openshell.io/sandbox-id`. 
Per-claim + `env[]` is **rejected on the warm path** (Pod env is immutable once running), so + identity must not ride Pod env. +- **Late-bound at runtime over the supervisor relay (already works):** policy, + providers. Sandbox identity is established by the existing token exchange — the + supervisor presents its projected SA token to `IssueSandboxToken`, and the + gateway resolves identity server-side. The supervisor's `--sandbox-id` is + optional (log-push/policy labeling only). + +### Identity re-anchoring (the one security-sensitive change) + +Today `validate_sandbox_owner_reference()` in +`crates/openshell-server/src/auth/k8s_sa.rs` authenticates a sandbox by +cross-checking the owning `Sandbox` CR's `openshell.ai/sandbox-id` label against +the Pod's `openshell.io/sandbox-id` annotation. On the warm path the pool +controller creates the `Sandbox` CR generically, so it carries +`agents.x-k8s.io/claim-uid` (+ a controlling `SandboxClaim` ownerReference) +instead of OpenShell's label. + +The check must therefore **re-anchor to the gateway-created `SandboxClaim`**: +resolve Pod → owning `Sandbox` CR → controlling `SandboxClaim` (name + uid) → +the sandbox-id the gateway recorded for that claim (gateway Store, keyed by +claim-uid), and verify the claim is bound (`status.sandbox.name` equals the +owning CR) and that its recorded sandbox-id equals the Pod annotation. This +preserves the existing invariant — *the sandbox-id a Pod can obtain equals a +value only the gateway wrote, on an object the sandbox workload cannot mutate*. +The sandbox ServiceAccount has no write access to `sandboxclaims` or Pods today +(confirmed on a live cluster), and the phase-2/phase-3 RBAC must preserve that. +The TokenReview, pod-UID, and ownerReference legs are unchanged. + +## Implementation plan + +Rollout is incremental; each phase is a separate, reviewable PR. The +security-sensitive auth change (phase 3) is gated behind `state:agent-ready`. + +1. 
**Install the extensions (this PR).** Apply `extensions.yaml` in the local + k3d dev script and the e2e kube harness so clusters are ready for warm + pooling. No gateway behavior change yet. +2. **Driver warm path (flagged).** When a sandbox maps to a configured pool, the + Kubernetes driver creates a `SandboxClaim` (template + warmpool + + `additionalPodMetadata.annotations[openshell.io/sandbox-id]`) instead of a + `Sandbox`; watch the claim and map `status` → `SandboxPhase`. Keep the + direct-`Sandbox` path as the cold fallback. Add gateway RBAC for + `extensions.agents.x-k8s.io` (`sandboxclaims`, `sandboxtemplates`, + `sandboxwarmpools`) in the Helm chart. +3. **Auth re-anchoring.** Adapt `validate_sandbox_owner_reference()` for the + claim-based identity check above; fail closed; extend the table-driven tests + in `k8s_sa.rs` with the spoof case (Pod annotation ≠ claim record → reject). +4. **Pool management.** Gateway declares/reconciles `SandboxTemplate` + + `SandboxWarmPool` from gateway config (one per template/image shape); sizing, + `replicas`, GC of drained pools. +5. **Surface + docs.** `gateway.toml` pool config (`docs/reference/gateway-config.mdx`), + CLI/TUI visibility, OCSF events, e2e coverage, published Kubernetes docs. + +## Risks + +- **Identity binding is security-sensitive.** Mishandled, a sandbox could + impersonate another sandbox-id. Mitigated by re-anchoring to the + gateway-created claim, failing closed, threat-model unit tests, an RBAC + assertion test, an adversarial security review, and OCSF detection findings on + mismatch. See phase 3. +- **Pool shape rigidity.** A pool is one (image, resources, runtimeClass, gpu) + shape; heterogeneous sandboxes need a pool each, and unmatched requests fall + back to cold. Warm pooling pays off most for the high-churn default image. +- **Idle cost.** Warm Pods consume resources while idle; GPU pools especially. + Sizing must be operator-controlled and default conservative. 
+- **Upstream API drift.** `v0.4.6` extensions are `v1alpha1`; `main` is `v1beta1` + with different claim semantics. Pin and bump deliberately. + +## Alternatives + +- **Patch identity onto the claimed Pod/`Sandbox` after bind** (keep the existing + label cross-check). Rejected: requires granting the gateway `patch pods` + (currently denied for immutability) and is racy. +- **Bare-Pod warm pools** (if upstream changes the pool to create Pods, not + `Sandbox` CRs — see upstream issue #390). Would break the ownerReference auth + chain and force a larger rework. The pinned `v0.4.6` creates `Sandbox` CRs. +- **Do nothing.** Accept cold-start latency. Viable for low-churn usage but poor + for interactive agents. + +## Prior art + +Upstream agent-sandbox documents warm pooling end to end, including an +HPA-driven autoscaling example keyed on `agent_sandbox_claim_creation_total` and +the `SandboxWarmPool` `status.selector`. OpenShell already builds on the core +`Sandbox` CRD from the same project. + +## Open questions + +- Sandbox-id delivery to the supervisor on the warm path: rely solely on the + gateway JWT, or add a Downward API volume projecting the claim-injected + annotation for log-push/policy labeling? +- Workspace PVC semantics for pooled `Sandbox`es (each warm `Sandbox` seeds its + own PVC from the image — confirm under the `volumeClaimTemplates` path). +- Pool sizing / autoscaling policy and config surface in `gateway.toml`. From 23fba73bb56f21a39a74ee9f90ccefe921983b84 Mon Sep 17 00:00:00 2001 From: Roshni Malani Date: Mon, 8 Jun 2026 11:27:46 -0700 Subject: [PATCH 2/3] chore(deploy): install agent-sandbox warm-pool extensions in dev and e2e clusters Apply extensions.yaml alongside manifest.yaml when bootstrapping the local k3d dev cluster and the e2e kube harness, reusing the pinned AGENT_SANDBOX_VERSION already used for core. 
This installs the SandboxTemplate / SandboxWarmPool / SandboxClaim CRDs and reconfigures the existing agent-sandbox-controller, so clusters are ready for the warm-pooled sandbox path (RFC 0005). extensions.yaml rolls the controller deployment, so the e2e harness waits for the rollout after both applies and for the new extension CRDs to be Established. No gateway behavior changes yet. Signed-off-by: Roshni Malani --- e2e/with-kube-gateway.sh | 10 +++++++++- tasks/scripts/helm-k3s-local.sh | 5 +++++ 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/e2e/with-kube-gateway.sh b/e2e/with-kube-gateway.sh index 39ba899db..52da97b2d 100755 --- a/e2e/with-kube-gateway.sh +++ b/e2e/with-kube-gateway.sh @@ -533,10 +533,18 @@ fi # The Kubernetes compute driver creates and watches Sandbox CRs reconciled # by the upstream agent-sandbox-controller. Without the CRD + controller, # every gateway K8s call 404s and CreateSandbox never produces a Pod. -echo "Installing agent-sandbox CRDs and controller (${AGENT_SANDBOX_VERSION})..." +# The warm-pool extensions (SandboxTemplate / SandboxWarmPool / SandboxClaim) are +# applied alongside core so the e2e cluster matches the dev cluster and is ready +# for warm-pooled sandbox coverage. extensions.yaml reconfigures the +# agent-sandbox-controller deployment, so wait for the rollout after both applies. +echo "Installing agent-sandbox CRDs, controller, and warm-pool extensions (${AGENT_SANDBOX_VERSION})..." 
_agent_sandbox_base="https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}" kctl apply -f "${_agent_sandbox_base}/manifest.yaml" +kctl apply -f "${_agent_sandbox_base}/extensions.yaml" kctl wait --for=condition=Established crd/sandboxes.agents.x-k8s.io --timeout=120s +kctl wait --for=condition=Established crd/sandboxtemplates.extensions.agents.x-k8s.io --timeout=120s +kctl wait --for=condition=Established crd/sandboxwarmpools.extensions.agents.x-k8s.io --timeout=120s +kctl wait --for=condition=Established crd/sandboxclaims.extensions.agents.x-k8s.io --timeout=120s kctl -n agent-sandbox-system rollout status deployment/agent-sandbox-controller --timeout=300s helm_extra_args=() diff --git a/tasks/scripts/helm-k3s-local.sh b/tasks/scripts/helm-k3s-local.sh index 59d8939a6..6ed2a524f 100755 --- a/tasks/scripts/helm-k3s-local.sh +++ b/tasks/scripts/helm-k3s-local.sh @@ -143,6 +143,11 @@ apply_base_manifests() { local base="https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}" echo "Applying agent-sandbox manifest (${AGENT_SANDBOX_VERSION})..." kubectl --kubeconfig="${KUBECONFIG_TARGET}" apply -f "${base}/manifest.yaml" + # Warm-pool extensions (SandboxTemplate / SandboxWarmPool / SandboxClaim) so the + # cluster is ready for warm-pooled sandboxes. extensions.yaml reconfigures the + # existing agent-sandbox-controller deployment rather than adding a new one. + echo "Applying agent-sandbox warm-pool extensions (${AGENT_SANDBOX_VERSION})..." 
+ kubectl --kubeconfig="${KUBECONFIG_TARGET}" apply -f "${base}/extensions.yaml" } configure_ghcr_credentials() { From 9dd7e1ae32dee3133adef0fa6d0bc1d7fb896677 Mon Sep 17 00:00:00 2001 From: Roshni Malani Date: Mon, 8 Jun 2026 11:27:46 -0700 Subject: [PATCH 3/3] docs(skills): note warm-pool extensions in helm-dev-environment skill The local k3d bootstrap now also applies the agent-sandbox warm-pool extensions; reflect that in the helm-dev-environment skill description. Signed-off-by: Roshni Malani --- .agents/skills/helm-dev-environment/SKILL.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/.agents/skills/helm-dev-environment/SKILL.md b/.agents/skills/helm-dev-environment/SKILL.md index 58efbfef8..3c9dd3c73 100644 --- a/.agents/skills/helm-dev-environment/SKILL.md +++ b/.agents/skills/helm-dev-environment/SKILL.md @@ -26,10 +26,12 @@ mise run helm:k3s:create ``` Creates a k3d cluster and merges its kubeconfig into the worktree-local `kubeconfig` file. -Also applies the upstream agent-sandbox CRDs/controller (pinned via `AGENT_SANDBOX_VERSION` -in `tasks/scripts/helm-k3s-local.sh`, fetched from `github.com/kubernetes-sigs/agent-sandbox` -releases) and preloads the default community sandbox image into k3d so the first sandbox -create does not wait on a large registry pull. Traefik is disabled at cluster creation time. +Also applies the upstream agent-sandbox CRDs/controller plus the warm-pool extensions +(`SandboxTemplate` / `SandboxWarmPool` / `SandboxClaim`, from `extensions.yaml`) — both +pinned via `AGENT_SANDBOX_VERSION` in `tasks/scripts/helm-k3s-local.sh`, fetched from +`github.com/kubernetes-sigs/agent-sandbox` releases — and preloads the default community +sandbox image into k3d so the first sandbox create does not wait on a large registry pull. +Traefik is disabled at cluster creation time. **Multi-worktree support:** the cluster name is derived from the last component of the current git branch (e.g. 
branch `kube-support/local-dev/tmutch` → cluster