diff --git a/.agents/skills/helm-dev-environment/SKILL.md b/.agents/skills/helm-dev-environment/SKILL.md index 58efbfef8..3c9dd3c73 100644 --- a/.agents/skills/helm-dev-environment/SKILL.md +++ b/.agents/skills/helm-dev-environment/SKILL.md @@ -26,10 +26,12 @@ mise run helm:k3s:create ``` Creates a k3d cluster and merges its kubeconfig into the worktree-local `kubeconfig` file. -Also applies the upstream agent-sandbox CRDs/controller (pinned via `AGENT_SANDBOX_VERSION` -in `tasks/scripts/helm-k3s-local.sh`, fetched from `github.com/kubernetes-sigs/agent-sandbox` -releases) and preloads the default community sandbox image into k3d so the first sandbox -create does not wait on a large registry pull. Traefik is disabled at cluster creation time. +Also applies the upstream agent-sandbox CRDs/controller plus the warm-pool extensions +(`SandboxTemplate` / `SandboxWarmPool` / `SandboxClaim`, from `extensions.yaml`) — both +pinned via `AGENT_SANDBOX_VERSION` in `tasks/scripts/helm-k3s-local.sh`, fetched from +`github.com/kubernetes-sigs/agent-sandbox` releases — and preloads the default community +sandbox image into k3d so the first sandbox create does not wait on a large registry pull. +Traefik is disabled at cluster creation time. **Multi-worktree support:** the cluster name is derived from the last component of the current git branch (e.g. branch `kube-support/local-dev/tmutch` → cluster diff --git a/e2e/with-kube-gateway.sh b/e2e/with-kube-gateway.sh index 39ba899db..52da97b2d 100755 --- a/e2e/with-kube-gateway.sh +++ b/e2e/with-kube-gateway.sh @@ -533,10 +533,18 @@ fi # The Kubernetes compute driver creates and watches Sandbox CRs reconciled # by the upstream agent-sandbox-controller. Without the CRD + controller, # every gateway K8s call 404s and CreateSandbox never produces a Pod. -echo "Installing agent-sandbox CRDs and controller (${AGENT_SANDBOX_VERSION})..." +# The warm-pool extensions (SandboxTemplate / SandboxWarmPool / SandboxClaim) are +# applied alongside core so the e2e cluster matches the dev cluster and is ready +# for warm-pooled sandbox coverage. extensions.yaml reconfigures the +# agent-sandbox-controller deployment, so wait for the rollout after both applies. +echo "Installing agent-sandbox CRDs, controller, and warm-pool extensions (${AGENT_SANDBOX_VERSION})..." _agent_sandbox_base="https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}" kctl apply -f "${_agent_sandbox_base}/manifest.yaml" +kctl apply -f "${_agent_sandbox_base}/extensions.yaml" kctl wait --for=condition=Established crd/sandboxes.agents.x-k8s.io --timeout=120s +kctl wait --for=condition=Established crd/sandboxtemplates.extensions.agents.x-k8s.io --timeout=120s +kctl wait --for=condition=Established crd/sandboxwarmpools.extensions.agents.x-k8s.io --timeout=120s +kctl wait --for=condition=Established crd/sandboxclaims.extensions.agents.x-k8s.io --timeout=120s kctl -n agent-sandbox-system rollout status deployment/agent-sandbox-controller --timeout=300s helm_extra_args=() diff --git a/rfc/0005-warm-pooled-sandboxes/README.md b/rfc/0005-warm-pooled-sandboxes/README.md new file mode 100644 index 000000000..f3ebb4ab5 --- /dev/null +++ b/rfc/0005-warm-pooled-sandboxes/README.md @@ -0,0 +1,175 @@ +--- +authors: + - "@rmalani-nv" +state: review +links: + - https://github.com/NVIDIA/OpenShell/pull/1813 + - https://github.com/kubernetes-sigs/agent-sandbox/releases/tag/v0.4.6 + - https://github.com/kubernetes-sigs/agent-sandbox + - https://agent-sandbox.sigs.k8s.io/docs/ +--- + +# RFC 0005 - Warm-Pooled Sandboxes + +## Summary + +Add support for **warm-pooled sandboxes** on the Kubernetes compute driver by +adopting the upstream [agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) +warm-pool extension CRDs — `SandboxTemplate`, `SandboxWarmPool`, and +`SandboxClaim` (`extensions.agents.x-k8s.io/v1alpha1`). Instead of cold-starting +a `Sandbox` CR + Pod per request, the gateway claims a pre-provisioned, ready Pod +from a pool, cutting time-to-ready from seconds to milliseconds. The extensions +ship in the same `v0.4.6` release OpenShell already pins for the core `Sandbox` +CRD; OpenShell simply does not install or use them today. + +## Motivation + +Creating a Kubernetes sandbox today is a cold start: the gateway creates a +`Sandbox` CR, the agent-sandbox controller creates a Pod, the image is pulled (or +read from cache), the supervisor boots, and only then does the sandbox become +`Ready`. Measured locally this is ~4s+ even with the image preloaded. For +interactive agent workloads and high-churn "fresh sandbox per task" usage, that +latency dominates. A warm pool keeps N ready Pods standing by so a claim binds in +**~0.1s** (measured on a local spike). + +## Non-goals + +- Changing the default (cold) sandbox-create path. Warm pooling is additive and + opt-in; sandboxes that don't match a pool fall back to a cold create. +- GPU warm pools in the initial rollout (idle accelerators are expensive — opt-in + later, per pool). +- Migrating OpenShell's core `Sandbox` usage from `v1alpha1` to `v1beta1`. The + pinned `v0.4.6` release serves `v1alpha1` for both core and extensions; + upstream `main` (`v1beta1`, mutually-exclusive claim fields) is out of scope + until OpenShell bumps the pinned version. +- Multiplayer/non-Kubernetes drivers (Docker, Podman, VM) — warm pooling is a + Kubernetes-driver capability in this RFC. + +## Proposal + +### Extension CRDs (verified against v0.4.6) + +| CRD (`extensions.agents.x-k8s.io/v1alpha1`) | Role | +|---|---| +| `SandboxTemplate` | Reusable blueprint: `spec.podTemplate`, `spec.volumeClaimTemplates`, `spec.networkPolicy` | +| `SandboxWarmPool` | Keeps N Pods warm: `spec.replicas`, `spec.sandboxTemplateRef`; `status.{readyReplicas,replicas,selector}` (HPA-scalable) | +| `SandboxClaim` | Binds a warm Pod: `spec.sandboxTemplateRef` (required), `spec.warmpool`, `spec.additionalPodMetadata.{annotations,labels}`, `spec.env[]`, `spec.lifecycle`; `status.sandbox.{name,podIPs}` | + +A `SandboxWarmPool` pre-creates real `Sandbox` CRs from a `SandboxTemplate`; each +warm Pod is owned by a *controlling* `Sandbox` ownerReference. A `SandboxClaim` +binds one of those warm `Sandbox`/Pods and reports the bound `Sandbox` in +`status.sandbox.name`. The claimed Pod's owning `Sandbox` CR is in turn owned by +the `SandboxClaim` (controlling ownerReference) and labeled +`agents.x-k8s.io/claim-uid`. + +### Claim-based create flow + +The gateway pre-declares one or more `SandboxWarmPool`s (+ their +`SandboxTemplate`s), each carrying the **shared** OpenShell Pod configuration +(image, mTLS secret mount, projected SA-token volume, supervisor sideload, Linux +capabilities, host aliases, runtimeClass, resources, workspace +`volumeClaimTemplates`). On `CreateSandbox`, when the requested shape matches a +pool, the Kubernetes driver creates a `SandboxClaim` (instead of a `Sandbox`) +that injects the per-sandbox identity via +`additionalPodMetadata.annotations[openshell.io/sandbox-id]`, then watches the +claim and maps `status.sandbox.{name,podIPs}` + conditions to `SandboxPhase`. + +What bakes vs. late-binds: + +- **Baked into the shared `SandboxTemplate`:** everything generic across pooled + Pods (TLS, SA token, supervisor, caps, workspace VCT). +- **Injected per-claim (annotation only):** `openshell.io/sandbox-id`. Per-claim + `env[]` is **rejected on the warm path** (Pod env is immutable once running), so + identity must not ride Pod env. +- **Late-bound at runtime over the supervisor relay (already works):** policy, + providers. Sandbox identity is established by the existing token exchange — the + supervisor presents its projected SA token to `IssueSandboxToken`, and the + gateway resolves identity server-side. The supervisor's `--sandbox-id` is + optional (log-push/policy labeling only). + +### Identity re-anchoring (the one security-sensitive change) + +Today `validate_sandbox_owner_reference()` in +`crates/openshell-server/src/auth/k8s_sa.rs` authenticates a sandbox by +cross-checking the owning `Sandbox` CR's `openshell.ai/sandbox-id` label against +the Pod's `openshell.io/sandbox-id` annotation. On the warm path the pool +controller creates the `Sandbox` CR generically, so it carries +`agents.x-k8s.io/claim-uid` (+ a controlling `SandboxClaim` ownerReference) +instead of OpenShell's label. + +The check must therefore **re-anchor to the gateway-created `SandboxClaim`**: +resolve Pod → owning `Sandbox` CR → controlling `SandboxClaim` (name + uid) → +the sandbox-id the gateway recorded for that claim (gateway Store, keyed by +claim-uid), and verify the claim is bound (`status.sandbox.name` equals the +owning CR) and that its recorded sandbox-id equals the Pod annotation. This +preserves the existing invariant — *the sandbox-id a Pod can obtain equals a +value only the gateway wrote, on an object the sandbox workload cannot mutate*. +The sandbox ServiceAccount has no write access to `sandboxclaims` or Pods today +(confirmed on a live cluster), and the phase-2/phase-3 RBAC must preserve that. +The TokenReview, pod-UID, and ownerReference legs are unchanged. + +## Implementation plan + +Rollout is incremental; each phase is a separate, reviewable PR. The +security-sensitive auth change (phase 3) is gated behind `state:agent-ready`. + +1. **Install the extensions (this PR).** Apply `extensions.yaml` in the local + k3d dev script and the e2e kube harness so clusters are ready for warm + pooling. No gateway behavior change yet. +2. **Driver warm path (flagged).** When a sandbox maps to a configured pool, the + Kubernetes driver creates a `SandboxClaim` (template + warmpool + + `additionalPodMetadata.annotations[openshell.io/sandbox-id]`) instead of a + `Sandbox`; watch the claim and map `status` → `SandboxPhase`. Keep the + direct-`Sandbox` path as the cold fallback. Add gateway RBAC for + `extensions.agents.x-k8s.io` (`sandboxclaims`, `sandboxtemplates`, + `sandboxwarmpools`) in the Helm chart. +3. **Auth re-anchoring.** Adapt `validate_sandbox_owner_reference()` for the + claim-based identity check above; fail closed; extend the table-driven tests + in `k8s_sa.rs` with the spoof case (Pod annotation ≠ claim record → reject). +4. **Pool management.** Gateway declares/reconciles `SandboxTemplate` + + `SandboxWarmPool` from gateway config (one per template/image shape); sizing, + `replicas`, GC of drained pools. +5. **Surface + docs.** `gateway.toml` pool config (`docs/reference/gateway-config.mdx`), + CLI/TUI visibility, OCSF events, e2e coverage, published Kubernetes docs. + +## Risks + +- **Identity binding is security-sensitive.** Mishandled, a sandbox could + impersonate another sandbox-id. Mitigated by re-anchoring to the + gateway-created claim, failing closed, threat-model unit tests, an RBAC + assertion test, an adversarial security review, and OCSF detection findings on + mismatch. See phase 3. +- **Pool shape rigidity.** A pool is one (image, resources, runtimeClass, gpu) + shape; heterogeneous sandboxes need a pool each, and unmatched requests fall + back to cold. Warm pooling pays off most for the high-churn default image. +- **Idle cost.** Warm Pods consume resources while idle; GPU pools especially. + Sizing must be operator-controlled and default conservative. +- **Upstream API drift.** `v0.4.6` extensions are `v1alpha1`; `main` is `v1beta1` + with different claim semantics. Pin and bump deliberately. + +## Alternatives + +- **Patch identity onto the claimed Pod/`Sandbox` after bind** (keep the existing + label cross-check). Rejected: requires granting the gateway `patch pods` + (currently denied for immutability) and is racy. +- **Bare-Pod warm pools** (if upstream changes the pool to create Pods, not + `Sandbox` CRs — see upstream issue #390). Would break the ownerReference auth + chain and force a larger rework. The pinned `v0.4.6` creates `Sandbox` CRs. +- **Do nothing.** Accept cold-start latency. Viable for low-churn usage but poor + for interactive agents. + +## Prior art + +Upstream agent-sandbox documents warm pooling end to end, including an +HPA-driven autoscaling example keyed on `agent_sandbox_claim_creation_total` and +the `SandboxWarmPool` `status.selector`. OpenShell already builds on the core +`Sandbox` CRD from the same project. + +## Open questions + +- Sandbox-id delivery to the supervisor on the warm path: rely solely on the + gateway JWT, or add a Downward API volume projecting the claim-injected + annotation for log-push/policy labeling? +- Workspace PVC semantics for pooled `Sandbox`es (each warm `Sandbox` seeds its + own PVC from the image — confirm under the `volumeClaimTemplates` path). +- Pool sizing / autoscaling policy and config surface in `gateway.toml`. diff --git a/tasks/scripts/helm-k3s-local.sh b/tasks/scripts/helm-k3s-local.sh index 59d8939a6..6ed2a524f 100755 --- a/tasks/scripts/helm-k3s-local.sh +++ b/tasks/scripts/helm-k3s-local.sh @@ -143,6 +143,11 @@ apply_base_manifests() { local base="https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}" echo "Applying agent-sandbox manifest (${AGENT_SANDBOX_VERSION})..." kubectl --kubeconfig="${KUBECONFIG_TARGET}" apply -f "${base}/manifest.yaml" + # Warm-pool extensions (SandboxTemplate / SandboxWarmPool / SandboxClaim) so the + # cluster is ready for warm-pooled sandboxes. extensions.yaml reconfigures the + # existing agent-sandbox-controller deployment rather than adding a new one. + echo "Applying agent-sandbox warm-pool extensions (${AGENT_SANDBOX_VERSION})..." + kubectl --kubeconfig="${KUBECONFIG_TARGET}" apply -f "${base}/extensions.yaml" } configure_ghcr_credentials() {