10 changes: 6 additions & 4 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -26,10 +26,12 @@ mise run helm:k3s:create
```

Creates a k3d cluster and merges its kubeconfig into the worktree-local `kubeconfig` file.
-Also applies the upstream agent-sandbox CRDs/controller (pinned via `AGENT_SANDBOX_VERSION`
-in `tasks/scripts/helm-k3s-local.sh`, fetched from `github.com/kubernetes-sigs/agent-sandbox`
-releases) and preloads the default community sandbox image into k3d so the first sandbox
-create does not wait on a large registry pull. Traefik is disabled at cluster creation time.
Also applies the upstream agent-sandbox CRDs/controller plus the warm-pool extensions
(`SandboxTemplate` / `SandboxWarmPool` / `SandboxClaim`, from `extensions.yaml`) — both
pinned via `AGENT_SANDBOX_VERSION` in `tasks/scripts/helm-k3s-local.sh`, fetched from
`github.com/kubernetes-sigs/agent-sandbox` releases — and preloads the default community
sandbox image into k3d so the first sandbox create does not wait on a large registry pull.
Traefik is disabled at cluster creation time.

**Multi-worktree support:** the cluster name is derived from the last component of the
current git branch (e.g. branch `kube-support/local-dev/tmutch` → cluster
10 changes: 9 additions & 1 deletion e2e/with-kube-gateway.sh
@@ -533,10 +533,18 @@ fi
# The Kubernetes compute driver creates and watches Sandbox CRs reconciled
# by the upstream agent-sandbox-controller. Without the CRD + controller,
# every gateway K8s call 404s and CreateSandbox never produces a Pod.
-echo "Installing agent-sandbox CRDs and controller (${AGENT_SANDBOX_VERSION})..."
# The warm-pool extensions (SandboxTemplate / SandboxWarmPool / SandboxClaim) are
# applied alongside core so the e2e cluster matches the dev cluster and is ready
# for warm-pooled sandbox coverage. extensions.yaml reconfigures the
# agent-sandbox-controller deployment, so wait for the rollout after both applies.
echo "Installing agent-sandbox CRDs, controller, and warm-pool extensions (${AGENT_SANDBOX_VERSION})..."
_agent_sandbox_base="https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}"
kctl apply -f "${_agent_sandbox_base}/manifest.yaml"
kctl apply -f "${_agent_sandbox_base}/extensions.yaml"
kctl wait --for=condition=Established crd/sandboxes.agents.x-k8s.io --timeout=120s
kctl wait --for=condition=Established crd/sandboxtemplates.extensions.agents.x-k8s.io --timeout=120s
kctl wait --for=condition=Established crd/sandboxwarmpools.extensions.agents.x-k8s.io --timeout=120s
kctl wait --for=condition=Established crd/sandboxclaims.extensions.agents.x-k8s.io --timeout=120s
kctl -n agent-sandbox-system rollout status deployment/agent-sandbox-controller --timeout=300s

helm_extra_args=()
175 changes: 175 additions & 0 deletions rfc/0005-warm-pooled-sandboxes/README.md
@@ -0,0 +1,175 @@
---
authors:
- "@rmalani-nv"
state: review
links:
- https://github.com/NVIDIA/OpenShell/pull/1813
- https://github.com/kubernetes-sigs/agent-sandbox/releases/tag/v0.4.6
- https://github.com/kubernetes-sigs/agent-sandbox
- https://agent-sandbox.sigs.k8s.io/docs/
---

# RFC 0005 - Warm-Pooled Sandboxes

## Summary

Add support for **warm-pooled sandboxes** on the Kubernetes compute driver by
adopting the upstream [agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox)
warm-pool extension CRDs — `SandboxTemplate`, `SandboxWarmPool`, and
`SandboxClaim` (`extensions.agents.x-k8s.io/v1alpha1`). Instead of cold-starting
a `Sandbox` CR + Pod per request, the gateway claims a pre-provisioned, ready Pod
from a pool, cutting time-to-ready from several seconds to about a tenth of a second. The extensions
ship in the same `v0.4.6` release OpenShell already pins for the core `Sandbox`
CRD; OpenShell simply does not install or use them today.

## Motivation

Creating a Kubernetes sandbox today is a cold start: the gateway creates a
`Sandbox` CR, the agent-sandbox controller creates a Pod, the image is pulled (or
read from cache), the supervisor boots, and only then does the sandbox become
`Ready`. Measured locally, this takes about 4s or more even with the image preloaded. For
interactive agent workloads and high-churn "fresh sandbox per task" usage, that
latency dominates. A warm pool keeps N ready Pods standing by so a claim binds in
**~0.1s** (measured on a local spike).

## Non-goals

- Changing the default (cold) sandbox-create path. Warm pooling is additive and
opt-in; sandboxes that don't match a pool fall back to a cold create.
- GPU warm pools in the initial rollout (idle accelerators are expensive — opt-in
later, per pool).
- Migrating OpenShell's core `Sandbox` usage from `v1alpha1` to `v1beta1`. The
pinned `v0.4.6` release serves `v1alpha1` for both core and extensions;
upstream `main` (`v1beta1`, mutually-exclusive claim fields) is out of scope
until OpenShell bumps the pinned version.
- Multiplayer/non-Kubernetes drivers (Docker, Podman, VM) — warm pooling is a
Kubernetes-driver capability in this RFC.

## Proposal

### Extension CRDs (verified against v0.4.6)

| CRD (`extensions.agents.x-k8s.io/v1alpha1`) | Role |
|---|---|
| `SandboxTemplate` | Reusable blueprint: `spec.podTemplate`, `spec.volumeClaimTemplates`, `spec.networkPolicy` |
| `SandboxWarmPool` | Keeps N Pods warm: `spec.replicas`, `spec.sandboxTemplateRef`; `status.{readyReplicas,replicas,selector}` (HPA-scalable) |
| `SandboxClaim` | Binds a warm Pod: `spec.sandboxTemplateRef` (required), `spec.warmpool`, `spec.additionalPodMetadata.{annotations,labels}`, `spec.env[]`, `spec.lifecycle`; `status.sandbox.{name,podIPs}` |

A `SandboxWarmPool` pre-creates real `Sandbox` CRs from a `SandboxTemplate`; each
warm Pod is owned by a *controlling* `Sandbox` ownerReference. A `SandboxClaim`
binds one of those warm `Sandbox`/Pods and reports the bound `Sandbox` in
`status.sandbox.name`. The claimed Pod's owning `Sandbox` CR is in turn owned by
the `SandboxClaim` (controlling ownerReference) and labeled
`agents.x-k8s.io/claim-uid`.

### Claim-based create flow

The gateway pre-declares one or more `SandboxWarmPool`s (+ their
`SandboxTemplate`s), each carrying the **shared** OpenShell Pod configuration
(image, mTLS secret mount, projected SA-token volume, supervisor sideload, Linux
capabilities, host aliases, runtimeClass, resources, workspace
`volumeClaimTemplates`). On `CreateSandbox`, when the requested shape matches a
pool, the Kubernetes driver creates a `SandboxClaim` (instead of a `Sandbox`)
that injects the per-sandbox identity via
`additionalPodMetadata.annotations[openshell.io/sandbox-id]`, then watches the
claim and maps `status.sandbox.{name,podIPs}` + conditions to `SandboxPhase`.
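
As a rough illustration of the claim described above, the sketch below renders a `SandboxClaim` manifest to stdout. The helper name `render_sandbox_claim`, the `name:` sub-field under `sandboxTemplateRef`, and treating `warmpool` as a plain pool name are assumptions made for illustration; they have not been checked against the `v0.4.6` schema.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OpenShell): renders a SandboxClaim for the
# warm path. The layout under spec is assumed from the field names in the
# CRD table; only the annotation carries per-sandbox identity.
render_sandbox_claim() {
  local claim_name="$1" template="$2" pool="$3" sandbox_id="$4"
  cat <<EOF
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: ${claim_name}
spec:
  sandboxTemplateRef:
    name: ${template}
  warmpool: ${pool}
  additionalPodMetadata:
    annotations:
      openshell.io/sandbox-id: "${sandbox_id}"
EOF
}
```

Piping the output to `kubectl apply -f -` would submit the claim. No `env[]` entries are rendered, in line with the rule that identity must not ride Pod env on the warm path.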

What is baked in versus late-bound:

- **Baked into the shared `SandboxTemplate`:** everything generic across pooled
Pods (TLS, SA token, supervisor, caps, workspace VCT).
- **Injected per-claim (annotation only):** `openshell.io/sandbox-id`. Per-claim
`env[]` is **rejected on the warm path** (Pod env is immutable once running), so
identity must not ride Pod env.
- **Late-bound at runtime over the supervisor relay (already works):** policy,
providers. Sandbox identity is established by the existing token exchange — the
supervisor presents its projected SA token to `IssueSandboxToken`, and the
gateway resolves identity server-side. The supervisor's `--sandbox-id` is
optional (log-push/policy labeling only).

### Identity re-anchoring (the one security-sensitive change)

Today `validate_sandbox_owner_reference()` in
`crates/openshell-server/src/auth/k8s_sa.rs` authenticates a sandbox by
cross-checking the owning `Sandbox` CR's `openshell.ai/sandbox-id` label against
the Pod's `openshell.io/sandbox-id` annotation. On the warm path the pool
controller creates the `Sandbox` CR generically, so it carries
`agents.x-k8s.io/claim-uid` (+ a controlling `SandboxClaim` ownerReference)
instead of OpenShell's label.

The check must therefore **re-anchor to the gateway-created `SandboxClaim`**:
resolve Pod → owning `Sandbox` CR → controlling `SandboxClaim` (name + uid) →
the sandbox-id the gateway recorded for that claim (gateway Store, keyed by
claim-uid), and verify the claim is bound (`status.sandbox.name` equals the
owning CR) and that its recorded sandbox-id equals the Pod annotation. This
preserves the existing invariant — *the sandbox-id a Pod can obtain equals a
value only the gateway wrote, on an object the sandbox workload cannot mutate*.
The sandbox ServiceAccount has no write access to `sandboxclaims` or Pods today
(confirmed on a live cluster), and the phase-2/phase-3 RBAC must preserve that.
The TokenReview, pod-UID, and ownerReference legs are unchanged.
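
The decision logic of the re-anchored check can be sketched as a pure function over values that have already been resolved. This is only an illustration under stated assumptions: the real check lives in Rust in `k8s_sa.rs`, the function name `validate_claim_identity` is hypothetical, and the Kubernetes API and gateway Store lookups that produce the four inputs are not shown.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the claim-anchored identity check. Fails closed.
#   $1  Pod annotation openshell.io/sandbox-id
#   $2  name of the Sandbox CR that owns the Pod
#   $3  status.sandbox.name reported by the controlling SandboxClaim
#   $4  sandbox-id the gateway recorded for that claim (keyed by claim-uid)
validate_claim_identity() {
  local pod_annotation="${1:-}" owning_sandbox="${2:-}"
  local claim_bound_sandbox="${3:-}" recorded_id="${4:-}"

  # Any unresolved link in the chain rejects; nothing is trusted by default.
  [ -n "$pod_annotation" ] || return 1
  [ -n "$owning_sandbox" ] || return 1
  [ -n "$claim_bound_sandbox" ] || return 1
  [ -n "$recorded_id" ] || return 1

  # The claim must be bound to the Sandbox CR that owns this Pod.
  [ "$claim_bound_sandbox" = "$owning_sandbox" ] || return 1

  # The Pod annotation must equal the value only the gateway wrote.
  [ "$recorded_id" = "$pod_annotation" ] || return 1

  return 0
}
```

For example, `validate_claim_identity sb-1 sandbox-a sandbox-a sb-1` succeeds, while a spoofed annotation (`sb-2` against a recorded `sb-1`) or an unbound claim returns non-zero.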

## Implementation plan

Rollout is incremental; each phase is a separate, reviewable PR. The
security-sensitive auth change (phase 3) is gated behind `state:agent-ready`.

1. **Install the extensions (this PR).** Apply `extensions.yaml` in the local
k3d dev script and the e2e kube harness so clusters are ready for warm
pooling. No gateway behavior change yet.
2. **Driver warm path (flagged).** When a sandbox maps to a configured pool, the
Kubernetes driver creates a `SandboxClaim` (template + warmpool +
`additionalPodMetadata.annotations[openshell.io/sandbox-id]`) instead of a
`Sandbox`; watch the claim and map `status` to `SandboxPhase`. Keep the
direct-`Sandbox` path as the cold fallback. Add gateway RBAC for
`extensions.agents.x-k8s.io` (`sandboxclaims`, `sandboxtemplates`,
`sandboxwarmpools`) in the Helm chart.
3. **Auth re-anchoring.** Adapt `validate_sandbox_owner_reference()` for the
claim-based identity check above; fail closed; extend the table-driven tests
in `k8s_sa.rs` with the spoof case (Pod annotation ≠ claim record → reject).
4. **Pool management.** Gateway declares/reconciles `SandboxTemplate` +
`SandboxWarmPool` from gateway config (one per template/image shape); sizing,
`replicas`, GC of drained pools.
5. **Surface + docs.** `gateway.toml` pool config (`docs/reference/gateway-config.mdx`),
CLI/TUI visibility, OCSF events, e2e coverage, published Kubernetes docs.
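
One way to confirm that phase 1 has taken effect on a cluster is to wait for the three extension CRDs, mirroring the `kubectl wait` calls in the e2e harness. The wrapper name `wait_for_extension_crds` and the `KUBECTL` override are assumptions added so the loop can be dry-run; they are not part of the repository scripts.

```shell
#!/usr/bin/env bash
# Hypothetical check: wait until the warm-pool extension CRDs are Established.
# KUBECTL can be overridden (for example KUBECTL=echo) to print the commands
# instead of running them against a cluster.
wait_for_extension_crds() {
  local kubectl="${KUBECTL:-kubectl}" crd
  for crd in sandboxtemplates sandboxwarmpools sandboxclaims; do
    "$kubectl" wait --for=condition=Established \
      "crd/${crd}.extensions.agents.x-k8s.io" --timeout=120s || return 1
  done
}
```

The function stops at the first CRD that fails to become Established, so a non-zero exit means the cluster is not ready for warm pooling.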

## Risks

- **Identity binding is security-sensitive.** Mishandled, a sandbox could
impersonate another sandbox-id. Mitigated by re-anchoring to the
gateway-created claim, failing closed, threat-model unit tests, an RBAC
assertion test, an adversarial security review, and OCSF detection findings on
mismatch. See phase 3.
- **Pool shape rigidity.** A pool is one (image, resources, runtimeClass, gpu)
shape; heterogeneous sandboxes need a pool each, and unmatched requests fall
back to cold. Warm pooling pays off most for the high-churn default image.
- **Idle cost.** Warm Pods consume resources while idle; GPU pools especially.
Sizing must be operator-controlled and default conservative.
- **Upstream API drift.** `v0.4.6` extensions are `v1alpha1`; `main` is `v1beta1`
with different claim semantics. Pin and bump deliberately.
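
The shape-rigidity trade-off can be illustrated with a toy selector that maps a requested shape to a pool and otherwise falls back to a cold create. Everything here is hypothetical: the function name `choose_create_path`, the single hard-coded pool entry, and the image reference `example.invalid/sandbox:default` are placeholders, and a real driver would read its pools from gateway config.

```shell
#!/usr/bin/env bash
# Hypothetical pool selector: one pool per (image, resources, runtimeClass, gpu)
# shape. Anything that does not match exactly takes the cold path.
choose_create_path() {
  local shape="$1|$2|$3|$4"
  case "$shape" in
    'example.invalid/sandbox:default|small|default|0')
      echo "warm:default-pool" ;;
    *)
      echo "cold" ;;
  esac
}
```

Because the match is exact, changing any one dimension (for instance requesting a GPU) sends the request down the cold path, which is why heterogeneous workloads need a pool per shape.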

## Alternatives

- **Patch identity onto the claimed Pod/`Sandbox` after bind** (keep the existing
label cross-check). Rejected: requires granting the gateway `patch pods`
(currently denied for immutability) and is racy.
- **Bare-Pod warm pools** (if upstream changes the pool to create Pods, not
`Sandbox` CRs — see upstream issue #390). Would break the ownerReference auth
chain and force a larger rework. The pinned `v0.4.6` creates `Sandbox` CRs.
- **Do nothing.** Accept cold-start latency. Viable for low-churn usage but poor
for interactive agents.

## Prior art

Upstream agent-sandbox documents warm pooling end to end, including an
HPA-driven autoscaling example keyed on `agent_sandbox_claim_creation_total` and
the `SandboxWarmPool` `status.selector`. OpenShell already builds on the core
`Sandbox` CRD from the same project.

## Open questions

- Sandbox-id delivery to the supervisor on the warm path: rely solely on the
gateway JWT, or add a Downward API volume projecting the claim-injected
annotation for log-push/policy labeling?
- Workspace PVC semantics for pooled `Sandbox`es (each warm `Sandbox` seeds its
own PVC from the image — confirm under the `volumeClaimTemplates` path).
- Pool sizing / autoscaling policy and config surface in `gateway.toml`.
5 changes: 5 additions & 0 deletions tasks/scripts/helm-k3s-local.sh
@@ -143,6 +143,11 @@ apply_base_manifests() {
local base="https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}"
echo "Applying agent-sandbox manifest (${AGENT_SANDBOX_VERSION})..."
kubectl --kubeconfig="${KUBECONFIG_TARGET}" apply -f "${base}/manifest.yaml"
# Warm-pool extensions (SandboxTemplate / SandboxWarmPool / SandboxClaim) so the
# cluster is ready for warm-pooled sandboxes. extensions.yaml reconfigures the
# existing agent-sandbox-controller deployment rather than adding a new one.
echo "Applying agent-sandbox warm-pool extensions (${AGENT_SANDBOX_VERSION})..."
kubectl --kubeconfig="${KUBECONFIG_TARGET}" apply -f "${base}/extensions.yaml"
}

configure_ghcr_credentials() {