diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md index b65bf26d2..46c1a3783 100644 --- a/.agents/skills/debug-openshell-cluster/SKILL.md +++ b/.agents/skills/debug-openshell-cluster/SKILL.md @@ -181,6 +181,21 @@ helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes. +Check the sandbox supervisor topology rendered into gateway config: + +```bash +kubectl -n openshell get configmap openshell-config -o jsonpath='{.data.gateway\.toml}' | grep -E 'supervisor_role|network_enforcement_mode|enforcer_endpoint|privileged' +``` + +Expected Kubernetes default is `supervisor_role = "workload"` and +`network_enforcement_mode = "soft-proxy"`. This starts unprivileged sandbox +pods and logs that direct socket egress is not kernel-blocked. Use +`network_enforcement_mode = "supervisor-netns"` only when the sandbox pod has +the required capabilities or `server.sandboxPrivileged=true`. Use +`network_enforcement_mode = "external-enforcer"` with `nodeEnforcer.enabled=true` +to test node-enforcer enforcement; the enforcer should log workload +registration and successful sandbox network egress enforcement installation. + For local/external pull mode (the default local path via `mise run cluster`), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the `registries.yaml` mirror endpoint. The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`). Gateway image builds stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. 
If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`, is copied into the staged workspace there. @@ -206,6 +221,18 @@ kubectl -n openshell get svc openshell -o wide kubectl -n openshell get endpoints openshell ``` +When the gateway is exposed through Envoy Gateway, deployment infrastructure may +need a `BackendTrafficPolicy` to disable Envoy's request and stream duration +timeouts for OpenShell's long-lived gRPC streams. A missing or rejected policy +commonly shows up as CLI failures around 15 seconds with `h2 protocol error: +error reading a body from connection`, especially on `sandbox create -- `, +upload/download, sync, `WatchSandbox`, `ForwardTcp`, and `RelayStream` paths. + +```bash +kubectl -n openshell get backendtrafficpolicy openshell-grpc-streams -o yaml +kubectl -n openshell get grpcroute openshell -o yaml +``` + For local port-forward testing: ```bash diff --git a/Cargo.lock b/Cargo.lock index d0cd77f85..401a6e039 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3686,6 +3686,7 @@ dependencies = [ "tracing", "tracing-appender", "tracing-subscriber", + "url", "uuid", "webpki-roots 1.0.7", ] diff --git a/architecture/compute-runtimes.md b/architecture/compute-runtimes.md index b70a2fccc..d2c795332 100644 --- a/architecture/compute-runtimes.md +++ b/architecture/compute-runtimes.md @@ -87,6 +87,28 @@ runtime still owns GPU device injection. ## Deployment Shape Kubernetes deployments use the Helm chart under `deploy/helm/openshell`. +The Kubernetes driver can set a default `runtimeClassName` for sandbox pods, +for example `gvisor` or a Kata Containers RuntimeClass, while preserving +per-sandbox template overrides. 
When a default RuntimeClass is configured, the +Kubernetes driver validates its existence at startup so missing cluster runtime +support fails before any sandbox pods are requested. Per-sandbox RuntimeClass +overrides are validated during sandbox admission/create because they are not +known at gateway startup. The Kubernetes driver can also set +`securityContext.privileged` on all sandbox pod containers as a deployment-wide, +short-term compatibility escape hatch for clusters that require privileged pod +admission; this weakens the container boundary and is not a replacement for a +stronger runtime isolation model. Kubernetes deployments also select an +explicit supervisor/network topology. The default is +`supervisor_role = "workload"` with `network_enforcement_mode = "soft-proxy"`, +which keeps sandbox pods unprivileged and relies on the proxy for cooperative +traffic while reporting that direct sockets are not kernel-blocked. The existing +hard supervisor-managed netns/veth/nft path remains available through +`network_enforcement_mode = "supervisor-netns"`. The experimental +`external-enforcer` mode registers workload supervisors with a privileged +node-side enforcer DaemonSet, which enters the pod network namespace and +installs coarse nftables egress rules so non-root sandbox processes must use the +proxy. Dynamic endpoint, binary, and L7 policy remains inside the workload +proxy. Standalone local deployments start the gateway with a selected runtime such as Docker, Podman, or VM. The CLI can register multiple gateways and switch between them without changing the sandbox architecture. 
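The RuntimeClass precedence described in this hunk (a per-sandbox template override wins over the driver default, and values are trimmed before use) can be sketched as below. This is a minimal illustration and not code from the patch: `non_empty` and `effective_runtime_class` are hypothetical helper names, and treating a whitespace-only override as "not set" is an assumption drawn from how the patch trims `platform_config` strings.

```rust
/// Returns the trimmed value, or `None` when nothing is left after trimming.
/// Hypothetical helper, not part of the patch.
fn non_empty(value: &str) -> Option<&str> {
    let trimmed = value.trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed)
    }
}

/// Picks the RuntimeClass for a sandbox pod.
///
/// Assumed precedence: a non-empty per-sandbox override wins, otherwise the
/// driver default applies, otherwise `runtimeClassName` is left unset.
fn effective_runtime_class<'a>(
    driver_default: &'a str,
    sandbox_override: Option<&'a str>,
) -> Option<&'a str> {
    sandbox_override
        .and_then(non_empty)
        .or_else(|| non_empty(driver_default))
}
```

Under these assumptions, `effective_runtime_class("gvisor", Some(" kata-qemu "))` yields `Some("kata-qemu")`, and an empty default with no override yields `None`, which matches the patch skipping validation when the configured name is empty.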
diff --git a/architecture/sandbox.md b/architecture/sandbox.md index 4bc6803eb..8f34aa7c4 100644 --- a/architecture/sandbox.md +++ b/architecture/sandbox.md @@ -12,6 +12,7 @@ Each sandbox workload has two trust levels: |---|---| | Supervisor | Starts as root inside the workload, prepares isolation, runs the proxy, fetches config, injects credentials, serves the relay socket, and launches child processes. | | Agent child | Runs as an unprivileged user with filesystem, process, and network restrictions applied. | +| Node enforcer | Optional privileged host/node-side supervisor role for Kubernetes. It accepts workload registrations and installs coarse pod-netns egress rules while policy decisions stay in the workload proxy. | The supervisor keeps enough privilege to manage the sandbox, but the agent child loses that privilege before user code runs. @@ -41,6 +42,13 @@ OpenShell uses overlapping controls rather than a single sandbox primitive: | Network namespace | Forces ordinary agent egress through the local CONNECT proxy. | | Policy proxy | Evaluates destination, binary identity, TLS/L7 rules, SSRF checks, and inference interception. | +The supervisor resolves an explicit network enforcement mode at startup. +`combined`/`supervisor-netns` preserves the local hard netns path. Kubernetes +defaults to `workload`/`soft-proxy`, which keeps the pod unprivileged and +reports that direct sockets are not kernel-blocked. `external-enforcer` +delegates the coarse direct-egress boundary to a host/node component while +leaving dynamic endpoint policy in the proxy. + The supervisor may enrich baseline filesystem allowances for runtime-required paths, such as proxy support files or GPU device paths when a GPU is present. 
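How the enforcement mode translates into container capabilities can be sketched as follows. The sketch mirrors the selection logic that this patch adds to the Kubernetes driver, but it is a simplified standalone illustration: `Role`, `Mode`, and `added_capabilities` are stand-in names rather than the real types, and the order in which the user-namespace capabilities are appended is an assumption.

```rust
/// Stand-in for `SupervisorRole` (illustrative only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Workload,
    Enforcer,
    Combined,
}

/// Stand-in for `NetworkEnforcementMode` (illustrative only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Auto,
    SoftProxy,
    SupervisorNetns,
    ExternalEnforcer,
}

/// The hard netns path applies for `supervisor-netns`, or for `auto` when the
/// supervisor runs in the `combined` role.
fn uses_supervisor_netns(role: Role, mode: Mode) -> bool {
    mode == Mode::SupervisorNetns || (mode == Mode::Auto && role == Role::Combined)
}

/// Capabilities added to the sandbox container for a given topology.
fn added_capabilities(role: Role, mode: Mode, user_namespaces: bool) -> Vec<&'static str> {
    let mut caps = if uses_supervisor_netns(role, mode) {
        vec!["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "SYSLOG"]
    } else {
        // Soft-proxy and external-enforcer pods only need ptrace.
        vec!["SYS_PTRACE"]
    };
    if user_namespaces {
        // Needed so the supervisor can drop privileges inside a user namespace.
        caps.extend(["SETUID", "SETGID", "DAC_READ_SEARCH"]);
    }
    caps
}
```

With the Kubernetes default (`Role::Workload`, `Mode::SoftProxy`, no user namespaces) this returns only `SYS_PTRACE`, which is what keeps the default pod unprivileged.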
diff --git a/crates/openshell-cli/src/main.rs b/crates/openshell-cli/src/main.rs index ac2b50392..062214c23 100644 --- a/crates/openshell-cli/src/main.rs +++ b/crates/openshell-cli/src/main.rs @@ -4387,6 +4387,16 @@ mod tests { } } + #[test] + fn sandbox_create_rejects_privileged_flag() { + let err = Cli::try_parse_from(["openshell", "sandbox", "create", "--privileged"]) + .expect_err("privileged must not be a per-sandbox CLI flag"); + assert!( + err.to_string().contains("--privileged"), + "error should identify the rejected flag" + ); + } + #[test] fn service_expose_accepts_positional_target_port_and_service() { let cli = Cli::try_parse_from([ diff --git a/crates/openshell-core/src/sandbox_env.rs b/crates/openshell-core/src/sandbox_env.rs index 1059c0d08..70ad471da 100644 --- a/crates/openshell-core/src/sandbox_env.rs +++ b/crates/openshell-core/src/sandbox_env.rs @@ -63,3 +63,138 @@ pub const USER_ENVIRONMENT: &str = "OPENSHELL_USER_ENVIRONMENT"; /// writes and rotates this file; the supervisor exchanges its contents /// for a gateway JWT at startup and on refresh. pub const K8S_SA_TOKEN_FILE: &str = "OPENSHELL_K8S_SA_TOKEN_FILE"; + +/// Runtime role selected for the sandbox supervisor binary. +pub const SUPERVISOR_ROLE: &str = "OPENSHELL_SUPERVISOR_ROLE"; + +/// Network enforcement mode selected for the sandbox supervisor binary. +pub const NETWORK_ENFORCEMENT_MODE: &str = "OPENSHELL_NETWORK_ENFORCEMENT_MODE"; + +/// Endpoint for an external node/host enforcer. +pub const ENFORCER_ENDPOINT: &str = "OPENSHELL_ENFORCER_ENDPOINT"; + +/// Node IP injected by Kubernetes when an external node enforcer is used. +pub const NODE_IP: &str = "OPENSHELL_NODE_IP"; + +/// Pod IP injected by Kubernetes for node-enforcer registration. 
+pub const POD_IP: &str = "OPENSHELL_POD_IP"; + +#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, serde::Serialize, serde::Deserialize)] +#[serde(rename_all = "kebab-case")] +pub enum SupervisorRole { + /// Runs inside the sandbox/container and owns workload lifecycle. + Workload, + /// Runs as a privileged host/node-side enforcement component. + Enforcer, + /// Current local-style topology: one supervisor owns lifecycle and hard controls. + #[default] + Combined, +} + +impl SupervisorRole { + #[must_use] + pub const fn as_str(self) -> &'static str { + match self { + Self::Workload => "workload", + Self::Enforcer => "enforcer", + Self::Combined => "combined", + } + } +} + +impl std::fmt::Display for SupervisorRole { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + f.write_str(self.as_str()) + } +} + +impl std::str::FromStr for SupervisorRole { + type Err = String; + + fn from_str(value: &str) -> Result<Self, Self::Err> { + match value.trim().to_ascii_lowercase().as_str() { + "workload" => Ok(Self::Workload), + "enforcer" => Ok(Self::Enforcer), + "combined" => Ok(Self::Combined), + other => Err(format!( + "unknown supervisor role '{other}'; expected 'workload', 'enforcer', or 'combined'" + )), + } + } +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, serde::Serialize, serde::Deserialize)] +#[serde(rename_all = "kebab-case")] +pub enum NetworkEnforcementMode { + /// Resolve from the supervisor role and runtime hints. + #[default] + Auto, + /// Cooperative proxy environment only; direct sockets are not kernel-blocked. + SoftProxy, + /// Supervisor-managed netns/veth/nft enforcement. + SupervisorNetns, + /// Enforcement delegated to a node/host enforcer.
+ ExternalEnforcer, +} + +impl NetworkEnforcementMode { + #[must_use] + pub const fn as_str(self) -> &'static str { + match self { + Self::Auto => "auto", + Self::SoftProxy => "soft-proxy", + Self::SupervisorNetns => "supervisor-netns", + Self::ExternalEnforcer => "external-enforcer", + } + } +} + +impl std::fmt::Display for NetworkEnforcementMode { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + f.write_str(self.as_str()) + } +} + +impl std::str::FromStr for NetworkEnforcementMode { + type Err = String; + + fn from_str(value: &str) -> Result<Self, Self::Err> { + match value.trim().to_ascii_lowercase().as_str() { + "auto" => Ok(Self::Auto), + "soft-proxy" => Ok(Self::SoftProxy), + "supervisor-netns" => Ok(Self::SupervisorNetns), + "external-enforcer" => Ok(Self::ExternalEnforcer), + other => Err(format!( + "unknown network enforcement mode '{other}'; expected 'auto', 'soft-proxy', 'supervisor-netns', or 'external-enforcer'" + )), + } + } +} + +#[cfg(test)] +mod tests { + use super::{NetworkEnforcementMode, SupervisorRole}; + + #[test] + fn supervisor_role_round_trips_kebab_case() { + assert_eq!("workload".parse(), Ok(SupervisorRole::Workload)); + assert_eq!(SupervisorRole::Enforcer.to_string(), "enforcer"); + assert_eq!( + serde_json::to_value(SupervisorRole::Combined).unwrap(), + serde_json::json!("combined") + ); + } + + #[test] + fn network_enforcement_mode_round_trips_kebab_case() { + assert_eq!("soft-proxy".parse(), Ok(NetworkEnforcementMode::SoftProxy)); + assert_eq!( + NetworkEnforcementMode::ExternalEnforcer.to_string(), + "external-enforcer" + ); + assert_eq!( + serde_json::to_value(NetworkEnforcementMode::SupervisorNetns).unwrap(), + serde_json::json!("supervisor-netns") + ); + } +} diff --git a/crates/openshell-driver-kubernetes/README.md b/crates/openshell-driver-kubernetes/README.md index 0bdcf3748..dda758cca 100644 --- a/crates/openshell-driver-kubernetes/README.md +++ b/crates/openshell-driver-kubernetes/README.md @@ -13,6 +13,30 @@ this
driver. Kubernetes owns scheduling and pod lifecycle. The `openshell-sandbox` supervisor inside each workload owns agent isolation, credential injection, policy polling, logs, and the gateway relay. +Set `default_runtime_class_name` in the driver config to assign a default Kubernetes +RuntimeClass, such as `gvisor` or a Kata Containers RuntimeClass, to sandbox +pods. Per-sandbox template `runtime_class_name` values override the driver +default. When `default_runtime_class_name` is configured, the driver validates +that the cluster has that RuntimeClass during startup so a missing runtime fails +fast instead of surfacing later as pod sandbox creation errors. Per-sandbox +RuntimeClass overrides are validated during sandbox +admission/create. As a short-term compatibility escape hatch, the driver can set +`privileged = true` deployment-wide; the driver maps that to +`podTemplate.spec.containers[0].securityContext.privileged` for all sandbox pod +containers. Use it only for trusted clusters that require privileged pod +admission because it weakens the container boundary. + +Kubernetes deployments default to `supervisor_role = "workload"` and +`network_enforcement_mode = "soft-proxy"`. In this mode the supervisor runs the +proxy, policy reload, relay, and agent lifecycle without creating a Linux +network namespace; proxy-aware traffic is enforced, but direct socket egress is +not kernel-blocked. Set `network_enforcement_mode = "supervisor-netns"` to use +the existing netns/veth/nft path when the sandbox pod has the required Linux +capabilities. Set `network_enforcement_mode = "external-enforcer"` to try the +node-enforcer topology; the workload supervisor registers with a node-side +enforcer, which installs coarse pod-netns egress rules while dynamic endpoint +policy stays inside the proxy. + ## Sandbox Resource The driver works with the `agents.x-k8s.io/v1alpha1` `Sandbox` custom resource. 
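The enforcer endpoint used in `external-enforcer` mode is a template that relies on Kubernetes `$(VAR)` expansion of container environment variables. The sketch below illustrates that behavior under stated assumptions: `endpoint_template` mirrors the driver's fallback to `http://$(OPENSHELL_NODE_IP):17671` when no endpoint is configured, while `expand_env_refs` is a hypothetical, simplified stand-in for the expansion that Kubernetes itself performs. It does not handle the `$$(VAR)` escape and is not code from this patch.

```rust
use std::collections::HashMap;

/// Fallback template the driver injects when `enforcer_endpoint` is empty.
const DEFAULT_ENFORCER_ENDPOINT: &str = "http://$(OPENSHELL_NODE_IP):17671";

/// Returns the configured template, or the default when none is set.
fn endpoint_template(configured: &str) -> &str {
    if configured.is_empty() {
        DEFAULT_ENFORCER_ENDPOINT
    } else {
        configured
    }
}

/// Simplified `$(VAR)` expansion: known variables are substituted, unknown
/// references are left unchanged. Illustrative only.
fn expand_env_refs(template: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("$(") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find(')') {
            Some(end) => {
                let name = &after[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        // Unresolved references stay as written.
                        out.push_str("$(");
                        out.push_str(name);
                        out.push(')');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated reference: copy the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Kubernetes only expands references to variables defined earlier in the same container's `env` list, which is why the driver injects `OPENSHELL_NODE_IP` (from `status.hostIP`) before `OPENSHELL_ENFORCER_ENDPOINT`.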
diff --git a/crates/openshell-driver-kubernetes/src/config.rs b/crates/openshell-driver-kubernetes/src/config.rs index 48004fa4b..84b186bb3 100644 --- a/crates/openshell-driver-kubernetes/src/config.rs +++ b/crates/openshell-driver-kubernetes/src/config.rs @@ -2,6 +2,7 @@ // SPDX-License-Identifier: Apache-2.0 use openshell_core::config::DEFAULT_SUPERVISOR_IMAGE; +use openshell_core::sandbox_env::{NetworkEnforcementMode, SupervisorRole}; use serde::{Deserialize, Deserializer, Serialize}; use std::str::FromStr; @@ -176,6 +177,19 @@ pub struct KubernetesComputeConfig { deserialize_with = "deserialize_optional_app_armor_profile" )] pub app_armor_profile: Option, + /// Runtime role passed to the sandbox supervisor binary. + pub supervisor_role: SupervisorRole, + /// Network enforcement mode passed to the sandbox supervisor binary. + pub network_enforcement_mode: NetworkEnforcementMode, + /// Endpoint template for a node/host enforcer. Supports Kubernetes env + /// expansion such as `http://$(OPENSHELL_NODE_IP):17671`. + pub enforcer_endpoint: String, + /// Set `securityContext.privileged` on sandbox pod containers. + /// + /// This is a deployment-wide compatibility escape hatch for clusters that + /// require privileged pod admission. It weakens the container boundary and + /// should stay disabled unless the Kubernetes environment is trusted. + pub privileged: bool, pub workspace_default_storage_size: String, /// Default Kubernetes `runtimeClassName` for sandbox pods. /// Applied when a `CreateSandbox` request does not specify one. 
@@ -221,6 +235,10 @@ impl Default for KubernetesComputeConfig { host_gateway_ip: String::new(), enable_user_namespaces: false, app_armor_profile: None, + supervisor_role: SupervisorRole::Workload, + network_enforcement_mode: NetworkEnforcementMode::SoftProxy, + enforcer_endpoint: String::new(), + privileged: false, workspace_default_storage_size: DEFAULT_WORKSPACE_STORAGE_SIZE.to_string(), default_runtime_class_name: String::new(), sa_token_ttl_secs: 3600, @@ -362,4 +380,39 @@ mod tests { let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); assert_eq!(cfg.image_pull_secrets, ["regcred", "backup-regcred"]); } + + #[test] + fn serde_override_privileged() { + let json = serde_json::json!({ + "privileged": true + }); + let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); + assert!(cfg.privileged); + } + + #[test] + fn default_kubernetes_supervisor_mode_is_soft_workload() { + let cfg = KubernetesComputeConfig::default(); + assert_eq!(cfg.supervisor_role, SupervisorRole::Workload); + assert_eq!( + cfg.network_enforcement_mode, + NetworkEnforcementMode::SoftProxy + ); + } + + #[test] + fn serde_override_supervisor_network_mode() { + let json = serde_json::json!({ + "supervisor_role": "combined", + "network_enforcement_mode": "supervisor-netns", + "enforcer_endpoint": "http://$(OPENSHELL_NODE_IP):17671" + }); + let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); + assert_eq!(cfg.supervisor_role, SupervisorRole::Combined); + assert_eq!( + cfg.network_enforcement_mode, + NetworkEnforcementMode::SupervisorNetns + ); + assert_eq!(cfg.enforcer_endpoint, "http://$(OPENSHELL_NODE_IP):17671"); + } } diff --git a/crates/openshell-driver-kubernetes/src/driver.rs b/crates/openshell-driver-kubernetes/src/driver.rs index 583f3cb99..945993dee 100644 --- a/crates/openshell-driver-kubernetes/src/driver.rs +++ b/crates/openshell-driver-kubernetes/src/driver.rs @@ -9,6 +9,7 @@ use crate::config::{ }; use futures::{Stream, 
StreamExt, TryStreamExt}; use k8s_openapi::api::core::v1::{Event as KubeEventObj, Node}; +use k8s_openapi::api::node::v1::RuntimeClass; use kube::api::{Api, ApiResource, DeleteParams, ListParams, PostParams}; use kube::core::gvk::GroupVersionKind; use kube::core::{DynamicObject, ObjectMeta}; @@ -28,6 +29,7 @@ use openshell_core::proto::compute::v1::{ GetCapabilitiesResponse, WatchSandboxesDeletedEvent, WatchSandboxesEvent, WatchSandboxesPlatformEvent, WatchSandboxesSandboxEvent, watch_sandboxes_event, }; +use openshell_core::sandbox_env::{NetworkEnforcementMode, SupervisorRole}; use serde::Deserialize; use std::collections::BTreeMap; use std::pin::Pin; @@ -174,25 +176,30 @@ impl std::fmt::Debug for KubernetesComputeDriver { } impl KubernetesComputeDriver { - pub async fn new(config: KubernetesComputeConfig) -> Result { + pub async fn new(mut config: KubernetesComputeConfig) -> Result { let base_config = match kube::Config::incluster() { Ok(c) => c, Err(_) => kube::Config::infer() .await - .map_err(kube::Error::InferConfig)?, + .map_err(kube::Error::InferConfig) + .map_err(KubernetesDriverError::from_kube)?, }; let mut kube_config = base_config.clone(); kube_config.connect_timeout = Some(Duration::from_secs(10)); kube_config.read_timeout = Some(Duration::from_secs(30)); kube_config.write_timeout = Some(Duration::from_secs(30)); - let client = Client::try_from(kube_config)?; + let client = Client::try_from(kube_config).map_err(KubernetesDriverError::from_kube)?; let mut watch_kube_config = base_config; watch_kube_config.connect_timeout = Some(Duration::from_secs(10)); watch_kube_config.read_timeout = None; watch_kube_config.write_timeout = Some(Duration::from_secs(30)); - let watch_client = Client::try_from(watch_kube_config)?; + let watch_client = + Client::try_from(watch_kube_config).map_err(KubernetesDriverError::from_kube)?; + + config.default_runtime_class_name = config.default_runtime_class_name.trim().to_string(); + validate_runtime_class_exists(&client, 
&config.default_runtime_class_name).await?; Ok(Self { client, @@ -244,6 +251,17 @@ impl KubernetesComputeDriver { })) } + async fn validate_sandbox_runtime_class_override( + &self, + sandbox: &Sandbox, + ) -> Result<(), KubernetesDriverError> { + let Some(runtime_class_name) = sandbox_runtime_class_override(sandbox) else { + return Ok(()); + }; + + validate_runtime_class_exists(&self.client, &runtime_class_name).await + } + pub async fn validate_sandbox_create(&self, sandbox: &Sandbox) -> Result<(), tonic::Status> { let gpu_requested = sandbox.spec.as_ref().is_some_and(|spec| spec.gpu); if gpu_requested @@ -255,6 +273,17 @@ impl KubernetesComputeDriver { "GPU sandbox requested, but the active gateway has no allocatable GPUs. Please refer to documentation and use `openshell doctor` commands to inspect GPU support and gateway configuration.", )); } + self.validate_sandbox_runtime_class_override(sandbox) + .await + .map_err(|err| match err { + KubernetesDriverError::Precondition(message) => { + tonic::Status::failed_precondition(message) + } + KubernetesDriverError::AlreadyExists => { + tonic::Status::already_exists("sandbox already exists") + } + KubernetesDriverError::Message(message) => tonic::Status::internal(message), + })?; Ok(()) } @@ -346,6 +375,9 @@ impl KubernetesComputeDriver { "Creating sandbox in Kubernetes" ); + self.validate_sandbox_runtime_class_override(sandbox) + .await?; + let gvk = GroupVersionKind::gvk(SANDBOX_GROUP, SANDBOX_VERSION, SANDBOX_KIND); let resource = ApiResource::from_gvk(&gvk); let mut obj = DynamicObject::new(name, &resource); @@ -371,6 +403,10 @@ impl KubernetesComputeDriver { host_gateway_ip: &self.config.host_gateway_ip, enable_user_namespaces: self.config.enable_user_namespaces, app_armor_profile: self.config.app_armor_profile.as_ref(), + supervisor_role: self.config.supervisor_role, + network_enforcement_mode: self.config.network_enforcement_mode, + enforcer_endpoint: &self.config.enforcer_endpoint, + privileged: 
self.config.privileged, workspace_default_storage_size: &self.config.workspace_default_storage_size, default_runtime_class_name: &self.config.default_runtime_class_name, sa_token_ttl_secs: self.config.effective_sa_token_ttl_secs(), @@ -1085,6 +1121,10 @@ struct SandboxPodParams<'a> { host_gateway_ip: &'a str, enable_user_namespaces: bool, app_armor_profile: Option<&'a AppArmorProfile>, + supervisor_role: SupervisorRole, + network_enforcement_mode: NetworkEnforcementMode, + enforcer_endpoint: &'a str, + privileged: bool, workspace_default_storage_size: &'a str, default_runtime_class_name: &'a str, /// Lifetime (seconds) of the projected `ServiceAccount` token used @@ -1110,6 +1150,10 @@ impl Default for SandboxPodParams<'_> { host_gateway_ip: "", enable_user_namespaces: false, app_armor_profile: None, + supervisor_role: SupervisorRole::Workload, + network_enforcement_mode: NetworkEnforcementMode::SoftProxy, + enforcer_endpoint: "", + privileged: false, workspace_default_storage_size: DEFAULT_WORKSPACE_STORAGE_SIZE, default_runtime_class_name: "", sa_token_ttl_secs: 3600, @@ -1117,6 +1161,48 @@ impl Default for SandboxPodParams<'_> { } } +async fn validate_runtime_class_exists( + client: &Client, + runtime_class_name: &str, +) -> Result<(), KubernetesDriverError> { + let runtime_class_name = runtime_class_name.trim(); + if runtime_class_name.is_empty() { + return Ok(()); + } + + let runtime_classes: Api<RuntimeClass> = Api::all(client.clone()); + match tokio::time::timeout(KUBE_API_TIMEOUT, runtime_classes.get(runtime_class_name)).await { + Ok(Ok(_runtime_class)) => { + info!( + runtime_class_name, + "Validated configured Kubernetes RuntimeClass" + ); + Ok(()) + } + Ok(Err(KubeError::Api(err))) if err.code == 404 => { + Err(KubernetesDriverError::Precondition(format!( + "Kubernetes RuntimeClass '{runtime_class_name}' does not exist; create it before starting OpenShell or choose an existing RuntimeClass" + ))) + } + Ok(Err(err)) => Err(KubernetesDriverError::Message(format!( +
"failed to validate Kubernetes RuntimeClass '{runtime_class_name}': {err}" + ))), + Err(_elapsed) => Err(KubernetesDriverError::Message(format!( + "timed out after {}s validating Kubernetes RuntimeClass '{runtime_class_name}'", + KUBE_API_TIMEOUT.as_secs() + ))), + } +} + +fn sandbox_runtime_class_override(sandbox: &Sandbox) -> Option { + sandbox + .spec + .as_ref()? + .template + .as_ref() + .and_then(|template| platform_config_string(template, "runtime_class_name")) +} + fn spec_pod_env(spec: Option<&SandboxSpec>) -> std::collections::HashMap { let mut env = spec.map_or_else(Default::default, |s| s.environment.clone()); if let Some(s) = spec.filter(|s| !s.log_level.is_empty()) { @@ -1337,10 +1423,49 @@ fn sandbox_template_to_k8s( params.ssh_socket_path, !params.client_tls_secret_name.is_empty(), ); + let mut env = env; + upsert_env( + &mut env, + openshell_core::sandbox_env::SUPERVISOR_ROLE, + params.supervisor_role.as_str(), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE, + params.network_enforcement_mode.as_str(), + ); + if matches!( + params.network_enforcement_mode, + NetworkEnforcementMode::ExternalEnforcer + ) { + upsert_env_field_ref( + &mut env, + openshell_core::sandbox_env::NODE_IP, + "status.hostIP", + ); + upsert_env_field_ref( + &mut env, + openshell_core::sandbox_env::POD_IP, + "status.podIP", + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::ENFORCER_ENDPOINT, + if params.enforcer_endpoint.is_empty() { + "http://$(OPENSHELL_NODE_IP):17671" + } else { + params.enforcer_endpoint + }, + ); + } container.insert("env".to_string(), serde_json::Value::Array(env)); - let mut capabilities: Vec<&str> = vec!["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "SYSLOG"]; + let mut capabilities: Vec<&str> = if pod_uses_supervisor_netns(params) { + vec!["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "SYSLOG"] + } else { + vec!["SYS_PTRACE"] + }; if use_user_namespaces { // In a user namespace the bounding set is reset. 
SETUID/SETGID are // needed for the supervisor to drop privileges to the sandbox user. @@ -1356,6 +1481,9 @@ fn sandbox_template_to_k8s( if let Some(profile) = params.app_armor_profile { security_context["appArmorProfile"] = app_armor_profile_to_k8s(profile); } + if params.privileged { + security_context["privileged"] = serde_json::json!(true); + } container.insert("securityContext".to_string(), security_context); // Mount client TLS secret for mTLS to the server, plus the projected @@ -1727,12 +1855,44 @@ fn upsert_env(env: &mut Vec<serde_json::Value>, name: &str, value: &str) { env.push(serde_json::json!({"name": name, "value": value})); } +fn upsert_env_field_ref(env: &mut Vec<serde_json::Value>, name: &str, field_path: &str) { + let value = serde_json::json!({ + "name": name, + "valueFrom": { + "fieldRef": { + "fieldPath": field_path + } + } + }); + if let Some(existing) = env + .iter_mut() + .find(|item| item.get("name").and_then(|value| value.as_str()) == Some(name)) + { + *existing = value; + return; + } + + env.push(value); +} + +fn pod_uses_supervisor_netns(params: &SandboxPodParams<'_>) -> bool { + matches!( + params.network_enforcement_mode, + NetworkEnforcementMode::SupervisorNetns + ) || (matches!( + params.network_enforcement_mode, + NetworkEnforcementMode::Auto + ) && matches!(params.supervisor_role, SupervisorRole::Combined)) +} + /// Extract a string value from the template's `platform_config` Struct.
fn platform_config_string(template: &SandboxTemplate, key: &str) -> Option<String> { let config = template.platform_config.as_ref()?; let value = config.fields.get(key)?; match value.kind.as_ref() { - Some(prost_types::value::Kind::StringValue(s)) if !s.is_empty() => Some(s.clone()), + Some(prost_types::value::Kind::StringValue(s)) if !s.trim().is_empty() => { + Some(s.trim().to_string()) + } _ => None, } } @@ -1919,6 +2079,51 @@ mod tests { assert!(config.containers.agent.resources.limits.is_empty()); } + fn pod_env_value<'a>(pod_template: &'a serde_json::Value, name: &str) -> Option<&'a str> { + pod_template["spec"]["containers"][0]["env"] + .as_array()? + .iter() + .find(|item| item.get("name").and_then(|value| value.as_str()) == Some(name)) + .and_then(|item| item.get("value")) + .and_then(|value| value.as_str()) + } + + fn pod_env_field_ref<'a>(pod_template: &'a serde_json::Value, name: &str) -> Option<&'a str> { + pod_template["spec"]["containers"][0]["env"] + .as_array()? + .iter() + .find(|item| item.get("name").and_then(|value| value.as_str()) == Some(name)) + .and_then(|item| item.pointer("/valueFrom/fieldRef/fieldPath")) + .and_then(|value| value.as_str()) + } + + #[test] + fn sandbox_runtime_class_override_reads_template_value() { + let sandbox = Sandbox { + spec: Some(SandboxSpec { + template: Some(SandboxTemplate { + platform_config: Some(Struct { + fields: std::iter::once(( + "runtime_class_name".to_string(), + Value { + kind: Some(Kind::StringValue(" kata-qemu ".to_string())), + }, + )) + .collect(), + }), + ..SandboxTemplate::default() + }), + ..SandboxSpec::default() + }), + ..Sandbox::default() + }; + + assert_eq!( + sandbox_runtime_class_override(&sandbox), + Some("kata-qemu".to_string()) + ); + } + #[test] fn kube_pulling_event_adds_image_progress_metadata() { let mut metadata = std::collections::HashMap::new(); @@ -2485,6 +2690,95 @@ mod tests { ); } + #[test] + fn template_runtime_class_name_is_trimmed() { + let template = SandboxTemplate { +
platform_config: Some(Struct { + fields: std::iter::once(( + "runtime_class_name".to_string(), + Value { + kind: Some(Kind::StringValue(" gvisor ".to_string())), + }, + )) + .collect(), + }), + ..SandboxTemplate::default() + }; + let pod_template = sandbox_template_to_k8s( + &template, + false, + &std::collections::HashMap::new(), + true, + &SandboxPodParams::default(), + ); + + assert_eq!( + pod_template["spec"]["runtimeClassName"], + serde_json::json!("gvisor") + ); + } + + #[test] + fn kubernetes_default_sets_workload_soft_proxy_env() { + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate::default(), + false, + &std::collections::HashMap::new(), + true, + &SandboxPodParams::default(), + ); + + assert_eq!( + pod_env_value(&pod_template, openshell_core::sandbox_env::SUPERVISOR_ROLE), + Some("workload") + ); + assert_eq!( + pod_env_value( + &pod_template, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE + ), + Some("soft-proxy") + ); + } + + #[test] + fn external_enforcer_mode_injects_node_registration_env() { + let params = SandboxPodParams { + network_enforcement_mode: NetworkEnforcementMode::ExternalEnforcer, + ..Default::default() + }; + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate::default(), + false, + &std::collections::HashMap::new(), + true, + &params, + ); + + assert_eq!( + pod_env_value( + &pod_template, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE + ), + Some("external-enforcer") + ); + assert_eq!( + pod_env_value( + &pod_template, + openshell_core::sandbox_env::ENFORCER_ENDPOINT + ), + Some("http://$(OPENSHELL_NODE_IP):17671") + ); + assert_eq!( + pod_env_field_ref(&pod_template, openshell_core::sandbox_env::NODE_IP), + Some("status.hostIP") + ); + assert_eq!( + pod_env_field_ref(&pod_template, openshell_core::sandbox_env::POD_IP), + Some("status.podIP") + ); + } + #[test] fn gpu_sandbox_preserves_existing_resource_limits() { use openshell_core::proto::compute::v1::DriverResourceRequirements; @@ -2816,7
+3110,7 @@ mod tests { ); assert_eq!( pod_template["spec"]["containers"][0]["securityContext"]["capabilities"]["add"][0], - serde_json::json!("SYS_ADMIN"), + serde_json::json!("SYS_PTRACE"), "AppArmor rendering must preserve required capabilities" ); } @@ -2855,7 +3149,7 @@ mod tests { let caps = pod_template["spec"]["containers"][0]["securityContext"]["capabilities"]["add"] .as_array() .unwrap(); - assert_eq!(caps.len(), 4); + assert_eq!(caps, &[serde_json::json!("SYS_PTRACE")]); assert!(!caps.contains(&serde_json::json!("SETUID"))); } @@ -2875,14 +3169,34 @@ mod tests { let caps = pod_template["spec"]["containers"][0]["securityContext"]["capabilities"]["add"] .as_array() .unwrap(); - assert!(caps.contains(&serde_json::json!("SYS_ADMIN"))); - assert!(caps.contains(&serde_json::json!("NET_ADMIN"))); assert!(caps.contains(&serde_json::json!("SYS_PTRACE"))); - assert!(caps.contains(&serde_json::json!("SYSLOG"))); assert!(caps.contains(&serde_json::json!("SETUID"))); assert!(caps.contains(&serde_json::json!("SETGID"))); assert!(caps.contains(&serde_json::json!("DAC_READ_SEARCH"))); - assert_eq!(caps.len(), 7); + assert_eq!(caps.len(), 4); + } + + #[test] + fn supervisor_netns_mode_adds_network_capabilities() { + let params = SandboxPodParams { + network_enforcement_mode: NetworkEnforcementMode::SupervisorNetns, + ..Default::default() + }; + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate::default(), + false, + &std::collections::HashMap::new(), + true, + &params, + ); + let caps = pod_template["spec"]["containers"][0]["securityContext"]["capabilities"]["add"] + .as_array() + .unwrap(); + assert!(caps.contains(&serde_json::json!("SYS_ADMIN"))); + assert!(caps.contains(&serde_json::json!("NET_ADMIN"))); + assert!(caps.contains(&serde_json::json!("SYS_PTRACE"))); + assert!(caps.contains(&serde_json::json!("SYSLOG"))); + assert_eq!(caps.len(), 4); } #[test] @@ -2956,11 +3270,48 @@ mod tests { .unwrap(); assert_eq!( caps.len(), - 4, + 1, "extra capabilities must
not be added when user namespaces are disabled" ); } + #[test] + fn configured_privileged_sets_container_security_context() { + let params = SandboxPodParams { + privileged: true, + ..SandboxPodParams::default() + }; + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate::default(), + false, + &std::collections::HashMap::new(), + true, + &params, + ); + + assert_eq!( + pod_template["spec"]["containers"][0]["securityContext"]["privileged"], + serde_json::json!(true), + "privileged config must map to container securityContext" + ); + } + + #[test] + fn privileged_omitted_by_default() { + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate::default(), + false, + &std::collections::HashMap::new(), + true, + &SandboxPodParams::default(), + ); + + assert!( + pod_template["spec"]["containers"][0]["securityContext"]["privileged"].is_null(), + "privileged must be omitted unless configured" + ); + } + #[test] fn automount_service_account_token_is_disabled() { let pod_template = { diff --git a/crates/openshell-driver-kubernetes/src/main.rs b/crates/openshell-driver-kubernetes/src/main.rs index a2b0e2790..ae714e655 100644 --- a/crates/openshell-driver-kubernetes/src/main.rs +++ b/crates/openshell-driver-kubernetes/src/main.rs @@ -9,6 +9,7 @@ use tracing_subscriber::EnvFilter; use openshell_core::VERSION; use openshell_core::proto::compute::v1::compute_driver_server::ComputeDriverServer; +use openshell_core::sandbox_env::{NetworkEnforcementMode, SupervisorRole}; use openshell_driver_kubernetes::{ AppArmorProfile, ComputeDriverService, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, KubernetesComputeConfig, KubernetesComputeDriver, SupervisorSideloadMethod, @@ -86,6 +87,31 @@ struct Args { #[arg(long, env = "OPENSHELL_K8S_APP_ARMOR_PROFILE")] app_armor_profile: Option, + /// Runtime role passed to openshell-sandbox in workload pods.
+ #[arg( + long, + env = "OPENSHELL_SUPERVISOR_ROLE", + default_value_t = SupervisorRole::Workload + )] + supervisor_role: SupervisorRole, + + /// Network enforcement mode passed to openshell-sandbox in workload pods. + #[arg( + long, + env = "OPENSHELL_NETWORK_ENFORCEMENT_MODE", + default_value_t = NetworkEnforcementMode::SoftProxy + )] + network_enforcement_mode: NetworkEnforcementMode, + + /// Endpoint template for the optional node/host enforcer. + #[arg(long, env = "OPENSHELL_ENFORCER_ENDPOINT")] + enforcer_endpoint: Option, + + /// Default Kubernetes `runtimeClassName` for sandbox pods. + /// Per-sandbox template `runtime_class_name` values override this default. + #[arg(long, env = "OPENSHELL_K8S_DEFAULT_RUNTIME_CLASS_NAME")] + default_runtime_class_name: Option, + /// Lifetime (seconds) of the projected `ServiceAccount` token /// kubelet writes into each sandbox pod for the `IssueSandboxToken` /// bootstrap exchange. Kubelet enforces a minimum of 600s; the @@ -120,14 +146,17 @@ async fn main() -> Result<()> { host_gateway_ip: args.host_gateway_ip.unwrap_or_default(), enable_user_namespaces: args.enable_user_namespaces, app_armor_profile: args.app_armor_profile, + supervisor_role: args.supervisor_role, + network_enforcement_mode: args.network_enforcement_mode, + enforcer_endpoint: args.enforcer_endpoint.unwrap_or_default(), + privileged: false, workspace_default_storage_size: std::env::var( "OPENSHELL_K8S_WORKSPACE_DEFAULT_STORAGE_SIZE", ) .unwrap_or_else(|_| { openshell_driver_kubernetes::DEFAULT_WORKSPACE_STORAGE_SIZE.to_string() }), - default_runtime_class_name: std::env::var("OPENSHELL_K8S_DEFAULT_RUNTIME_CLASS_NAME") - .unwrap_or_default(), + default_runtime_class_name: args.default_runtime_class_name.unwrap_or_default(), sa_token_ttl_secs: args.sa_token_ttl_secs, }) .await diff --git a/crates/openshell-sandbox/Cargo.toml b/crates/openshell-sandbox/Cargo.toml index 6d527bc53..e36095a48 100644 --- a/crates/openshell-sandbox/Cargo.toml +++ 
b/crates/openshell-sandbox/Cargo.toml @@ -63,6 +63,7 @@ sha1 = "0.10" # IP network / CIDR parsing ipnet = "2" +url = { workspace = true } # Serialization serde = { workspace = true } diff --git a/crates/openshell-sandbox/src/enforcer.rs b/crates/openshell-sandbox/src/enforcer.rs new file mode 100644 index 000000000..cf3604c4e --- /dev/null +++ b/crates/openshell-sandbox/src/enforcer.rs @@ -0,0 +1,596 @@ +// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + +//! Node/host enforcer entrypoint and workload registration client. +//! +//! The node enforcer runs as a privileged host-PID `DaemonSet`. Workload +//! supervisors register after loading policy and before spawning the agent +//! process. The enforcer locates the pod network namespace by pod IP and +//! installs coarse nftables rules in that namespace: +//! +//! - loopback traffic is allowed so sandbox processes can reach the local proxy +//! - UID 0 traffic is allowed so the supervisor-owned proxy can reach upstreams +//! - non-root TCP/UDP egress is rejected, forcing sandbox-user traffic through +//! 
the proxy where OPA and L7 policy are evaluated + +use miette::{IntoDiagnostic, Result}; +use openshell_core::sandbox_env::NetworkEnforcementMode; +use serde::Deserialize; +use std::net::{IpAddr, SocketAddr}; +#[cfg(target_os = "linux")] +use std::os::fd::AsRawFd; +#[cfg(target_os = "linux")] +use std::os::unix::process::CommandExt; +#[cfg(target_os = "linux")] +use std::path::{Path, PathBuf}; +#[cfg(target_os = "linux")] +use std::process::Command; +#[cfg(target_os = "linux")] +use std::time::Duration; +use tokio::io::{AsyncReadExt, AsyncWriteExt}; +use tokio::net::{TcpListener, TcpStream}; +use tracing::{debug, info, warn}; +use url::Url; + +pub const DEFAULT_LISTEN_ADDR: &str = "0.0.0.0:17671"; +#[cfg(target_os = "linux")] +const ENFORCER_TABLE: &str = "openshell_external_enforcer"; +#[cfg(target_os = "linux")] +const NFT_SEARCH_PATHS: &[&str] = &["/usr/sbin/nft", "/sbin/nft", "/usr/bin/nft"]; +#[cfg(target_os = "linux")] +const PROC_ROOT: &str = "/proc"; + +#[derive(Debug, Clone)] +pub struct EnforcerRuntimeConfig { + pub listen_addr: SocketAddr, + pub network_enforcement_mode: NetworkEnforcementMode, +} + +#[derive(Debug, Clone)] +pub struct WorkloadRegistration { + pub endpoint: String, + pub sandbox_id: String, + pub sandbox_name: Option<String>, + pub pod_ip: Option<String>, +} + +#[derive(Debug, Deserialize)] +struct RegistrationPayload { + sandbox_id: String, + sandbox_name: Option<String>, + pod_ip: Option<String>, + protocol: Option<String>, +} + +pub async fn run(config: EnforcerRuntimeConfig) -> Result<()> { + let listener = TcpListener::bind(config.listen_addr) + .await + .into_diagnostic()?; + warn!( + listen_addr = %config.listen_addr, + network_enforcement_mode = %config.network_enforcement_mode, + "OpenShell node enforcer started; workload registrations install coarse pod-netns egress rules" + ); + info!( + listen_addr = %config.listen_addr, + "Node enforcer is watching for workload supervisor registrations" + ); + + loop { + let (stream, peer) = listener.accept().await.into_diagnostic()?;
+ let mode = config.network_enforcement_mode; + tokio::spawn(async move { + if let Err(error) = handle_registration(stream, peer, mode).await { + debug!(error = %error, peer = %peer, "Failed to handle enforcer request"); + } + }); + } +} + +async fn handle_registration( + mut stream: TcpStream, + peer: SocketAddr, + mode: NetworkEnforcementMode, +) -> Result<()> { + let request_bytes = read_http_request(&mut stream).await?; + let request = String::from_utf8_lossy(&request_bytes); + let request_line = request.lines().next().unwrap_or_default(); + let payload = request + .split_once("\r\n\r\n") + .and_then(|(_, body)| serde_json::from_str::<RegistrationPayload>(body).ok()); + + if let Some(payload) = payload { + info!( + peer = %peer, + request = request_line, + sandbox_id = %payload.sandbox_id, + sandbox_name = payload.sandbox_name.as_deref().unwrap_or_default(), + pod_ip = payload.pod_ip.as_deref().unwrap_or_default(), + protocol = payload.protocol.as_deref().unwrap_or_default(), + "Observed sandbox workload registration" + ); + info!( + sandbox_id = %payload.sandbox_id, + pod_ip = payload.pod_ip.as_deref().unwrap_or_default(), + network_enforcement_mode = %mode, + action = "install-coarse-egress-enforcement", + "Reconciling sandbox network enforcement" + ); + + if matches!(mode, NetworkEnforcementMode::ExternalEnforcer) { + let target_ip = target_pod_ip(&payload, peer); + if let Some(pod_ip) = target_ip { + let sandbox_id = payload.sandbox_id.clone(); + install_external_enforcement_blocking(sandbox_id, pod_ip).await?; + } else { + warn!( + peer = %peer, + sandbox_id = %payload.sandbox_id, + pod_ip = payload.pod_ip.as_deref().unwrap_or_default(), + "Skipping host-side enforcement because no non-loopback pod IP is available" + ); + } + } + } else { + warn!( + peer = %peer, + request = request_line, + "Accepted workload supervisor registration without parseable payload" + ); + } + + let response = concat!( + "HTTP/1.1 202 Accepted\r\n", + "Content-Length: 0\r\n", + "Connection:
close\r\n", + "\r\n" + ); + stream + .write_all(response.as_bytes()) + .await + .into_diagnostic()?; + Ok(()) +} + +async fn read_http_request(stream: &mut TcpStream) -> Result<Vec<u8>> { + let mut buf = vec![0_u8; 8192]; + let mut len = 0_usize; + let mut header_end = None; + + while len < buf.len() { + let read = stream.read(&mut buf[len..]).await.into_diagnostic()?; + if read == 0 { + break; + } + len += read; + if let Some(pos) = find_header_end(&buf[..len]) { + header_end = Some(pos); + break; + } + } + + let Some(header_end) = header_end else { + buf.truncate(len); + return Ok(buf); + }; + + let headers = String::from_utf8_lossy(&buf[..header_end]); + let content_length = headers + .lines() + .find_map(|line| line.split_once(':')) + .and_then(|(name, value)| { + name.eq_ignore_ascii_case("content-length") + .then(|| value.trim().parse::<usize>().ok()) + .flatten() + }) + .unwrap_or(0); + let request_len = header_end + 4 + content_length; + + if request_len > buf.len() { + buf.resize(request_len, 0); + } + while len < request_len { + let read = stream + .read(&mut buf[len..request_len]) + .await + .into_diagnostic()?; + if read == 0 { + break; + } + len += read; + } + + buf.truncate(len); + Ok(buf) +} + +fn find_header_end(buf: &[u8]) -> Option<usize> { + buf.windows(4).position(|window| window == b"\r\n\r\n") +} + +fn target_pod_ip(payload: &RegistrationPayload, peer: SocketAddr) -> Option<IpAddr> { + payload + .pod_ip + .as_deref() + .and_then(|ip| ip.parse::<IpAddr>().ok()) + .or_else(|| (!peer.ip().is_loopback()).then_some(peer.ip())) + .filter(|ip| !ip.is_loopback()) +} + +#[cfg(target_os = "linux")] +async fn install_external_enforcement_blocking(sandbox_id: String, pod_ip: IpAddr) -> Result<()> { + tokio::task::spawn_blocking(move || install_external_enforcement(&sandbox_id, pod_ip)) + .await + .map_err(|error| miette::miette!("enforcer task panicked: {error}"))?
+} + +#[cfg(not(target_os = "linux"))] +fn install_external_enforcement_blocking( + sandbox_id: String, + pod_ip: IpAddr, +) -> std::future::Ready<Result<()>> { + let _ = sandbox_id; + std::future::ready(Err(miette::miette!( + "external node enforcement is only supported on Linux nodes (pod_ip={pod_ip})" + ))) +} + +#[cfg(target_os = "linux")] +fn install_external_enforcement(sandbox_id: &str, pod_ip: IpAddr) -> Result<()> { + info!( + sandbox_id, + pod_ip = %pod_ip, + "Installing sandbox network egress enforcement for pod {pod_ip}" + ); + let netns_path = find_pod_netns_path(pod_ip)?; + let nft_path = find_nft().ok_or_else(|| { + miette::miette!( + "nft binary not found; node enforcer image must include nftables for external enforcement" + ) + })?; + + delete_enforcer_table(&netns_path, &nft_path); + let ruleset = generate_external_enforcer_ruleset(&external_log_prefix(sandbox_id)); + run_nft_ruleset_in_netns(&netns_path, &nft_path, &ruleset)?; + + info!( + sandbox_id, + pod_ip = %pod_ip, + netns = %netns_path.display(), + "Sandbox network egress enforcement installed for pod {pod_ip} in {}", + netns_path.display() + ); + Ok(()) +} + +#[cfg(target_os = "linux")] +fn find_pod_netns_path(pod_ip: IpAddr) -> Result<PathBuf> { + let mut last_error = None; + for _ in 0..20 { + match find_pod_netns_path_once(pod_ip) { + Ok(path) => return Ok(path), + Err(error) => last_error = Some(error), + } + std::thread::sleep(Duration::from_millis(100)); + } + + Err(last_error.unwrap_or_else(|| { + miette::miette!("failed to find pod network namespace for pod IP {pod_ip}") + })) +} + +#[cfg(target_os = "linux")] +fn find_pod_netns_path_once(pod_ip: IpAddr) -> Result<PathBuf> { + let proc = Path::new(PROC_ROOT); + let entries = std::fs::read_dir(proc).into_diagnostic()?; + for entry in entries.flatten() { + let file_name = entry.file_name(); + let Some(pid) = file_name.to_str() else { + continue; + }; + if !pid.as_bytes().iter().all(u8::is_ascii_digit) { + continue; + } + + let netns_path =
proc.join(pid).join("ns/net"); + if !netns_path.exists() { + continue; + } + + if proc_netns_has_local_ip(&proc.join(pid), pod_ip) { + return Ok(netns_path); + } + } + + Err(miette::miette!( + "failed to find pod network namespace for pod IP {pod_ip}" + )) +} + +#[cfg(target_os = "linux")] +fn proc_netns_has_local_ip(proc_pid_path: &Path, pod_ip: IpAddr) -> bool { + match pod_ip { + IpAddr::V4(ip) => { + let fib_trie_path = proc_pid_path.join("net/fib_trie"); + let Ok(fib_trie) = std::fs::read_to_string(&fib_trie_path) else { + return false; + }; + fib_trie_contains_local_address(&fib_trie, &ip.to_string()) + } + IpAddr::V6(ip) => { + let if_inet6_path = proc_pid_path.join("net/if_inet6"); + let Ok(if_inet6) = std::fs::read_to_string(&if_inet6_path) else { + return false; + }; + let compact = ip + .segments() + .iter() + .map(|segment| format!("{segment:04x}")) + .collect::<String>(); + if_inet6 + .lines() + .filter_map(|line| line.split_whitespace().next()) + .any(|address| address.eq_ignore_ascii_case(&compact)) + } + } +} + +#[cfg(target_os = "linux")] +fn fib_trie_contains_local_address(fib_trie: &str, pod_ip: &str) -> bool { + let mut matched_leaf = false; + + for line in fib_trie.lines() { + let mut parts = line.split_whitespace(); + if let (Some(marker), Some(address)) = (parts.next(), parts.next()) + && marker.ends_with("--") + { + matched_leaf = address == pod_ip; + continue; + } + + if matched_leaf && line.split_whitespace().eq(["/32", "host", "LOCAL"]) { + return true; + } + } + + false +} + +#[cfg(target_os = "linux")] +fn delete_enforcer_table(netns_path: &Path, nft_path: &str) { + if let Err(error) = run_nft_args_in_netns( + netns_path, + nft_path, + &["delete", "table", "inet", ENFORCER_TABLE], + ) { + debug!( + error = %error, + netns = %netns_path.display(), + "No prior external enforcer nftables table to delete" + ); + } +} + +#[cfg(target_os = "linux")] +fn run_nft_ruleset_in_netns(netns_path: &Path, nft_path: &str, ruleset: &str) -> Result<()> { + let
mut file = tempfile::Builder::new() + .prefix("openshell-external-enforcer-") + .suffix(".nft") + .tempfile() + .into_diagnostic()?; + std::io::Write::write_all(&mut file, ruleset.as_bytes()).into_diagnostic()?; + let path = file.path().to_string_lossy().to_string(); + run_nft_args_in_netns(netns_path, nft_path, &["-f", &path]) +} + +#[cfg(target_os = "linux")] +fn run_nft_args_in_netns(netns_path: &Path, nft_path: &str, args: &[&str]) -> Result<()> { + let netns = std::fs::File::open(netns_path).into_diagnostic()?; + let fd = netns.as_raw_fd(); + let output = { + let mut command = Command::new(nft_path); + command.args(args); + // SAFETY: pre_exec runs in the child after fork and before exec. setns + // is async-signal-safe and only affects the child process. + #[allow(unsafe_code)] + unsafe { + command.pre_exec(move || { + let result = libc::setns(fd, libc::CLONE_NEWNET); + if result != 0 { + return Err(std::io::Error::last_os_error()); + } + Ok(()) + }); + } + command.output().into_diagnostic()? 
+ }; + + if output.status.success() { + return Ok(()); + } + + let stderr = String::from_utf8_lossy(&output.stderr); + Err(miette::miette!( + "nft command failed in pod netns: {}", + stderr.trim() + )) +} + +#[cfg(target_os = "linux")] +fn generate_external_enforcer_ruleset(log_prefix: &str) -> String { + format!( + r#"table inet {ENFORCER_TABLE} {{ + chain output {{ + type filter hook output priority 0; policy accept; + + oifname "lo" accept + ct state established,related accept + meta skuid 0 accept + tcp flags syn limit rate 5/second burst 10 packets log prefix "{log_prefix}" flags skuid + meta nfproto ipv4 meta l4proto tcp reject with icmp type port-unreachable + meta nfproto ipv6 meta l4proto tcp reject with icmpv6 type port-unreachable + meta l4proto udp limit rate 5/second burst 10 packets log prefix "{log_prefix}" flags skuid + meta nfproto ipv4 meta l4proto udp reject with icmp type port-unreachable + meta nfproto ipv6 meta l4proto udp reject with icmpv6 type port-unreachable + }} +}} +"# + ) +} + +#[cfg(target_os = "linux")] +fn external_log_prefix(sandbox_id: &str) -> String { + let short_id: String = sandbox_id + .chars() + .filter(|ch| ch.is_ascii_alphanumeric() || *ch == '-') + .take(16) + .collect(); + format!("openshell:external:{short_id}:") +} + +#[cfg(target_os = "linux")] +fn find_nft() -> Option { + NFT_SEARCH_PATHS + .iter() + .find(|path| Path::new(path).is_file()) + .map(|path| (*path).to_string()) +} + +pub async fn register_workload(registration: WorkloadRegistration) -> Result<()> { + let url = Url::parse(registration.endpoint.trim()).into_diagnostic()?; + if url.scheme() != "http" { + return Err(miette::miette!( + "external enforcer endpoint must use http:// for the prototype registration protocol" + )); + } + + let host = url + .host_str() + .ok_or_else(|| miette::miette!("external enforcer endpoint is missing a host"))?; + let port = url + .port_or_known_default() + .ok_or_else(|| miette::miette!("external enforcer endpoint is missing 
a port"))?; + let addr = format!("{host}:{port}"); + let path = if url.path().is_empty() || url.path() == "/" { + format!("/v1/sandboxes/{}/register", registration.sandbox_id) + } else { + url.path().to_string() + }; + + let mut stream = TcpStream::connect(&addr).await.into_diagnostic()?; + let body = serde_json::json!({ + "sandbox_id": registration.sandbox_id, + "sandbox_name": registration.sandbox_name, + "pod_ip": registration.pod_ip, + "protocol": "openshell-node-enforcer-prototype-v1" + }) + .to_string(); + let request = format!( + "POST {path} HTTP/1.1\r\nHost: {host}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}", + body.len() + ); + stream + .write_all(request.as_bytes()) + .await + .into_diagnostic()?; + + let mut response = [0_u8; 128]; + let read = stream.read(&mut response).await.into_diagnostic()?; + let status = String::from_utf8_lossy(&response[..read]); + if status.starts_with("HTTP/1.1 202") || status.starts_with("HTTP/1.1 200") { + info!(endpoint = %registration.endpoint, "External enforcer registration acknowledged"); + return Ok(()); + } + + Err(miette::miette!( + "external enforcer registration failed: {}", + status.lines().next().unwrap_or("empty response") + )) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[tokio::test] + async fn register_workload_accepts_enforcer_ack() { + let listener = TcpListener::bind("127.0.0.1:0").await.unwrap(); + let addr = listener.local_addr().unwrap(); + let server = tokio::spawn(async move { + let (stream, peer) = listener.accept().await.unwrap(); + handle_registration(stream, peer, NetworkEnforcementMode::ExternalEnforcer) + .await + .unwrap(); + }); + + register_workload(WorkloadRegistration { + endpoint: format!("http://{addr}"), + sandbox_id: "sb-123".to_string(), + sandbox_name: Some("demo".to_string()), + pod_ip: None, + }) + .await + .unwrap(); + + server.await.unwrap(); + } + + #[cfg(target_os = "linux")] + #[test] + fn 
external_ruleset_allows_root_and_rejects_non_root_tcp_udp() { + let ruleset = generate_external_enforcer_ruleset("openshell:external:test:"); + + assert!(ruleset.contains("table inet openshell_external_enforcer")); + assert!(ruleset.contains("oifname \"lo\" accept")); + assert!(ruleset.contains("meta skuid 0 accept")); + assert!(ruleset.contains("meta nfproto ipv4 meta l4proto tcp reject")); + assert!(ruleset.contains("meta nfproto ipv4 meta l4proto udp reject")); + assert!( + ruleset.find("meta skuid 0 accept").unwrap() + < ruleset + .find("meta nfproto ipv4 meta l4proto tcp reject") + .unwrap() + ); + } + + #[cfg(target_os = "linux")] + #[test] + fn fib_trie_match_requires_local_address() { + let host_route = r#" + +-- 10.40.0.0/14 2 0 2 + |-- 10.40.1.33 + /32 host UNICAST + "#; + let pod_local = r#" + +-- 0.0.0.0/0 3 0 5 + |-- 10.40.1.33 + /32 host LOCAL + "#; + + assert!(!fib_trie_contains_local_address(host_route, "10.40.1.33")); + assert!(fib_trie_contains_local_address(pod_local, "10.40.1.33")); + } + + #[test] + fn target_pod_ip_prefers_payload_and_ignores_loopback() { + let payload = RegistrationPayload { + sandbox_id: "sb".to_string(), + sandbox_name: None, + pod_ip: Some("10.40.1.28".to_string()), + protocol: None, + }; + let peer = "127.0.0.1:1234".parse().unwrap(); + assert_eq!( + target_pod_ip(&payload, peer), + Some("10.40.1.28".parse().unwrap()) + ); + + let payload = RegistrationPayload { + sandbox_id: "sb".to_string(), + sandbox_name: None, + pod_ip: None, + protocol: None, + }; + assert_eq!(target_pod_ip(&payload, peer), None); + } +} diff --git a/crates/openshell-sandbox/src/lib.rs b/crates/openshell-sandbox/src/lib.rs index 231b588ea..0cf8db9c2 100644 --- a/crates/openshell-sandbox/src/lib.rs +++ b/crates/openshell-sandbox/src/lib.rs @@ -10,6 +10,7 @@ pub mod bypass_monitor; mod child_env; pub mod debug_rpc; pub mod denial_aggregator; +pub mod enforcer; mod grpc_client; mod identity; pub mod l7; @@ -43,6 +44,7 @@ use std::time::Duration; use 
tokio::time::timeout; use tracing::{debug, info, trace, warn}; +use openshell_core::sandbox_env::{NetworkEnforcementMode, SupervisorRole}; use openshell_ocsf::{ ActionId, ActivityId, AppLifecycleBuilder, ConfigStateChangeBuilder, DetectionFindingBuilder, DispositionId, FindingInfo, LaunchTypeId, Process as OcsfProcess, ProcessActivityBuilder, @@ -189,6 +191,63 @@ pub use sandbox::apply_supervisor_startup_hardening; /// refreshed. const DEFAULT_ROUTE_REFRESH_INTERVAL_SECS: u64 = 5; +#[derive(Debug, Clone)] +pub struct SupervisorRuntimeConfig { + pub role: SupervisorRole, + pub network_enforcement_mode: NetworkEnforcementMode, + pub enforcer_endpoint: Option, + pub pod_ip: Option, +} + +impl Default for SupervisorRuntimeConfig { + fn default() -> Self { + Self { + role: SupervisorRole::Combined, + network_enforcement_mode: NetworkEnforcementMode::Auto, + enforcer_endpoint: None, + pod_ip: None, + } + } +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum EffectiveNetworkEnforcementMode { + SoftProxy, + SupervisorNetns, + ExternalEnforcer, +} + +impl EffectiveNetworkEnforcementMode { + #[must_use] + pub const fn as_str(self) -> &'static str { + match self { + Self::SoftProxy => "soft-proxy", + Self::SupervisorNetns => "supervisor-netns", + Self::ExternalEnforcer => "external-enforcer", + } + } +} + +pub fn resolve_network_enforcement_mode( + role: SupervisorRole, + requested: NetworkEnforcementMode, +) -> EffectiveNetworkEnforcementMode { + match requested { + NetworkEnforcementMode::SoftProxy => EffectiveNetworkEnforcementMode::SoftProxy, + NetworkEnforcementMode::SupervisorNetns => EffectiveNetworkEnforcementMode::SupervisorNetns, + NetworkEnforcementMode::ExternalEnforcer => { + EffectiveNetworkEnforcementMode::ExternalEnforcer + } + NetworkEnforcementMode::Auto => { + if matches!(role, SupervisorRole::Workload) { + EffectiveNetworkEnforcementMode::SoftProxy + } else { + EffectiveNetworkEnforcementMode::SupervisorNetns + } + } + } +} + #[derive(Debug, 
Clone, Copy, PartialEq, Eq)] enum InferenceRouteSource { File, @@ -301,6 +360,7 @@ fn is_managed_child(pid: i32) -> bool { #[allow(clippy::too_many_arguments, clippy::similar_names)] pub async fn run_sandbox( command: Vec, + runtime_config: SupervisorRuntimeConfig, workdir: Option, timeout_secs: u64, interactive: bool, @@ -345,6 +405,30 @@ pub async fn run_sandbox( } } + let effective_network_enforcement = resolve_network_enforcement_mode( + runtime_config.role, + runtime_config.network_enforcement_mode, + ); + info!( + supervisor_role = %runtime_config.role, + requested_network_enforcement = %runtime_config.network_enforcement_mode, + effective_network_enforcement = effective_network_enforcement.as_str(), + "Resolved sandbox supervisor runtime mode" + ); + ocsf_emit!( + ConfigStateChangeBuilder::new(ocsf_ctx()) + .severity(SeverityId::Informational) + .status(StatusId::Success) + .state(StateId::Enabled, "resolved") + .message(format!( + "Supervisor mode resolved [role:{} requested_network:{} effective_network:{}]", + runtime_config.role, + runtime_config.network_enforcement_mode, + effective_network_enforcement.as_str() + )) + .build() + ); + // Load policy and initialize OPA engine let openshell_endpoint_for_proxy = openshell_endpoint.clone(); let sandbox_name_for_agg = sandbox.clone(); @@ -550,11 +634,62 @@ pub async fn run_sandbox( (None, None) }; + if matches!(policy.network.mode, NetworkMode::Proxy) { + match effective_network_enforcement { + EffectiveNetworkEnforcementMode::SoftProxy => { + warn!( + "Network enforcement is running in soft-proxy mode; proxy-aware traffic is enforced, but direct sockets are not kernel-blocked" + ); + ocsf_emit!( + ConfigStateChangeBuilder::new(ocsf_ctx()) + .severity(SeverityId::High) + .status(StatusId::Success) + .state(StateId::Other, "soft-proxy") + .message( + "Network enforcement is soft-proxy only; direct egress bypass is not kernel-blocked" + ) + .build() + ); + } + EffectiveNetworkEnforcementMode::ExternalEnforcer => 
{ + let endpoint = runtime_config.enforcer_endpoint.as_deref().ok_or_else(|| { + miette::miette!( + "external-enforcer mode requires {} to be set", + openshell_core::sandbox_env::ENFORCER_ENDPOINT + ) + })?; + let id = sandbox_id.as_deref().ok_or_else(|| { + miette::miette!("external-enforcer mode requires a sandbox id") + })?; + enforcer::register_workload(enforcer::WorkloadRegistration { + endpoint: endpoint.to_string(), + sandbox_id: id.to_string(), + sandbox_name: sandbox_name_for_agg.clone(), + pod_ip: runtime_config.pod_ip.clone(), + }) + .await?; + ocsf_emit!( + ConfigStateChangeBuilder::new(ocsf_ctx()) + .severity(SeverityId::Informational) + .status(StatusId::Success) + .state(StateId::Other, "external-enforcer") + .message("External network enforcer registration acknowledged") + .build() + ); + } + EffectiveNetworkEnforcementMode::SupervisorNetns => {} + } + } + // Create network namespace for proxy mode (Linux only) // This must be created before the proxy AND SSH server so that SSH // sessions can enter the namespace for network isolation. #[cfg(target_os = "linux")] - let netns = if matches!(policy.network.mode, NetworkMode::Proxy) { + let netns = if matches!(policy.network.mode, NetworkMode::Proxy) + && matches!( + effective_network_enforcement, + EffectiveNetworkEnforcementMode::SupervisorNetns + ) { match NetworkNamespace::create() { Ok(ns) => { // Install bypass detection rules (nftables log + reject). @@ -582,9 +717,9 @@ pub async fn run_sandbox( } Err(e) => { return Err(miette::miette!( - "Network namespace creation failed and proxy mode requires isolation. \ - Ensure CAP_NET_ADMIN and CAP_SYS_ADMIN are available and iproute2 is installed. \ - Error: {e}" + "Network namespace creation failed and supervisor-netns mode requires isolation. \ + Ensure CAP_NET_ADMIN and CAP_SYS_ADMIN are available and iproute2 is installed, \ + or select soft-proxy/external-enforcer mode. 
Error: {e}" )); } } @@ -714,15 +849,28 @@ pub async fn run_sandbox( let ssh_proxy_url = if matches!(policy.network.mode, NetworkMode::Proxy) { #[cfg(target_os = "linux")] { - netns.as_ref().map(|ns| { - let port = policy - .network - .proxy - .as_ref() - .and_then(|p| p.http_addr) - .map_or(3128, |addr| addr.port()); - format!("http://{}:{port}", ns.host_ip()) - }) + Some(netns.as_ref().map_or_else( + || { + policy + .network + .proxy + .as_ref() + .and_then(|p| p.http_addr) + .map_or_else( + || "http://127.0.0.1:3128".to_string(), + |addr| format!("http://{addr}"), + ) + }, + |ns| { + let port = policy + .network + .proxy + .as_ref() + .and_then(|p| p.http_addr) + .map_or(3128, |addr| addr.port()); + format!("http://{}:{port}", ns.host_ip()) + }, + )) } #[cfg(not(target_os = "linux"))] { @@ -2894,6 +3042,46 @@ mod tests { static ENV_LOCK: LazyLock> = LazyLock::new(|| Mutex::new(())); + #[test] + fn auto_network_mode_resolves_to_soft_for_workload_role() { + assert_eq!( + resolve_network_enforcement_mode( + SupervisorRole::Workload, + NetworkEnforcementMode::Auto, + ), + EffectiveNetworkEnforcementMode::SoftProxy + ); + } + + #[test] + fn auto_network_mode_preserves_current_combined_netns_default() { + assert_eq!( + resolve_network_enforcement_mode( + SupervisorRole::Combined, + NetworkEnforcementMode::Auto, + ), + EffectiveNetworkEnforcementMode::SupervisorNetns + ); + } + + #[test] + fn explicit_network_mode_overrides_role_resolution() { + assert_eq!( + resolve_network_enforcement_mode( + SupervisorRole::Workload, + NetworkEnforcementMode::SupervisorNetns, + ), + EffectiveNetworkEnforcementMode::SupervisorNetns + ); + assert_eq!( + resolve_network_enforcement_mode( + SupervisorRole::Combined, + NetworkEnforcementMode::ExternalEnforcer, + ), + EffectiveNetworkEnforcementMode::ExternalEnforcer + ); + } + #[test] fn bundle_to_resolved_routes_converts_all_fields() { let bundle = openshell_core::proto::GetInferenceBundleResponse { diff --git 
a/crates/openshell-sandbox/src/main.rs b/crates/openshell-sandbox/src/main.rs index 3c9e21578..9c102985f 100644 --- a/crates/openshell-sandbox/src/main.rs +++ b/crates/openshell-sandbox/src/main.rs @@ -15,7 +15,8 @@ use tracing_subscriber::EnvFilter; use tracing_subscriber::filter::LevelFilter; use tracing_subscriber::{Layer, layer::SubscriberExt, util::SubscriberInitExt}; -use openshell_sandbox::run_sandbox; +use openshell_core::sandbox_env::{NetworkEnforcementMode, SupervisorRole}; +use openshell_sandbox::{SupervisorRuntimeConfig, run_sandbox}; /// Subcommand name used to self-copy the supervisor binary into a shared volume. /// @@ -105,6 +106,34 @@ struct Args { /// Port for health check endpoint. #[arg(long, default_value = "8080")] health_port: u16, + + /// Runtime role for this supervisor process. + #[arg( + long, + env = openshell_core::sandbox_env::SUPERVISOR_ROLE, + default_value_t = SupervisorRole::Combined + )] + supervisor_role: SupervisorRole, + + /// Network enforcement mode for this supervisor process. + #[arg( + long, + env = openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE, + default_value_t = NetworkEnforcementMode::Auto + )] + network_enforcement_mode: NetworkEnforcementMode, + + /// Endpoint for a node/host enforcer when using external-enforcer mode. + #[arg(long, env = openshell_core::sandbox_env::ENFORCER_ENDPOINT)] + enforcer_endpoint: Option, + + /// Pod IP supplied by Kubernetes for external enforcer registration. + #[arg(long, env = openshell_core::sandbox_env::POD_IP)] + pod_ip: Option, + + /// Listen address used when this process runs as a node/host enforcer. 
+ #[arg(long, default_value = openshell_sandbox::enforcer::DEFAULT_LISTEN_ADDR)] + enforcer_listen_addr: std::net::SocketAddr, } /// Copy the running executable to `dest`, creating parent directories as @@ -287,6 +316,28 @@ fn main() -> Result<()> { vec!["/bin/bash".to_string()] }; + if matches!(args.supervisor_role, SupervisorRole::Enforcer) { + info!( + listen_addr = %args.enforcer_listen_addr, + network_enforcement_mode = %args.network_enforcement_mode, + "Starting OpenShell node enforcer" + ); + return openshell_sandbox::enforcer::run( + openshell_sandbox::enforcer::EnforcerRuntimeConfig { + listen_addr: args.enforcer_listen_addr, + network_enforcement_mode: args.network_enforcement_mode, + }, + ) + .await; + } + + let runtime_config = SupervisorRuntimeConfig { + role: args.supervisor_role, + network_enforcement_mode: args.network_enforcement_mode, + enforcer_endpoint: args.enforcer_endpoint, + pod_ip: args.pod_ip, + }; + info!(command = ?command, "Starting sandbox"); // Note: "Starting sandbox" stays as plain info!() since the OCSF context // is not yet initialized at this point (run_sandbox hasn't been called). 
@@ -294,6 +345,7 @@ fn main() -> Result<()> { run_sandbox( command, + runtime_config, args.workdir, args.timeout, args.interactive, diff --git a/crates/openshell-sandbox/src/process.rs b/crates/openshell-sandbox/src/process.rs index 76786a84d..455c02d99 100644 --- a/crates/openshell-sandbox/src/process.rs +++ b/crates/openshell-sandbox/src/process.rs @@ -242,8 +242,11 @@ impl ProcessHandle { for (key, value) in child_env::proxy_env_vars(&proxy_url) { cmd.env(key, value); } - } else if let Some(http_addr) = proxy.http_addr { - let proxy_url = format!("http://{http_addr}"); + } else { + let proxy_url = proxy.http_addr.map_or_else( + || "http://127.0.0.1:3128".to_string(), + |http_addr| format!("http://{http_addr}"), + ); for (key, value) in child_env::proxy_env_vars(&proxy_url) { cmd.env(key, value); } @@ -542,6 +545,10 @@ pub fn drop_privileges(policy: &SandboxPolicy) -> Result<()> { .ok_or_else(|| miette::miette!("Failed to resolve user primary group"))? }; + if nix::unistd::geteuid() == user.uid && nix::unistd::getegid() == group.gid { + return Ok(()); + } + if user_name.is_some() { let user_cstr = CString::new(user.name.clone()).map_err(|_| miette::miette!("Invalid user name"))?; diff --git a/crates/openshell-server/src/cli.rs b/crates/openshell-server/src/cli.rs index 748cec264..7f587cf41 100644 --- a/crates/openshell-server/src/cli.rs +++ b/crates/openshell-server/src/cli.rs @@ -1538,4 +1538,23 @@ default_image = "k8s-specific:1.0" .expect("deserializes"); assert_eq!(parsed.default_image, "k8s-specific:1.0"); } + + #[test] + fn kubernetes_driver_parses_privileged_from_driver_table() { + let file = config_file_from_toml( + r" +[openshell.drivers.kubernetes] +privileged = true +", + ); + let merged = crate::config_file::driver_table( + super::ComputeDriverKind::Kubernetes, + &file.openshell.gateway, + file.openshell.drivers.get("kubernetes"), + ); + let parsed = merged + .try_into::() + .expect("deserializes"); + assert!(parsed.privileged); + } } diff --git 
a/crates/openshell-server/src/compute/mod.rs b/crates/openshell-server/src/compute/mod.rs index 064eb3857..7b8248fe5 100644 --- a/crates/openshell-server/src/compute/mod.rs +++ b/crates/openshell-server/src/compute/mod.rs @@ -334,7 +334,7 @@ impl ComputeRuntime { ) -> Result { let driver = KubernetesComputeDriver::new(config) .await - .map_err(|err| ComputeError::Message(err.to_string()))?; + .map_err(ComputeError::from)?; let driver: SharedComputeDriver = Arc::new(ComputeDriverService::new(driver)); Self::from_driver( ComputeDriverKind::Kubernetes, diff --git a/deploy/docker/Dockerfile.supervisor b/deploy/docker/Dockerfile.supervisor index c84cc70e9..30ce456ab 100644 --- a/deploy/docker/Dockerfile.supervisor +++ b/deploy/docker/Dockerfile.supervisor @@ -5,10 +5,10 @@ # Supervisor image build. # -# The final image is `scratch`: it only carries the static `openshell-sandbox` -# binary used by Docker extraction, Podman image volumes, and the Kubernetes -# init container copy-self path. A static musl binary lets the image stay -# `scratch` while still being executable as an init container. +# The final image carries the static `openshell-sandbox` binary used by Docker +# extraction, Podman image volumes, and the Kubernetes init container copy-self +# path. It also includes nftables so the same image can run the Kubernetes +# node-enforcer DaemonSet and install pod network-namespace egress rules. # # The Rust binary is built natively before this image build runs and staged at: # deploy/docker/.build/prebuilt-binaries//openshell-sandbox @@ -19,10 +19,12 @@ # target) and uploads it as an artifact, which is downloaded into the same # staging directory before the image build job runs. -FROM scratch AS supervisor +FROM alpine:3.22 AS supervisor ARG TARGETARCH +RUN apk add --no-cache nftables + # --chmod=0550 drops world-execute and survives the actions/upload-artifact # + download-artifact roundtrip (which strips exec perms). 
Ownership is left # at root (0:0) deliberately: the Podman driver mounts this image as a diff --git a/deploy/helm/openshell/README.md b/deploy/helm/openshell/README.md index ab5b6eb45..c25dbb92d 100644 --- a/deploy/helm/openshell/README.md +++ b/deploy/helm/openshell/README.md @@ -151,6 +151,13 @@ JWT signing Secret. | imagePullSecrets | list | `[]` | Image pull secrets attached to gateway and helper pods. | | nameOverride | string | `"openshell"` | Override the chart name used in generated resource names. | | networkPolicy.enabled | bool | `true` | Create a NetworkPolicy restricting SSH ingress on sandbox pods to the gateway. | +| nodeEnforcer.affinity | object | `{}` | Affinity rules for the node enforcer DaemonSet. | +| nodeEnforcer.enabled | bool | `false` | Deploy the privileged node enforcer DaemonSet. | +| nodeEnforcer.listenAddress | string | `"0.0.0.0:17671"` | Listen address for the node enforcer registration endpoint. | +| nodeEnforcer.logLevel | string | `"info"` | Node enforcer log level. | +| nodeEnforcer.nodeSelector | object | `{}` | Node selector for the node enforcer DaemonSet. | +| nodeEnforcer.resources | object | `{}` | Node enforcer pod resource requests and limits. | +| nodeEnforcer.tolerations | list | `[]` | Tolerations for the node enforcer DaemonSet. | | nodeSelector | object | `{}` | Node selector for the gateway pod. | | pkiInitJob.enabled | bool | `true` | Run a pre-install/pre-upgrade Job that creates gateway and client mTLS Secrets. When certManager.enabled=true, cert-manager owns TLS and this same hook runs in JWT-only mode even if pkiInitJob.enabled remains true. | | pkiInitJob.serverDnsNames | list | `[]` | Extra DNS SANs to append to the server certificate. | @@ -188,14 +195,16 @@ JWT signing Secret. | server.appArmorProfile | string | `"Unconfined"` | Kubernetes AppArmor profile requested for sandbox agent containers. 
Default Unconfined avoids runtime/default AppArmor blocking the supervisor's network namespace mount setup on AppArmor-enabled nodes. Set to "" to omit the field, "RuntimeDefault" to force the runtime default profile, or "Localhost/profile-name" for an operator-managed localhost profile. | | server.auth.allowUnauthenticatedUsers | bool | `false` | UNSAFE: accept unauthenticated CLI/user requests as a local developer principal. Intended only for trusted local Skaffold/k3d development or a fully trusted fronting proxy. Leave false for shared or production clusters. | | server.dbUrl | string | `"sqlite:/var/openshell/openshell.db"` | Gateway database URL (used for the default SQLite backend). | -| server.defaultRuntimeClassName | string | `""` | Default Kubernetes runtimeClassName for sandbox pods. Applied when a CreateSandbox request does not specify one. Empty (default) = omit the field, using the cluster's default RuntimeClass. Set to a RuntimeClass name (e.g. "kata-containers", "nvidia") to apply it to all sandboxes that don't explicitly override it. | +| server.defaultRuntimeClassName | string | `""` | Default Kubernetes runtimeClassName for sandbox pods. Applied when a CreateSandbox request does not specify one. Empty (default) = omit the field, using the cluster's default RuntimeClass. Set to a RuntimeClass name (e.g. "gvisor", "kata-containers", "nvidia") to apply it to all sandboxes that don't explicitly override it. The gateway validates this RuntimeClass during startup when configured. | | server.disableTls | bool | `false` | Disable TLS entirely - the server listens on plaintext HTTP. Set to true when a reverse proxy / tunnel terminates TLS at the edge. | | server.enableLoopbackServiceHttp | bool | `true` | Enable plaintext HTTP routing for loopback sandbox service URLs on TLS-enabled gateways. | | server.enableUserNamespaces | bool | `false` | Enable Kubernetes user namespace isolation (hostUsers: false) for sandbox pods. 
Requires Kubernetes 1.33+ with user namespace support available (beta through 1.35, GA in 1.36+), plus a supporting container runtime and Linux 5.12+. When enabled, container UID 0 maps to an unprivileged host UID and capabilities become namespaced. | +| server.enforcerEndpoint | string | `""` | Endpoint template for external-enforcer mode. Empty defaults each sandbox to http://$(OPENSHELL_NODE_IP):17671. | | server.externalDbSecret | string | `""` | Name of a pre-existing Opaque Secret containing a PostgreSQL connection URI (key: uri). When set, the gateway reads OPENSHELL_DB_URL from this Secret instead of using dbUrl. The Secret must contain a `uri` key, e.g. postgresql://user:pass@host:5432/dbname. | | server.grpcEndpoint | string | `""` | gRPC endpoint sandboxes call back into the gateway. Leave empty to derive it from the chart fullname, release namespace, service port, and disableTls flag, for example https://openshell.openshell.svc.cluster.local:8080. Override only when sandboxes must reach the gateway via a different hostname (e.g. an external ingress or a host alias). | | server.hostGatewayIP | string | `""` | Host gateway IP for sandbox pod hostAliases. When set, sandbox pods get hostAliases entries mapping host.docker.internal and host.openshell.internal to this IP, allowing them to reach services running on the Docker host. Auto-detected by the cluster entrypoint script. | | server.logLevel | string | `"info"` | Gateway log level. | +| server.networkEnforcementMode | string | `"soft-proxy"` | Network enforcement mode for sandbox supervisors. soft-proxy keeps pods unprivileged and enforces proxy-aware traffic; direct socket bypass is not kernel-blocked. Use supervisor-netns only with trusted privileged sandbox pods. external-enforcer registers with the optional nodeEnforcer DaemonSet for coarse pod-netns egress blocking while dynamic policy stays in the proxy. | | server.oidc.adminRole | string | `""` | Role name for admin access. 
Leave empty (with userRole also empty) for authentication-only mode. Both must be set or both empty. | | server.oidc.audience | string | `"openshell-cli"` | Expected audience claim for the API resource server. This should match the server's --oidc-audience, NOT the CLI client ID. | | server.oidc.caConfigMapName | string | `""` | Name of a ConfigMap containing a CA certificate bundle (key: ca.crt) for verifying the OIDC issuer's TLS certificate. Required when the issuer uses a non-public CA (e.g. OpenShift ingress, private PKI). | @@ -213,6 +222,8 @@ JWT signing Secret. | server.sandboxJwt.signingSecretName | string | `""` | Name of the Opaque Secret holding the signing key material. Empty falls back to the chart fullname with "-jwt-keys" appended. | | server.sandboxJwt.ttlSecs | int | `3600` | Token TTL in seconds. Defaults to 3600 (1h). | | server.sandboxNamespace | string | `""` | Namespace where sandbox pods are created. Defaults to the Helm release namespace (.Release.Namespace) when left empty. | +| server.sandboxPrivileged | bool | `false` | Set securityContext.privileged on all sandbox pod containers. This is a short-term compatibility escape hatch for trusted clusters that require privileged pod admission and weakens the container boundary. | +| server.supervisorRole | string | `"workload"` | Runtime role passed to the sandbox supervisor. Kubernetes defaults to workload mode so the injected binary owns agent lifecycle without assuming it can perform host/node-level enforcement. | | server.tls.certSecretName | string | `"openshell-server-tls"` | K8s secret (type kubernetes.io/tls) with tls.crt and tls.key for the server. | | server.tls.clientCaSecretName | string | `"openshell-server-client-ca"` | K8s secret with ca.crt for client certificate verification (mTLS). Set to "" to disable mTLS and run HTTPS-only (use OIDC for auth instead). 
| | server.tls.clientTlsSecretName | string | `"openshell-client-tls"` | K8s secret mounted into sandbox pods for mTLS to the server. | diff --git a/deploy/helm/openshell/ci/values-node-enforcer.yaml b/deploy/helm/openshell/ci/values-node-enforcer.yaml new file mode 100644 index 000000000..c7e108946 --- /dev/null +++ b/deploy/helm/openshell/ci/values-node-enforcer.yaml @@ -0,0 +1,12 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# CI/dev overlay for exercising the Kubernetes node-enforcer topology. +# Sandbox pods stay unprivileged in workload supervisor mode, while the +# privileged node enforcer DaemonSet owns coarse pod-network-namespace egress +# enforcement. +nodeEnforcer: + enabled: true + +server: + networkEnforcementMode: external-enforcer diff --git a/deploy/helm/openshell/skaffold.yaml b/deploy/helm/openshell/skaffold.yaml index 9a056238a..022fead8c 100644 --- a/deploy/helm/openshell/skaffold.yaml +++ b/deploy/helm/openshell/skaffold.yaml @@ -97,6 +97,8 @@ deploy: #- ci/values-keycloak.yaml # To enable the Gateway API HTTPRoute (requires Envoy Gateway above): #- ci/values-gateway.yaml + # To test the external node-enforcer topology: + #- ci/values-node-enforcer.yaml # To test HA gateway behavior with bundled PostgreSQL: #- ci/values-high-availability.yaml setValueTemplates: diff --git a/deploy/helm/openshell/templates/_helpers.tpl b/deploy/helm/openshell/templates/_helpers.tpl index a8e7ac721..b8cc49a10 100644 --- a/deploy/helm/openshell/templates/_helpers.tpl +++ b/deploy/helm/openshell/templates/_helpers.tpl @@ -48,6 +48,30 @@ app.kubernetes.io/name: {{ include "openshell.name" . }} app.kubernetes.io/instance: {{ .Release.Name }} {{- end }} +{{/* +Selector labels for the node enforcer DaemonSet. + +The node enforcer must not share the gateway Service selector labels because it +does not expose the gateway's named ports. 
+*/}} +{{- define "openshell.nodeEnforcerName" -}} +{{- printf "%s-node-enforcer" (include "openshell.name" .) | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{- define "openshell.nodeEnforcerSelectorLabels" -}} +app.kubernetes.io/name: {{ include "openshell.nodeEnforcerName" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{- define "openshell.nodeEnforcerLabels" -}} +helm.sh/chart: {{ include "openshell.chart" . }} +{{ include "openshell.nodeEnforcerSelectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + {{/* Create the name of the service account to use */}} diff --git a/deploy/helm/openshell/templates/clusterrole.yaml b/deploy/helm/openshell/templates/clusterrole.yaml index 30a192fc3..d896ee99b 100644 --- a/deploy/helm/openshell/templates/clusterrole.yaml +++ b/deploy/helm/openshell/templates/clusterrole.yaml @@ -24,3 +24,12 @@ rules: - get - list - watch + # Startup preflight validates the configured RuntimeClass before accepting + # sandbox work so missing classes fail at gateway startup instead of later + # as pod sandbox creation errors. + - apiGroups: + - node.k8s.io + resources: + - runtimeclasses + verbs: + - get diff --git a/deploy/helm/openshell/templates/gateway-config.yaml b/deploy/helm/openshell/templates/gateway-config.yaml index f46547c3f..d24b06fe0 100644 --- a/deploy/helm/openshell/templates/gateway-config.yaml +++ b/deploy/helm/openshell/templates/gateway-config.yaml @@ -100,6 +100,14 @@ data: [openshell.drivers.kubernetes] grpc_endpoint = {{ include "openshell.grpcEndpoint" . 
| quote }} + supervisor_role = {{ .Values.server.supervisorRole | quote }} + network_enforcement_mode = {{ .Values.server.networkEnforcementMode | quote }} + {{- if .Values.server.enforcerEndpoint }} + enforcer_endpoint = {{ .Values.server.enforcerEndpoint | quote }} + {{- end }} + {{- if .Values.server.sandboxPrivileged }} + privileged = true + {{- end }} service_account_name = {{ include "openshell.sandboxServiceAccountName" . | quote }} supervisor_sideload_method = {{ include "openshell.supervisorSideloadMethod" . | quote }} sa_token_ttl_secs = {{ .Values.server.sandboxJwt.k8sSaTokenTtlSecs | default 3600 }} diff --git a/deploy/helm/openshell/templates/node-enforcer.yaml b/deploy/helm/openshell/templates/node-enforcer.yaml new file mode 100644 index 000000000..3cbf922d5 --- /dev/null +++ b/deploy/helm/openshell/templates/node-enforcer.yaml @@ -0,0 +1,63 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +{{- if .Values.nodeEnforcer.enabled }} +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: {{ include "openshell.fullname" . }}-node-enforcer + labels: + {{- include "openshell.labels" . | nindent 4 }} + app.kubernetes.io/component: node-enforcer +spec: + selector: + matchLabels: + {{- include "openshell.nodeEnforcerSelectorLabels" . | nindent 6 }} + app.kubernetes.io/component: node-enforcer + template: + metadata: + labels: + {{- include "openshell.nodeEnforcerLabels" . | nindent 8 }} + app.kubernetes.io/component: node-enforcer + spec: + hostNetwork: true + hostPID: true + dnsPolicy: ClusterFirstWithHostNet + serviceAccountName: {{ include "openshell.serviceAccountName" . }} + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + containers: + - name: node-enforcer + image: {{ include "openshell.supervisorImage" . 
| quote }} + imagePullPolicy: {{ .Values.supervisor.image.pullPolicy | default .Values.image.pullPolicy }} + args: + - --supervisor-role + - enforcer + - --network-enforcement-mode + - external-enforcer + - --enforcer-listen-addr + - {{ .Values.nodeEnforcer.listenAddress | quote }} + - --log-level + - {{ .Values.nodeEnforcer.logLevel | quote }} + securityContext: + privileged: true + allowPrivilegeEscalation: true + {{- with .Values.nodeEnforcer.resources }} + resources: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeEnforcer.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeEnforcer.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeEnforcer.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} +{{- end }} diff --git a/deploy/helm/openshell/tests/gateway_config_test.yaml b/deploy/helm/openshell/tests/gateway_config_test.yaml index 6b14fe12a..81a92a431 100644 --- a/deploy/helm/openshell/tests/gateway_config_test.yaml +++ b/deploy/helm/openshell/tests/gateway_config_test.yaml @@ -96,6 +96,64 @@ tests: path: data["gateway.toml"] pattern: 'image_pull_secrets\s*=' + - it: omits default_runtime_class_name by default + template: templates/gateway-config.yaml + asserts: + - notMatchRegex: + path: data["gateway.toml"] + pattern: 'default_runtime_class_name\s*=' + + - it: renders the configured default RuntimeClass under [openshell.drivers.kubernetes] + template: templates/gateway-config.yaml + set: + server.defaultRuntimeClassName: gvisor + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?default_runtime_class_name\s*=\s*"gvisor"' + + - it: renders default supervisor workload soft-proxy mode + template: templates/gateway-config.yaml + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?supervisor_role\s*=\s*"workload"' + - matchRegex: + 
path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?network_enforcement_mode\s*=\s*"soft-proxy"' + + - it: renders external enforcer endpoint when configured + template: templates/gateway-config.yaml + set: + server.networkEnforcementMode: external-enforcer + server.enforcerEndpoint: http://$(OPENSHELL_NODE_IP):17671 + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?network_enforcement_mode\s*=\s*"external-enforcer"' + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?enforcer_endpoint\s*=\s*"http://\$\(OPENSHELL_NODE_IP\):17671"' + + - it: omits privileged sandbox pods by default + template: templates/gateway-config.yaml + asserts: + - notMatchRegex: + path: data["gateway.toml"] + pattern: 'privileged\s*=' + + - it: renders privileged sandbox pods under [openshell.drivers.kubernetes] + template: templates/gateway-config.yaml + set: + server.sandboxPrivileged: true + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?privileged\s*=\s*true' + - notMatchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.gateway\][^\[]*?privileged\s*=' + - it: does not render local mTLS user auth for Kubernetes deployments template: templates/gateway-config.yaml asserts: diff --git a/deploy/helm/openshell/tests/node_enforcer_test.yaml b/deploy/helm/openshell/tests/node_enforcer_test.yaml new file mode 100644 index 000000000..364d2299b --- /dev/null +++ b/deploy/helm/openshell/tests/node_enforcer_test.yaml @@ -0,0 +1,33 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +suite: node enforcer +templates: + - templates/node-enforcer.yaml +tests: + - it: does not render by default + asserts: + - hasDocuments: + count: 0 + + - it: renders privileged node enforcer daemonset when enabled + set: + nodeEnforcer.enabled: true + asserts: + - isKind: + of: DaemonSet + - equal: + path: spec.template.spec.hostNetwork + value: true + - equal: + path: spec.template.spec.hostPID + value: true + - equal: + path: spec.template.spec.containers[0].securityContext.privileged + value: true + - equal: + path: spec.template.metadata.labels["app.kubernetes.io/name"] + value: openshell-node-enforcer + - contains: + path: spec.template.spec.containers[0].args + content: external-enforcer diff --git a/deploy/helm/openshell/tests/runtime_class_rbac_test.yaml b/deploy/helm/openshell/tests/runtime_class_rbac_test.yaml new file mode 100644 index 000000000..556e46d49 --- /dev/null +++ b/deploy/helm/openshell/tests/runtime_class_rbac_test.yaml @@ -0,0 +1,22 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +suite: runtime class RBAC +templates: + - templates/clusterrole.yaml +release: + name: openshell + namespace: my-namespace + +tests: + - it: allows the gateway to preflight configured RuntimeClasses + asserts: + - equal: + path: rules[2].apiGroups[0] + value: node.k8s.io + - equal: + path: rules[2].resources[0] + value: runtimeclasses + - equal: + path: rules[2].verbs[0] + value: get diff --git a/deploy/helm/openshell/values.yaml b/deploy/helm/openshell/values.yaml index f0cd43c73..842adfb92 100644 --- a/deploy/helm/openshell/values.yaml +++ b/deploy/helm/openshell/values.yaml @@ -122,6 +122,25 @@ probes: # -- Gateway pod resource requests and limits. resources: {} +# Optional node-side enforcer topology. 
This runs the supervisor image as a +# privileged DaemonSet so sandbox pods can stay unprivileged while a node +# component owns coarse pod-network-namespace egress enforcement. +nodeEnforcer: + # -- Deploy the privileged node enforcer DaemonSet. + enabled: false + # -- Listen address for the node enforcer registration endpoint. + listenAddress: "0.0.0.0:17671" + # -- Node enforcer log level. + logLevel: info + # -- Node enforcer pod resource requests and limits. + resources: {} + # -- Node selector for the node enforcer DaemonSet. + nodeSelector: {} + # -- Tolerations for the node enforcer DaemonSet. + tolerations: [] + # -- Affinity rules for the node enforcer DaemonSet. + affinity: {} + # -- Node selector for the gateway pod. nodeSelector: {} @@ -161,8 +180,9 @@ server: # -- Default Kubernetes runtimeClassName for sandbox pods. # Applied when a CreateSandbox request does not specify one. # Empty (default) = omit the field, using the cluster's default RuntimeClass. - # Set to a RuntimeClass name (e.g. "kata-containers", "nvidia") to apply it - # to all sandboxes that don't explicitly override it. + # Set to a RuntimeClass name (e.g. "gvisor", "kata-containers", "nvidia") to + # apply it to all sandboxes that don't explicitly override it. The gateway + # validates this RuntimeClass during startup when configured. defaultRuntimeClassName: "" # -- gRPC endpoint sandboxes call back into the gateway. Leave empty to derive # it from the chart fullname, release namespace, service port, and @@ -189,6 +209,23 @@ server: # the field, "RuntimeDefault" to force the runtime default profile, or # "Localhost/profile-name" for an operator-managed localhost profile. appArmorProfile: "Unconfined" + # -- Runtime role passed to the sandbox supervisor. Kubernetes defaults to + # workload mode so the injected binary owns agent lifecycle without assuming + # it can perform host/node-level enforcement. + supervisorRole: workload + # -- Network enforcement mode for sandbox supervisors. 
soft-proxy keeps pods + # unprivileged and enforces proxy-aware traffic; direct socket bypass is not + # kernel-blocked. Use supervisor-netns only with trusted privileged sandbox + # pods. external-enforcer registers with the optional nodeEnforcer DaemonSet + # for coarse pod-netns egress blocking while dynamic policy stays in the proxy. + networkEnforcementMode: soft-proxy + # -- Endpoint template for external-enforcer mode. Empty defaults each + # sandbox to http://$(OPENSHELL_NODE_IP):17671. + enforcerEndpoint: "" + # -- Set securityContext.privileged on all sandbox pod containers. This is a + # short-term compatibility escape hatch for trusted clusters that require + # privileged pod admission and weakens the container boundary. + sandboxPrivileged: false # -- Disable TLS entirely - the server listens on plaintext HTTP. # Set to true when a reverse proxy / tunnel terminates TLS at the edge. disableTls: false diff --git a/docs/kubernetes/setup.mdx b/docs/kubernetes/setup.mdx index b21321143..a7671290a 100644 --- a/docs/kubernetes/setup.mdx +++ b/docs/kubernetes/setup.mdx @@ -139,6 +139,7 @@ The most commonly changed values are: | `server.sandboxNamespace` | Namespace where sandbox pods are created. Defaults to the Helm release namespace when left empty. | | `server.sandboxImage` | Default sandbox image used when a sandbox does not specify one. | | `server.sandboxImagePullSecrets` | Image pull secrets attached to sandbox pods. Referenced Secrets must exist in the sandbox namespace. | +| `server.defaultRuntimeClassName` | Default Kubernetes RuntimeClass for sandbox pods, such as `gvisor` or a Kata RuntimeClass name. Leave empty to use the cluster default. The gateway validates this RuntimeClass at startup when configured. | | `server.grpcEndpoint` | Endpoint that sandbox supervisors use to call back to the gateway. Must be reachable from inside the cluster. | | `server.appArmorProfile` | AppArmor profile requested for sandbox agent containers. 
Defaults to `Unconfined`. | | `server.disableTls` | Run the gateway over plaintext HTTP. Use only behind a trusted transport. | @@ -206,6 +207,7 @@ The ClusterRole grants node inspection and token validation: |---|---|---| | `authentication.k8s.io` | `tokenreviews` | create | | `""` | `nodes` | get, list, watch | +| `node.k8s.io` | `runtimeclasses` | get | To use an existing ServiceAccount instead of creating one, set `serviceAccount.create=false` and supply its name: diff --git a/docs/reference/gateway-config.mdx b/docs/reference/gateway-config.mdx index c70d8acbd..50c051d87 100644 --- a/docs/reference/gateway-config.mdx +++ b/docs/reference/gateway-config.mdx @@ -169,6 +169,13 @@ image_pull_policy = "IfNotPresent" image_pull_secrets = ["regcred"] supervisor_image = "ghcr.io/nvidia/openshell/supervisor:latest" supervisor_image_pull_policy = "IfNotPresent" +supervisor_role = "workload" +network_enforcement_mode = "soft-proxy" +# Optional for network_enforcement_mode = "external-enforcer". +enforcer_endpoint = "http://$(OPENSHELL_NODE_IP):17671" +# Short-term compatibility escape hatch for trusted clusters that require +# privileged sandbox pod admission. +privileged = false # Use the image volume on Kubernetes >= 1.35 (GA in 1.36); switch to "init-container" # on older clusters or where the ImageVolume feature gate is off. supervisor_sideload_method = "image-volume" @@ -181,11 +188,27 @@ app_armor_profile = "Unconfined" workspace_default_storage_size = "10Gi" # Kubernetes RuntimeClass applied to sandbox pods when the API request does # not specify one. Empty (default) = omit the field, using the cluster default. -# default_runtime_class_name = "kata-containers" +default_runtime_class_name = "gvisor" # Kubelet clamps projected tokens below 600 seconds. The driver caps values at 86400. sa_token_ttl_secs = 3600 ``` +When `default_runtime_class_name` is set, the Kubernetes driver validates that the +RuntimeClass exists during startup. 
Missing RuntimeClasses fail the gateway +early instead of surfacing later as pod sandbox creation errors. Per-sandbox +template RuntimeClass overrides are validated during sandbox admission/create. + +`privileged = true` sets `securityContext.privileged` on every Kubernetes +sandbox pod container. Use it only as a short-term compatibility escape hatch +for trusted clusters that require privileged pod admission. + +`network_enforcement_mode = "soft-proxy"` lets Kubernetes sandbox pods run +without supervisor-managed netns setup. The proxy still enforces +proxy-aware traffic and hot-reloads `network_policies`, but direct socket egress +is not kernel-blocked in this mode. Use `supervisor-netns` for the current hard +netns/veth/nft path, or `external-enforcer` to use the node-enforcer topology +for coarse pod-netns egress blocking while dynamic policy stays in the proxy. + ### Docker Sandboxes run as containers on a local bridge network. The supervisor binary is bind-mounted from the host (no in-cluster image pull required); guest mTLS material is supplied as host paths. diff --git a/docs/reference/sandbox-compute-drivers.mdx b/docs/reference/sandbox-compute-drivers.mdx index 229bb1bdb..be2e1c989 100644 --- a/docs/reference/sandbox-compute-drivers.mdx +++ b/docs/reference/sandbox-compute-drivers.mdx @@ -145,6 +145,11 @@ For maintainer-level implementation details, refer to the [Kubernetes driver REA | `default_image` | `server.sandboxImage` | Set the default sandbox image. | | `image_pull_policy` | `server.sandboxImagePullPolicy` | Set the Kubernetes image pull policy for sandbox pods. | | `image_pull_secrets` | `server.sandboxImagePullSecrets` | Attach Kubernetes image pull secrets to sandbox pods. Referenced Secrets must exist in the sandbox namespace. | +| `default_runtime_class_name` | `server.defaultRuntimeClassName` | Set the default Kubernetes RuntimeClass for sandbox pods, such as `gvisor` for GKE Sandbox or a Kata RuntimeClass name. 
The gateway validates this class at startup when configured. Per-sandbox template runtime classes override this default and are validated during sandbox admission/create. | +| `supervisor_role` | `server.supervisorRole` | Select the supervisor topology role for sandbox pods. Kubernetes defaults to `workload`; local non-Kubernetes supervisors keep `combined` by default. | +| `network_enforcement_mode` | `server.networkEnforcementMode` | Select sandbox network posture. `soft-proxy` injects proxy env vars and enforces cooperative proxy traffic without kernel-blocking direct sockets. `supervisor-netns` uses the existing supervisor-managed netns/veth/nft path and requires the needed Linux capabilities. `external-enforcer` registers with a node/host enforcer that installs coarse pod-netns egress rules while dynamic policy stays in the proxy. | +| `enforcer_endpoint` | `server.enforcerEndpoint` | Optional endpoint template for `external-enforcer` mode. Empty defaults sandbox pods to `http://$(OPENSHELL_NODE_IP):17671`. | +| `privileged` | `server.sandboxPrivileged` | Set `securityContext.privileged` on all Kubernetes sandbox pod containers. This is a short-term compatibility escape hatch for trusted clusters that require privileged pod admission and reduces the container boundary. | | `grpc_endpoint` | `server.grpcEndpoint` | Set the gateway callback endpoint reachable from sandbox pods. | | `client_tls_secret_name` | `server.tls.clientTlsSecretName` | Mount sandbox client TLS materials from a Kubernetes secret. | | `supervisor_image` | `supervisor.image.repository` / `supervisor.image.tag` | Set the supervisor image that provides the `openshell-sandbox` binary. | @@ -156,6 +161,13 @@ For maintainer-level implementation details, refer to the [Kubernetes driver REA The Kubernetes driver creates namespaced `agents.x-k8s.io/v1alpha1` `Sandbox` resources from the Kubernetes SIG Apps [agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) project. 
The Agent Sandbox controller turns those resources into sandbox pods and related storage. +`nodeEnforcer.enabled=true` deploys an optional privileged DaemonSet that runs +the supervisor image in `enforcer` mode. This topology is experimental: the +current enforcer accepts workload registrations and installs coarse nftables +egress rules inside the registered pod network namespace. Production hardening +still needs authenticated registration, pod ownership checks, cleanup and +reconciliation behavior, and shared-cluster deployment rules. + `Sandbox.spec.volumeClaimTemplates` is immutable after creation. To change storage configuration, delete the sandbox and create a new one with the updated spec. diff --git a/docs/security/best-practices.mdx b/docs/security/best-practices.mdx index 25e440f5b..9ca1ee98f 100644 --- a/docs/security/best-practices.mdx +++ b/docs/security/best-practices.mdx @@ -77,6 +77,18 @@ This provides defense-in-depth: even if a container escape vulnerability exists, | Risk if enabled with GPU | NVIDIA device plugin compatibility with user namespaces is unverified. OpenShell logs a warning when both GPU and user namespaces are active on the same sandbox. | | Recommendation | Enable on non-GPU clusters running Kubernetes with user namespace support available (1.33+ beta, 1.36+ GA) for stronger host isolation. Test GPU workloads separately before enabling on GPU clusters. | +### Privileged Kubernetes Pods + +Kubernetes gateway deployments can set `securityContext.privileged` on every sandbox pod container for clusters that require privileged pod admission. +This is a short-term compatibility escape hatch and weakens the container boundary. + +| Aspect | Detail | +|---|---| +| Default | Disabled. | +| What you can change | Enable deployment-wide through Helm with `server.sandboxPrivileged: true` or in gateway config with `[openshell.drivers.kubernetes] privileged = true`. 
| +| Risk if enabled | The container receives privileged Linux capabilities and broader host-facing access. Treat the cluster and workload as trusted. | +| Recommendation | Leave disabled unless a trusted Kubernetes deployment requires the `supervisor-netns` enforcement mode. The Kubernetes default is `soft-proxy`, which avoids privileged pod admission but does not kernel-block direct socket egress. Prefer `external-enforcer` when a cluster can run the privileged node enforcer DaemonSet and sandbox pods should remain unprivileged. | + ### Binary Identity Binding The proxy identifies which binary initiated each connection by reading `/proc//exe` (the kernel-trusted executable path). diff --git a/spike/README.md b/spike/README.md new file mode 100644 index 000000000..c2a1ba9c8 --- /dev/null +++ b/spike/README.md @@ -0,0 +1,385 @@ + + +# Node Enforcer Topology Findings + +Date: 2026-05-29 + +## Diagrams + +- [OpenShell isolation architecture](openshell-isolation-diagram.html) +- [Kubernetes isolation tradeoffs](kubernetes-isolation-tradeoffs-diagram.html) +- [Kubernetes node-enforcer option](kubernetes-node-enforcer-options-diagram.html) + +## Summary + +This branch explored how OpenShell can remove `privileged: true` from sandbox +workload pods while preserving the current network proxy value proposition and +keeping the supervisor responsible for agent lifecycle. + +The core finding is that Linux still needs an elevated component to modify +network enforcement for an already-running sandbox. The viable short-term shape +is to move that authority out of the agent-controlled sandbox container and into +an operator-controlled node component. + +That does not eliminate privilege from the system. It moves privilege to a +smaller, auditable infrastructure component. 
+ +## Problem + +The existing Kubernetes sandbox topology needs Linux permissions that are +problematic for managed clusters and hardened runtimes: + +- Sandbox containers needed elevated network permissions to install bypass + prevention rules. +- Managed runtime environments can reject privileged sandbox pods with: + `Privileged=true is not supported`. +- A per-sandbox `--privileged` CLI flag was the wrong product shape because this + is gateway/operator configuration, not per-sandbox user configuration. +- Runtime-specific isolation layers can still be useful future hardening, but + this spike should not rely on cluster-specific runtime availability. +- Kubernetes `NetworkPolicy` and most CNI-level controls are too static for our + current need: OpenShell must be able to update enforcement for a running + sandbox as policy changes. + +## Prototype Topology + +The implemented prototype splits the supervisor into runtime roles: + +- `combined`: current local-style behavior where one supervisor owns lifecycle + and hard controls. +- `workload`: runs inside the sandbox pod and owns policy loading, proxying, + SSH, logs, and agent lifecycle. +- `enforcer`: runs as a privileged node-side component. + +The Kubernetes topology uses: + +- A workload supervisor inside each sandbox pod. +- A privileged `node-enforcer` DaemonSet, one pod per node. +- A node-local registration flow from workload supervisor to node enforcer. +- Host-side nftables installation into the sandbox pod network namespace. + +The workload supervisor still owns the agent lifecycle. The node enforcer does +not run agent code, does not need provider credentials, and does not need access +to sandbox files. + +## Enforcement Model + +The node enforcer currently installs a coarse nftables table in the sandbox +pod's network namespace: + +- Allow loopback so sandbox processes can reach the local OpenShell proxy. +- Allow established and related connections. 
+- Allow UID 0 traffic so the supervisor-owned proxy can reach upstreams. +- Reject non-root TCP and UDP egress so sandbox-user traffic must go through the + proxy, where OPA and L7 policy are enforced. + +This restores the observable behavior required by the existing bypass-detection +e2e test: direct raw sockets from sandbox user code fail fast with +`ECONNREFUSED` instead of hanging. + +## Control Flow + +The current prototype enforcement flow is registration-driven: + +1. The gateway creates sandbox pods with `supervisor_role = "workload"` and + `network_enforcement_mode = "external-enforcer"`. + +2. The Kubernetes driver injects node-local routing data into each sandbox pod: + `OPENSHELL_NODE_IP` from `status.hostIP`, `OPENSHELL_POD_IP` from + `status.podIP`, and `OPENSHELL_ENFORCER_ENDPOINT`. If no custom endpoint is + configured, the endpoint defaults to `http://$(OPENSHELL_NODE_IP):17671`. + +3. The node enforcer DaemonSet runs the same supervisor binary with + `supervisor_role = "enforcer"`. In that role it does not start an agent + supervisor; it binds the configured host-network listener and waits for + workload registrations. + +4. The workload supervisor starts normally, loads policy, computes the effective + network enforcement mode, and registers with the node enforcer before + spawning the agent process. The registration is an HTTP `POST` to + `/v1/sandboxes/{sandbox_id}/register` with: + + ```json + { + "sandbox_id": "sb-...", + "sandbox_name": "optional-name", + "pod_ip": "10.x.y.z", + "protocol": "openshell-node-enforcer-prototype-v1" + } + ``` + +5. The node enforcer accepts the registration, chooses the target pod IP from + the payload, and falls back to the peer address only when the peer is + non-loopback. Loopback registrations without a pod IP are accepted but do not + install host-side enforcement. + +6. 
On Linux, the enforcer finds the target pod network namespace by scanning + `/proc//net/fib_trie` for IPv4 or `/proc//net/if_inet6` for IPv6. + For IPv4 it requires the pod IP to appear as `/32 host LOCAL`, which avoids + selecting the host namespace where the pod IP may exist only as a routed + `UNICAST` entry. + +7. The enforcer opens `/proc//ns/net`, forks `nft` from a trusted absolute + path, calls `setns(CLONE_NEWNET)` in the child process, deletes any prior + OpenShell table, and loads the generated `openshell_external_enforcer` + ruleset into the pod namespace. + +8. Once the table is installed, non-root TCP and UDP egress from the sandbox + namespace is rejected, loopback remains available, established flows remain + available, and UID 0 traffic remains available for the supervisor-owned + proxy. User code must therefore reach the network through the OpenShell proxy + path where policy is enforced. + +The workload supervisor remains in charge of policy loading, proxying, SSH, +settings reloads, logs, and agent lifecycle. The node enforcer owns only +host-side namespace lookup and coarse kernel egress enforcement. + +Current limitations: + +- Reconciliation is registration-triggered, not yet watch-based. +- Re-registration is idempotent because the enforcer deletes and recreates its + owned nftables table. +- Cleanup for deleted pods, restarted pods, and stale namespace state remains a + hardening item. +- Registration currently uses prototype HTTP and must be replaced or wrapped + with strong workload identity before production use. + +## Multiple Gateway Behavior + +The topology can support multiple gateways scheduling sandboxes into the same +cluster because enforcement is node-local and pod-local, not gateway-local: + +- Each sandbox pod resolves its own `OPENSHELL_NODE_IP` and `OPENSHELL_POD_IP` + from Kubernetes downward API fields. +- Each workload supervisor registers with the node enforcer on the node where + its sandbox pod was scheduled. 
+- The node enforcer installs nftables state into the registered pod's network + namespace. The table name can be the same across sandboxes because each pod + has its own network namespace. +- `sandbox_id` is currently used for logs and nft log prefixes; isolation comes + from the selected pod network namespace, not from a gateway-specific table. + +The important deployment constraint is that the node enforcer should be treated +as shared node infrastructure. A node can only have one process binding the +same host-network listener. If multiple gateway releases each enable their own +node-enforcer DaemonSet on the same nodes with the same listen address, they +will compete for the same port and duplicate the privileged component. + +For a shared cluster, the expected shape is one node-enforcer DaemonSet per +node pool or cluster security boundary, with all participating gateways pointing +their sandbox workloads at that node-local endpoint. If separate gateway +installations need isolated enforcers, they must use separate node pools, node +selectors, or distinct listener ports and matching `OPENSHELL_ENFORCER_ENDPOINT` +templates. + +This also raises the bar for the hardening work. Multi-gateway support needs +registration authorization that understands allowed gateway identities, +namespaces, and sandbox pod labels. The enforcer must verify that a registered +pod IP belongs to an OpenShell-owned sandbox pod on the same node and that the +registering workload is allowed to ask for enforcement on that pod. + +## Validation + +Validated against the local Kubernetes development cluster using the normal +OpenShell gateway path and the node-enforcer Helm overlay. 
+ +The Kubernetes smoke path passed with the node-enforcer overlay: + +```shell +OPENSHELL_E2E_KUBE_CONTEXT=k3d-openshell-dev-tmutch \ +OPENSHELL_E2E_KUBE_EXTRA_VALUES=deploy/helm/openshell/ci/values-node-enforcer.yaml \ +OPENSHELL_E2E_KUBE_BUILD_IMAGES=1 \ +OPENSHELL_E2E_KUBE_TEST=smoke \ + e2e/rust/e2e-kubernetes.sh +``` + +The bypass-detection path also passed with the same overlay: + +```shell +OPENSHELL_E2E_KUBE_CONTEXT=k3d-openshell-dev-tmutch \ +OPENSHELL_E2E_KUBE_EXTRA_VALUES=deploy/helm/openshell/ci/values-node-enforcer.yaml \ +OPENSHELL_E2E_KUBE_BUILD_IMAGES=1 \ +OPENSHELL_E2E_KUBE_TEST=bypass_detection \ + e2e/rust/e2e-kubernetes.sh +``` + +Manual validation also created a sandbox through the gateway and confirmed the +node enforcer acted on the sandbox network namespace: + +```text +Observed sandbox workload registration +Reconciling sandbox network enforcement +Installing sandbox network egress enforcement for registered pod +Sandbox network egress enforcement installed for registered pod +``` + +The manual sandbox check then attempted raw direct TCP from the sandbox user +path and observed a fast rejection instead of a hanging bypass attempt. + +```text +direct-connect-exit-code 111 +``` + +## Key Debug Finding + +The first version looked up a pod network namespace by scanning +`/proc/*/net/fib_trie` for the pod IP. That was insufficient because the host +network namespace also contains pod IPs as routed `UNICAST` entries. + +The enforcer installed rules into the host namespace instead of the sandbox pod +namespace, so bypass attempts timed out rather than being rejected. + +The fix was to require that the pod IP is present as a local address in the +target namespace: + +```text + + +``` + +That distinction selected the sandbox pod namespace instead of the host +namespace and made the unchanged e2e test pass. + +## Product Finding + +This topology shifts the privileged boundary from the sandbox workload to a +node-level infrastructure component. 
+ +That is likely the right direction if the goal is: + +- Agent-controlled workload containers are not privileged. +- Runtime-specific sandboxing is out of scope for this spike, not required. +- OpenShell keeps dynamic network proxy enforcement. +- The supervisor remains responsible for the agent lifecycle. +- Non-Kubernetes deployments can keep the existing combined supervisor mode. + +It is not a claim that the system has no privileged code. On Linux, dynamic +network namespace enforcement needs some trusted component with elevated +authority unless we delegate entirely to a CNI, runtime, or kernel feature. + +## RuntimeClass Is Out Of Scope + +RuntimeClass was useful as an early validation stimulus because it exposed why +privileged sandbox pods are a brittle product shape. It is not part of the final +spike topology. + +The node-enforcer result does not depend on gVisor, Kata, or any other +RuntimeClass. The workload pod starts with normal Kubernetes networking, reports +its pod IP to the node-local enforcer, and the enforcer installs rules into that +pod's network namespace. That path works independently of whether a cluster also +uses a runtime isolation layer. + +RuntimeClass may still be useful later as defense-in-depth, but it should remain +separate from OpenShell's network enforcement model. It does not solve dynamic +per-sandbox policy updates by itself, and it should not be part of this spike's +enforcement design. + +## Why Not Only CNI or NetworkPolicy + +Kubernetes `NetworkPolicy` is too coarse and too asynchronous for our immediate +needs: + +- It is not naturally tied to OpenShell's per-sandbox policy lifecycle. +- It is awkward for fast policy changes on running sandboxes. +- It does not preserve OpenShell's L7 proxy semantics by itself. +- CNI-specific extensions would create provider and plugin dependencies. 
+ +A future CNI or eBPF backend could be worthwhile, but it should be another +backend behind the enforcement interface, not the only implementation. + +## Current Risks + +The prototype proves the topology, but it is not yet ready as a hardened +production component. + +Known risks: + +- Registration is prototype-level HTTP and is not yet strongly authenticated. +- The enforcer trusts pod IP registration too much. +- The node enforcer is privileged and host-networked/host-PID, so compromise has + node-level blast radius. +- Cleanup and reconciliation need to handle deleted pods, restarted pods, and + stale nftables state. +- Multi-gateway deployments need a shared-enforcer model or explicit + per-enforcer scheduling and port separation. The current Helm shape can deploy + one enforcer DaemonSet per release, which is not safe to enable repeatedly on + the same nodes without coordination. +- The current rules are coarse: UID 0 is allowed, non-root TCP/UDP is rejected. +- IPv6 namespace lookup exists conceptually but has not been validated. +- Observability is useful but should become structured enough for operators to + prove what pod/netns/rules were acted on. + +## Hardening Plan + +Before this becomes more than a prototype, the node enforcer should be hardened +around identity, scope, and reconciliation. + +Recommended next steps: + +1. Authenticate registration. + Use Kubernetes-projected workload identity, mTLS, or a gateway-minted token + so the node enforcer can verify the registering sandbox. + +2. Authorize against Kubernetes state. + Verify that the registered pod IP belongs to an OpenShell-owned sandbox pod + scheduled on the same node as the enforcer. In multi-gateway clusters, this + authorization also needs to validate the gateway identity, namespace, and + tenant boundary allowed to register that sandbox. + +3. Reduce privilege where possible. 
+ Determine whether `privileged: true` can be replaced with a narrower set of + Linux capabilities and namespace access. If host PID is still needed, document + why. + +4. Add reconciliation. + Watch OpenShell sandbox pods, ensure expected rules exist, and remove stale + rules when pods are deleted. + +5. Make rule ownership explicit. + Keep all nftables state in an OpenShell-owned table and make installs + idempotent. + +6. Strengthen logs and metrics. + Emit clear events for registration, authorization, selected netns, ruleset + install, cleanup, and failures. + +7. Keep e2e parity non-negotiable. + The existing e2e tests should remain unaltered and passing. Topology changes + should be invisible at the behavioral API layer. + +## Implementation Notes From This Branch + +The branch added configuration and code paths for: + +- `OPENSHELL_SUPERVISOR_ROLE` +- `OPENSHELL_NETWORK_ENFORCEMENT_MODE` +- `OPENSHELL_ENFORCER_ENDPOINT` +- `OPENSHELL_NODE_IP` +- `OPENSHELL_POD_IP` +- Helm `nodeEnforcer` configuration +- Helm `server.supervisorRole` +- Helm `server.networkEnforcementMode` +- Helm `server.sandboxImagePullPolicy` +- Kubernetes driver injection of node and pod IP environment variables +- Supervisor image packaging with `nftables` + +The development cluster currently uses: + +- `server.supervisorRole: workload` +- `server.networkEnforcementMode: external-enforcer` +- `server.sandboxImagePullPolicy: IfNotPresent` +- `nodeEnforcer.enabled: true` + +## Recommendation + +Continue with the node-enforcer topology as the next prototype target. It gives +us the cleanest path to unprivileged sandbox workload pods without giving up the +OpenShell proxy model or requiring a specific runtime class. + +Do not present it as removing privilege entirely. Present it as moving privileged +network enforcement into a narrow, operator-controlled component that can be +authenticated, audited, reconciled, and hardened independently from untrusted +agent workloads. 
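
As an illustration of the namespace-selection rule from the Key Debug Finding, the following Python sketch checks whether a pod IP appears as a `/32 host LOCAL` entry in `fib_trie` text. It is not the branch's enforcer code, which is not shown in this document. The function name `pod_ip_is_local` and both sample tables are hypothetical and only mimic the usual `fib_trie` layout.

```python
from typing import Optional


def pod_ip_is_local(fib_trie_text: str, pod_ip: str) -> bool:
    """Return True when pod_ip is listed as a `/32 host LOCAL` entry.

    Hypothetical helper: it assumes the usual fib_trie layout, where a leaf
    line (`|-- <address>`) is followed by indented entry lines such as
    `/32 host LOCAL` or `/32 link UNICAST`.
    """
    current_ip: Optional[str] = None
    for raw_line in fib_trie_text.splitlines():
        line = raw_line.strip()
        if line.startswith("|-- "):
            # Leaf line: remember the address the following entries describe.
            current_ip = line[4:].strip()
        elif line.startswith("+-- "):
            # Internal trie node: it carries a prefix, not a leaf address.
            current_ip = None
        elif line.startswith("/") and current_ip == pod_ip:
            # Entry line for the address we care about.
            if line.split() == ["/32", "host", "LOCAL"]:
                return True
    return False


# Made-up sample: the pod's own namespace owns the address locally.
POD_NETNS_SAMPLE = """\
Main:
  +-- 0.0.0.0/0 3 0 5
     |-- 0.0.0.0
        /0 universe UNICAST
     +-- 10.42.0.0/24 2 0 2
        |-- 10.42.0.0
           /24 link UNICAST
        |-- 10.42.0.7
           /32 host LOCAL
"""

# Made-up sample: the host namespace only routes to the pod address.
HOST_NETNS_SAMPLE = """\
Main:
  +-- 0.0.0.0/0 3 0 5
     |-- 0.0.0.0
        /0 universe UNICAST
     |-- 10.42.0.7
        /32 link UNICAST
"""
```

A caller would feed this the `net/fib_trie` contents of each candidate process and keep only namespaces where the check returns true, which is what excludes a host namespace that holds the pod IP only as a routed `UNICAST` entry.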
diff --git a/spike/kubernetes-isolation-tradeoffs-diagram.html b/spike/kubernetes-isolation-tradeoffs-diagram.html new file mode 100644 index 000000000..72392f93f --- /dev/null +++ b/spike/kubernetes-isolation-tradeoffs-diagram.html @@ -0,0 +1,434 @@ + + + + + + + + OpenShell Kubernetes Isolation Tradeoffs + + + +
+ Kubernetes Isolation Tradeoffs
+ Current OpenShell keeps the supervisor and agent in one Kubernetes sandbox pod. The split-pod proposals move untrusted agent execution into a separate pod and use platform egress controls, while the Isolation Backend proposal turns the boundary machinery into an explicit contract.
+ current vs proposed states
+ + OpenShell Kubernetes isolation tradeoffs + Comparison of current in-pod isolation, proposed split-pod isolation, and an isolation backend abstraction for OpenShell Kubernetes sandboxes. + + + + + + + + + + + + + + + + + + + + + + + + + + + + A. Current Kubernetes State + single sandbox pod, supervisor builds the inner boundary + + + B. Split-Pod Proposal + trusted supervisor pod, untrusted agent pod, CNI egress boundary + + + C. Isolation Backend Interface + common supervisor contract, backend-specific boundary realization + + + + Sandbox Pod + + + openshell-sandbox supervisor + root prelude, proxy, OPA, credentials + SSH relay, logs, process spawn + requires elevated pod capabilities + + + Inner network namespace + veth route forces egress to proxy + nftables rejects direct TCP/UDP bypass + bypass monitor records attempts + + + Restricted agent process + sandbox user, Landlock, seccomp + binary identity via /proc + + + spawn + + enter namespace + + + Current tradeoffs + + Strong in-pod egress forcing independent of CNI + + Per-binary network policy input still available + - Supervisor privilege sits in same pod as agent + - Blocks restricted-v2 / hardened managed clusters + - Agent and supervisor logs share the workload stream + + + + Supervisor Pod + + + Proxy + OPA + L7 policy gate + + + Credentials + never in agent pod + + + Agent Pod + + + Untrusted image + non-root, no caps + gVisor or Kata + + + Sideload binary + optional SSH/Landlock + + + Kubernetes NetworkPolicy + allow: agent pod to supervisor:3128 + allow: DNS when configured + deny: all other direct egress at CNI layer + + + RuntimeClass + gVisor userspace kernel or Kata VM boundary + + + HTTP_PROXY + + non-cooperative traffic + + + + Split-pod tradeoffs + + Agent pod can run non-root with zero capabilities + + gVisor/Kata reduces host-kernel blast radius + + Separate log streams for audit and agent output + - Per-binary destination restrictions are lost cross-pod + - Enforcement now depends on CNI 
NetworkPolicy support + + + + Supervisor stays policy authority + proxy, OPA, credentials, audit, inference routing + drives one runtime contract instead of inline topology code + + + Isolation Backend contract + realize boundary, return verifiable handle + fail closed if boundary is not ready + operator-selected, not workload-selected + + + In-pod + netns + nft + status quo path + + + Delegated + split pod + CNI boundary + + + VM / Node + microVM + node agent + + + Two invariants + no unguarded workload egress before verification + no workload execution before boundary ready + + + + + + + + + Abstraction tradeoffs + + Separates policy authority from boundary machinery + + Lets split-pod, VM, or node-agent backends share a contract + + Enables in-pod refactor first, delegated later + - Requires verifiable handoff and fail-closed semantics + - Scope is unsettled: network only vs fs/syscall/process too + + + + Decision Map + The diagrams point to different places for the hard boundary. That changes who must be trusted and which controls are enforceable. 
+ + + Hard egress boundary + current: netns + nftables inside pod + split-pod: CNI NetworkPolicy outside agent pod + backend: selected per environment, verified by contract + + + Kernel isolation + current: runc + seccomp + Landlock + split-pod: gVisor userspace kernel or Kata VM + backend: can represent in-pod, VM, or delegated runtime + + + L7, credentials, inference + current: local supervisor proxy + split-pod: supervisor pod proxy + backend: remains with supervisor policy authority + + + Known loss / risk + current: privileged setup near untrusted workload + split-pod: per-binary policy and deny logs become weaker + backend: verification and versioning become first-class + + + Best fit + local dev, k3d/k3s, environments allowing capabilities + enterprise clusters, OpenShift/Gardener-style restrictions + long-term architecture that lets both shapes coexist + + + + Sources: current code/docs; #899 restricted SCC; #981 split supervisor/agent pod; #1305 proxy/runtime decomposition; #1650 process/network split PR; #1703 external compute driver PR; #1737 Isolation Backend interface. + +
+ Current State: The Kubernetes driver creates one Sandbox CR whose pod runs the agent container with supervisor sideloading and elevated capabilities so the supervisor can create the inner network namespace.
+ Split Pod: The proposal moves the proxy and policy authority into a trusted supervisor pod. The agent pod becomes untrusted, non-root, no-capabilities, and optionally gVisor or Kata isolated.
+ NetworkPolicy: Today Helm renders an ingress-only sandbox SSH NetworkPolicy. The split-pod proposal adds egress NetworkPolicy so agent pods can reach only the supervisor proxy and DNS.
+ Backend Interface: The interface proposal keeps the supervisor as policy authority while letting in-pod, split-pod, VM, or node-agent backends realize the isolation boundary with explicit verification.
+ Sources used: #899, #981, #1305, PR #1650, PR #1703, #1737, plus local files architecture/compute-runtimes.md, deploy/helm/openshell/templates/networkpolicy.yaml, and crates/openshell-driver-kubernetes/src/driver.rs.
+ + diff --git a/spike/kubernetes-node-enforcer-options-diagram.html b/spike/kubernetes-node-enforcer-options-diagram.html new file mode 100644 index 000000000..867a66f05 --- /dev/null +++ b/spike/kubernetes-node-enforcer-options-diagram.html @@ -0,0 +1,412 @@ + + + + + + + + OpenShell Kubernetes Node Enforcer Option + + + +
+ Kubernetes Node Enforcer Option
+ The node-enforcer branch adds a fourth Kubernetes isolation shape: keep the workload supervisor and proxy in the sandbox pod, but move privileged pod-network-namespace enforcement to an operator-controlled DaemonSet on each node.
+ branch: node-enforcer-poc/tmutch
+ + OpenShell Kubernetes node-enforcer option + Diagram of the node-enforcer topology and comparison to supervisor-netns, soft-proxy, split-pod, and isolation backend options. + + + + + + + + + + + + + + + + + + + + + + + + + + + + Option 4: Node-Enforcer Topology + workload supervisor remains in the sandbox pod; privileged netns/nft enforcement moves to a node-local DaemonSet + + + + Gateway + creates Sandbox CR + injects role + enforcement mode + + + Kubernetes Driver + supervisor_role=workload + network=external-enforcer + + + + Kubernetes node + + + node-enforcer DaemonSet + same supervisor image, role=enforcer + hostNetwork + hostPID + privileged + listens on host-network endpoint + + + Namespace lookup + find pod netns by pod IP in /proc + + + Install nft table + setns(CLONE_NEWNET) + nft -f + + + Sandbox pod network namespace + + + Workload supervisor + policy loading, proxy, SSH, logs, agent lifecycle + registers with node enforcer before spawning agent + + + Local proxy + OPA + L7 + creds + + + Agent child + sandbox user + + + openshell_external_enforcer table + + + External services + allowed only through proxy + + + Gateway session + config, relay, logs + + + create sandbox + + inject env + + + HTTP registration + pod IP + sandbox ID + + + install pod-netns nft rules + + + HTTP_PROXY + + UID 0 proxy egress + + non-root direct socket rejected + + gateway callback + + + What moved out of the workload pod? + + netns discovery and nftables installation are node-side infrastructure duties + + workload supervisor keeps policy/proxy/credentials close to the agent + + + Prototype hardening required: authenticate registrations, verify pod ownership, reconcile cleanup/stale state, avoid per-release DaemonSet port conflicts. + + + + Comparison Against Other Kubernetes Options + The options differ mostly by where the hard network boundary lives and whether the workload pod needs elevated privileges. 
+ | Option | Hard egress boundary | Privileged component | What remains local to agent pod | Main tradeoff |
+ |---|---|---|---|---|
+ | Supervisor-netns | inner netns + veth + nft | sandbox workload supervisor | all policy, proxy, fs, process, netns | strong but requires elevated sandbox pod |
+ | Soft-proxy | none for direct sockets | none for network setup | proxy, policy, lifecycle, logs | deployable but bypass is not kernel-blocked |
+ | Split-pod | CNI NetworkPolicy | trusted supervisor pod / platform | agent only; optional sideload SSH/Landlock | loses trusted per-binary identity cross-pod |
+ | Node-enforcer | pod netns nft table installed by DaemonSet | node enforcer only | proxy, OPA, creds, SSH, agent lifecycle | keeps proxy local; moves network privilege out |
+ | Isolation Backend | selected backend: netns, CNI, node, VM | backend-specific | supervisor as policy authority | needs verifiable handoff and fail-closed contract |
+
+ How Node-Enforcer Fits the Isolation Backend Abstraction
+ Supervisor contract: policy authority, proxy, audit; register before agent exec
+ Backend handle: pod IP + sandbox ID + node-local enforcer endpoint
+ Backend mechanism: hostPID /proc scan + setns + nft ruleset
+ Required invariant: no agent execution before boundary is installed/verified
+ Open questions: authn/authz + cleanup + ownership checks
+ Sources: node-enforcer-poc/tmutch branch files architecture/plans/node-enforcer-topology-findings.md, crates/openshell-sandbox/src/enforcer.rs, crates/openshell-sandbox/src/lib.rs, crates/openshell-driver-kubernetes/src/driver.rs, deploy/helm/openshell/templates/node-enforcer.yaml, docs/reference/sandbox-compute-drivers.mdx, plus proposal context from #981 and #1737.
+ What It Preserves: The workload supervisor still owns proxy, OPA, credential resolution, inference routing, SSH, logs, settings reloads, and agent lifecycle inside the sandbox pod.
+ What It Moves: The node enforcer owns pod network namespace discovery and coarse nftables egress enforcement, moving that privilege out of the sandbox workload pod.
+ What It Does Not Do: It does not split the agent into a separate pod and does not use Kubernetes NetworkPolicy as the hard boundary. Dynamic endpoint and L7 policy still live in the proxy.
+ Hardening Work: The POC needs authenticated registration, pod ownership verification, cleanup/reconciliation for stale namespaces, and shared-cluster deployment rules.
+ + diff --git a/spike/openshell-isolation-diagram.html b/spike/openshell-isolation-diagram.html new file mode 100644 index 000000000..efdb2d4a0 --- /dev/null +++ b/spike/openshell-isolation-diagram.html @@ -0,0 +1,559 @@ + + + + + + + + OpenShell Isolation Architecture + + + +
+ OpenShell Networking + Filesystem Isolation
+ Gateway policy delivery stays separate from sandbox-local enforcement. The supervisor prepares static controls, launches the agent as an unprivileged child, and forces observable egress through the policy proxy.
+ Standalone HTML/SVG
+ Generated from repo docs + sandbox code
+ [Figure: OpenShell isolation architecture. A three-panel architecture diagram showing gateway and sandbox boundaries, network egress enforcement, and filesystem startup isolation.]
+ Panel 1, Runtime Boundary: Gateway delivers desired state. Sandbox supervisor owns runtime enforcement.
+ Panel 2, Network Egress Enforcement: Ordinary traffic is forced through the proxy, evaluated, and either forwarded, denied, logged, or routed to inference.local.
+ Panel 3, Filesystem + Process Startup: Static controls are prepared before exec and normally require sandbox recreation to change.
+ Legend: client / protocol inspection; gateway or supervisor control; policy engine / policy data; security enforcement / deny path.
+ +
+
+
+ +

Network Isolation

+
+
    +
  • Traffic is configured for proxy mode in sandbox policy conversion.
  • +
  • The child receives proxy environment variables before exec.
  • +
  • Network namespace and nftables reject direct TCP/UDP bypass attempts.
  • +
+
+ +
+
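The proxy environment step above can be sketched as follows. This is a minimal illustration, not the supervisor's implementation: the `build_child_env` helper and the `PROXY_URL` value are hypothetical, and the real proxy endpoint inside the sandbox network namespace is not stated here.

```python
# Hypothetical proxy endpoint; the actual address inside the netns is an assumption.
PROXY_URL = "http://127.0.0.1:3128"


def build_child_env(base_env, proxy_url=PROXY_URL):
    """Return a copy of base_env with proxy variables set in both letter cases."""
    env = dict(base_env)  # copy so the caller's environment is left untouched
    for name in ("http_proxy", "https_proxy"):
        env[name] = proxy_url          # lowercase form read by many CLI tools
        env[name.upper()] = proxy_url  # uppercase form read by many SDKs
    return env


child_env = build_child_env({"PATH": "/usr/bin"})
```

The resulting mapping could be passed as `env=child_env` to `subprocess.run`. Environment variables only steer cooperative clients, which is why the namespace and nftables rules are still needed to reject direct connections.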
+ +

Filesystem Isolation

+
+
    +
  • Landlock PathFds are opened while the supervisor still has root privileges.
  • +
  • The child drops to the sandbox user before Landlock and seccomp enforcement.
  • +
  • Read-only, read-write, workdir, and enriched runtime paths define the visible filesystem.
  • +
+
+ +
+
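The effect of declared read-only and read-write paths can be modeled with a small lookup. This is a sketch of the policy semantics only, not of Landlock itself: `access_for` and the example path lists are hypothetical, and giving read-write precedence when a path matches both lists is an assumption.

```python
from pathlib import PurePosixPath


def _covers(root, path):
    """True when path equals root or lies beneath it (no '..' resolution is done)."""
    root_parts = PurePosixPath(root).parts
    return PurePosixPath(path).parts[:len(root_parts)] == root_parts


def access_for(path, read_only, read_write):
    """Classify a path as read_write, read_only, or denied (undeclared)."""
    if any(_covers(root, path) for root in read_write):
        return "read_write"  # assumption: a write grant wins over a read-only grant
    if any(_covers(root, path) for root in read_only):
        return "read_only"
    return "denied"  # undeclared paths are blocked by default


# Example lists for illustration only.
READ_ONLY = ["/usr", "/lib", "/etc", "/app"]
READ_WRITE = ["/sandbox", "/tmp"]
```

Matching is done on whole path components, so `/usrlocal` is not treated as being under `/usr`.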
+ +

Dynamic Policy

+
+
    +
  • Reloaded network policy takes effect for new connections or safely parsed HTTP requests.
  • +
  • Filesystem and process controls are applied at startup and normally require sandbox recreation to change.
  • +
  • L4 denials can become pending gateway policy proposals.
  • +
+
+
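The split between dynamic and startup-time controls can be expressed as a simple decision. This is a rough sketch under stated assumptions: `plan_update` and the section names other than `filesystem_policy` are hypothetical and are not taken from the policy schema.

```python
# Sections assumed to be fixed at startup; "process" is a hypothetical name.
STARTUP_ONLY = {"filesystem_policy", "process"}


def plan_update(changed_sections):
    """Decide how a policy change could be rolled out to a running sandbox."""
    changed = set(changed_sections)
    if changed & STARTUP_ONLY:
        return "recreate_sandbox"  # startup-time controls cannot be swapped in place
    if changed:
        return "live_reload"       # network-only changes can apply to new connections
    return "no_change"
```

For example, `plan_update(["network_policies"])` returns `"live_reload"`, while any change that touches `filesystem_policy` returns `"recreate_sandbox"`.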
+ + +
+ +