Live-test plugin against cpln-customer-demos and fix three confirmed bugs #1
JerimiahCP wants to merge 2 commits into
Conversation
…bugs

Tested by deploying a real workload (nginx, serverless) to org cpln-customer-demos / GVC ai-plugin-tesr (aws-us-east-1, aws-us-west-2), then exercising secrets, identities, policies, autoscaling config, logs, exec, and the cpln/protected tag. All tests run against the actual Control Plane API. Three bugs found and corrected; two pre-existing changes reverted after live tests proved them wrong.

== BUG 1: Kubernetes-style resources block not valid in workload manifests ==

Attempting to apply a workload manifest with spec.containers[N].resources.requests/limits returned a 400: '"resources" is not allowed'. Control Plane does not use the Kubernetes nested resources structure. CPU and memory are flat fields directly on the container:

    cpu: 50m
    memory: 128Mi

Fix: added to the "Commands / fields that don't exist" table in cli-conventions.md and to the Common Validation Errors table in workload-manifest-reference.md.

== BUG 2: cpln secret create-opaque --payload flag does not exist ==

cli-conventions.md listed --payload as a valid flag for create-opaque. Running `cpln secret create-opaque --payload ...` returns exit 127, with the help text showing only --name and --file. The --payload name comes from the API JSON body field, which the CLI does not expose as a flag. The correct invocation is:

    cpln secret create-opaque --name NAME --file /path/to/value.txt --encoding plain

Confirmed via `cpln secret create-opaque --help` and a live create. Fix: corrected the Required Flags column in cli-conventions.md.

== BUG 3: Policy permission alphabetical sort is NOT enforced ==

access-control/SKILL.md and workload-troubleshooter/diagnostics.md both stated that permissions within a policy binding must be sorted alphabetically, and that submitting [view, reveal] would fail cpln apply with a validation error. Live testing disproved this: applying a policy with unsorted permissions [view, reveal] succeeded (HTTP 201), and the API silently stored them sorted as [reveal, view]. The sort requirement does not exist at the API level — the platform handles ordering on write. Fix: corrected the access-control/SKILL.md gotchas section and the policy troubleshooting note in workload-troubleshooter/diagnostics.md to reflect actual behavior.

== REVERT: exitCode-based failure detection script (cpln-guardrails.md, stateful-storage/SKILL.md) ==

A pre-existing change replaced the original message-text polling loop with a structured exitCode check using cpln workload get. Live testing revealed two problems:

1. cpln workload get does not expose .status.versions — that path is null on the workload object. The data lives in cpln workload get-deployments under .items[].status.versions.
2. The containers field within versions is an object keyed by container name, not an array, so array-index access fails.

The exitCode change was reverted to the original message-text grep loop in both files pending a properly verified replacement.

== Full test manifest ==

HOOKS (all unit tested via simulated tool inputs):

[PASS] Block cpln secret create (generic)
[PASS] Allow cpln secret create-opaque
[PASS] Block cpln apply without --file (including BSD grep -- fix)
[PASS] Allow cpln apply --file
[PASS] Block cpln gvc delete-all-workloads
[PASS] Block cpln volumeset shrink
[PASS] Block cpln <resource> list
[PASS] Allow cpln <resource> get
[PASS] Warning on cpln <resource> delete (exit 0, fires to stderr)

LIVE INFRA (org: cpln-customer-demos, gvc: ai-plugin-tesr):

[PASS] cpln apply --file with flat cpu/memory — clean apply, no 400
[FAIL→FIX] cpln apply with resources.requests/limits — 400 confirmed, doc added
[FAIL→FIX] cpln secret create-opaque --payload — exit 127 confirmed, doc corrected
[PASS] cpln secret create-opaque --name --file --encoding plain — created successfully
[PASS] cpln identity create — created successfully
[PASS] cpln apply policy with identity principalLink — created successfully
[FAIL→FIX] Policy with unsorted permissions — accepted (no 400), auto-sorted by API
[PASS] Workload with identityLink + cpln://secret/NAME.payload — applied and ready
[PASS] Endpoint HTTP 200 after deploy
[PASS] cpln logs with LogQL query — streamed correctly
[PASS] cpln workload tag --tag cpln/protected=true — tagged successfully
[PASS] Delete of protected workload — 400 "untag first" confirmed
[PASS] cpln workload tag --remove-tag cpln/protected — flag confirmed correct via --help
[PASS] cpln workload update --set + jq verification — env landed, jq path correct
[PASS] cpln workload exec -- which sleep — command syntax correct, sleep present on nginx
[PASS] Base64 opaque secret encoding — payload stored as base64 string, warning accurate
[FAIL→REVERT] exitCode jq script — wrong source command and wrong JSON path, reverted

NOT TESTED (requires marketplace install or external dependencies):

[ ] .claude-plugin/plugin.json hooks field — requires Claude Code marketplace install
[ ] .codex-plugin/plugin.json rules field — requires Codex
[ ] GEMINI.md guardrails section — requires Gemini CLI
[ ] KEDA GVC prerequisite warning — requires queue + KEDA infrastructure
[ ] k8s-migrator firewall default warning — prose only, not CLI-testable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
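To make the revert's two findings concrete, here is a minimal sketch using a hypothetical, trimmed JSON payload that follows the `cpln workload get-deployments` shape described above (container and field names are illustrative only):

```python
import json

# Hypothetical, trimmed `cpln workload get-deployments` response matching the
# shape described in the revert note: versions live under
# .items[].status.versions, and "containers" is an object keyed by container
# name -- NOT an array, so containers[0]-style index access fails.
payload = json.loads("""
{
  "items": [
    {"status": {"versions": [
      {"containers": {"nginx": {"ready": true}}}
    ]}}
  ]
}
""")

# Key by container name instead of array index.
containers = payload["items"][0]["status"]["versions"][0]["containers"]
print(list(containers))
```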
.claude/settings.json does not get merged into user settings on plugin install. Only the agent and subagentStatusLine keys are supported from a plugin-bundled settings.json — the hooks block is ignored by the harness. Hooks are correctly distributed via the hooks field in .claude-plugin/plugin.json, which the harness activates at runtime when the plugin is enabled. Adding .claude/ to .gitignore to prevent local session data from being committed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  ],
- "mcpServers": "./.claude-mcp.json"
+ "mcpServers": "./.claude-mcp.json",
+ "hooks": "./hooks/hooks.json"
The official plugin.json reference documents that `hooks/hooks.json` is auto-loaded from the default location. Are you sure the hooks were ignored when you installed and used the plugin?
When I tested with Sonnet, it wasn't loading the hooks and was attempting to call the CLI directly, which caused it to ignore the explicit hooks call. I added this to the plugin.json, reloaded the plugin, and ran it again; it leveraged the new hooks after that.
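For reference, a minimal sketch of the relevant `.claude-plugin/plugin.json` keys as they appear in the diff above (other fields omitted):

```json
{
  "mcpServers": "./.claude-mcp.json",
  "hooks": "./hooks/hooks.json"
}
```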
@@ -137,6 +137,8 @@ Use `cpln://secret/NAME` to reference the full secret, or `cpln://secret/NAME.KE
| nats-account | `accountId`, `privateKey` | `cpln://secret/my-nats.accountId` |
| any type | omit key for full secret as JSON | `cpln://secret/my-secret` |

+ **Opaque `.payload` encoding warning:** If the secret was created with base64 encoding (common when storing binary content — certs, keys, binary tokens — via the console or API), the `.payload` reference returns the base64-encoded string, not the decoded value. The workload receives it as a base64 string and typically fails with a cryptographic or parse error. To get the decoded value at runtime, the secret must have runtime decoding enabled (`encoding: base64` + runtime decode on the secret spec), or use the full secret reference (`cpln://secret/NAME`) and decode in application code. For plaintext secrets (API keys, connection strings, passwords), `.payload` works as expected. **Before injecting an opaque secret as `.payload`, ask the user: was this secret created with base64 encoding?**
This warning is inaccurate. Opaque .payload always delivers the original value the user stored. There is no "runtime decoding" flag. encoding: 'base64' is purely a storage setting so binary content can survive the JSON API; the backend forwards it to Kubernetes as-is and K8s decodes it back to the original bytes at injection time. Asking the user "was this created with base64 encoding?" before injecting .payload adds friction without changing what the workload sees.
Suggested replacement:
> **Opaque `.payload` reference:** `.payload` always delivers the value the user originally stored. If the secret was created with `encoding: 'base64'` (used to store binary content such as binaries, certs, or keys that aren't valid UTF-8), the actuator forwards the base64 to Kubernetes as-is, and Kubernetes decodes it back to the original bytes when injecting it as an env var or mounting it as a file; no application-side decoding is required. **Caveat:** env vars on most container runtimes don't reliably carry null bytes or non-UTF-8 content, so for opaque secrets whose decoded value is binary, mount as a volume instead of injecting as an env var.
Just so you know, with encoding: 'base64':
- You store the base64 string in `payload` with `encoding: base64`.
- The backend forwards that base64 string into the K8s `Secret.data.payload` field as-is (because K8s already requires `data` values to be base64-encoded; that's just the K8s format).
- K8s decodes it once at injection time, so the workload receives the original decoded bytes (the binary the base64 represented).
Important: encoding: 'base64' is the right choice when you have binary content that you base64-encoded just to fit it into a JSON string, and you want the workload to see the binary. If you actually want the workload to see the literal base64 string itself (e.g., it's a token that happens to look like base64 and your app expects it as text), use encoding: 'plain' and store the string as plaintext, otherwise K8s will decode it and your app will get the underlying bytes instead of the string. Maybe let's consider saying that here in the AI plugin as well.
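The store-then-decode round trip described above can be illustrated locally with plain shell (no `cpln` involved; the value here is a stand-in for cert/key content):

```shell
# The value is base64-encoded once so it can travel through the JSON API,
# and decoded exactly once at injection time -- the workload sees the
# original value, not the base64 string.
original='-----BEGIN KEY----- fake key bytes -----END KEY-----'
stored=$(printf '%s' "$original" | base64)       # what payload holds with encoding: base64
injected=$(printf '%s' "$stored" | base64 -d)    # what Kubernetes hands the container
[ "$injected" = "$original" ] && echo 'round-trip ok'
```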
> outboundAllowCIDR:
>   - 0.0.0.0/0  # or restrict to specific CIDRs/hostnames
> ```
Small fix: the note says cpln stack opens outbound "for all services that expose ports." The port part is wrong, that rule is for inbound. Outbound is opened for every service, except when network_mode: none.
Suggested rewrite:
> `cpln stack` defaults external outbound to open for every service it generates, except those with `network_mode: none`. Native Control Plane workload manifests default external outbound to blocked.
Everything else looks good to me.
  "autoscaling"
],
"skills": "./skills/",
"rules": "./rules/",
`rules` is not a valid property in Codex's `plugin.json` and is silently ignored.
We can remove the `rules` entry. I was hoping it would be picked up at initial load as an additional reference, but if it's being ignored we can remove it.
I'll make these updates and push it back.
| Error | Fix |
|:---|:---|
| `spec.containers[N].resources` present (returns a 400 with `"resources" is not allowed`) | Remove it — Control Plane does not use Kubernetes-style `resources.requests/limits`. Set `cpu` and `memory` directly on the container object: `cpu: 50m`, `memory: 128Mi`. |
Good, let's also mention here (and on that other similar note we have in another file) that `cpln workload update` also lets a user set CPU and memory, e.g. `cpln workload update my-workload --set spec.containers.<name>.cpu=25m`. You can see the full list of fields a user can change with `cpln workload update --help`.
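A side-by-side sketch of the two manifest shapes discussed in this row (container name and image are placeholders):

```yaml
# Rejected by Control Plane with 400 '"resources" is not allowed'
# (Kubernetes-style nested block):
containers:
  - name: app            # placeholder
    image: nginx:latest
    resources:
      requests:
        cpu: 50m
        memory: 128Mi

# Accepted: cpu and memory are flat fields directly on the container:
containers:
  - name: app
    image: nginx:latest
    cpu: 50m
    memory: 128Mi
```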
**Constraints:**
- Each binding's permissions must be **sorted alphabetically and unique** (validation rule).
- Each binding's permissions must be **unique**. The API auto-sorts them alphabetically — you don't need to sort manually.
"Duplicate permissions in the same binding will cause a validation error" is wrong. The API silently de-duplicates them on write; duplicates don't trigger an error, they just get dropped. Ordering is also auto-normalized: the API sorts the array on write regardless of input order.
Suggested replacement:
> Permission ordering and duplicates don't matter; the API normalizes both. You don't need to sort permissions alphabetically in your manifests, because the platform sorts them on write. Duplicate permissions within the same binding are silently de-duplicated rather than rejected; the request still succeeds. (Avoid duplicates anyway for manifest clarity and clean diffs.)
Or maybe we don't mention this at all; it's not important enough to point out.
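A toy model of the write-time normalization described above (this is not the API's code, just the observable behavior: sort plus de-duplicate):

```python
def normalize(perms):
    # Model of the server-side behavior observed in the live test:
    # duplicates are silently dropped and the result is stored sorted.
    return sorted(set(perms))

# Unsorted input with a duplicate is accepted and normalized, not rejected.
print(normalize(["view", "reveal", "view"]))  # ['reveal', 'view']
```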
## Gotchas

- **Policies fail silently when wrong.** A typo in `targetKind`, a missing principal link, or an invalid permission name produces a policy that exists but grants nothing. Always verify with `cpln policy access-report POLICY_NAME` after creation.
- **Permission ordering doesn't matter — the API auto-sorts.** You do not need to sort permissions alphabetically in your manifests; the platform sorts them on write. Duplicate permissions in the same binding will cause a validation error.
Related to the comment above.
https://github.com/controlplane-com/ai-plugin/pull/1/changes#r3197980292
keda:
  enabled: true
```
This is correct. One small enhancement worth adding: most real KEDA triggers (SQS, Pub/Sub, Azure queues, etc.) need cloud credentials, so `spec.keda` usually also takes `identityLink` and/or `secrets`. The minimal `enabled: true` is fine as the prerequisite check, but the example could mention these for production use:
```yaml
kind: gvc
name: my-gvc
spec:
  keda:
    enabled: true
    identityLink: //identity/keda-id   # required for cloud-resource triggers
    secrets:                           # optional, for TriggerAuthentication
      - //secret/queue-creds
```

### Critical Warnings
- If `sleep` is not available in **any** container, ALL containers receive SIGKILL immediately
- If `sleep` is not available in **any** container, ALL containers receive SIGKILL immediately — the entire grace period is skipped. This silently affects distroless images, scratch-based images, and some minimal Alpine builds. Verify with `cpln workload exec WORKLOAD --gvc GVC -- which sleep` before relying on the grace period. If `sleep` is absent, either add it to the image or configure an explicit preStop hook that does not depend on it.
Going back over this, I think part of what I originally wrote is also off and the additions stacked on top of it; sorry about that. The "ALL containers get SIGKILL immediately, the entire grace period is skipped" framing isn't accurate: if the `preStop` hook fails because `sleep` is missing, Kubernetes still delivers SIGTERM and still honors the full `terminationGracePeriodSeconds`. What actually gets lost is the request-draining delay, so the load balancer may still send traffic at the moment SIGTERM arrives. `preStop` is per-container, too, and on K8s 1.33+ Control Plane uses the native `lifecycle.preStop.sleep` hook with no binary dependency, so distroless/scratch are fine there; the risk only applies on older clusters. Also, `which sleep` won't work on distroless (there's no `which`); something like `/bin/sleep 0` is a better check.
That said, I don't think this is actually worth mentioning in the AI plugin. It's a narrow edge case (distroless/scratch on K8s < 1.33) and the failure mode is subtle rather than catastrophic; let's remove it completely.
- `cpln gvc delete-all-workloads` — destroys every workload in the GVC
- `cpln volumeset shrink` — permanent data loss on the old volume
- Any `cpln <resource> delete` — surface the org, GVC, resource name, and blast radius before proceeding
Turns out CLAUDE.md and GEMINI.md aren't shipped to users; they're only loaded when developing on this repo. Let's remove this change; anything user-facing belongs in a skill.
Quick correction: GEMINI.md actually does ship to users (Gemini CLI loads it every session via `contextFileName`), so it's not dev-only like CLAUDE.md. I think the right framing is to treat it as a guardrails file: short, always-on rules like destructive-op confirmations and CLI conventions.