
Live-test plugin against cpln-customer-demos and fix three confirmed … #1

Open
JerimiahCP wants to merge 2 commits into main from feat/plugin-setup-hooks-guardrails

Conversation

@JerimiahCP
Collaborator

…bugs

Tested by deploying a real workload (nginx, serverless) to org cpln-customer-demos / GVC ai-plugin-tesr (aws-us-east-1, aws-us-west-2), then exercising secrets, identities, policies, autoscaling config, logs, exec, and the cpln/protected tag. All tests run against the actual Control Plane API. Three bugs found and corrected; two pre-existing changes reverted after live tests proved them wrong.

== BUG 1: Kubernetes-style resources block not valid in workload manifests ==
Attempting to apply a workload manifest with spec.containers[N].resources.requests/limits returned a 400: '"resources" is not allowed'. Control Plane does not use the Kubernetes nested resources structure. CPU and memory are flat fields directly on the container:

cpu: 50m
memory: 128Mi

Fix: added to the "Commands / fields that don't exist" table in cli-conventions.md and to the Common Validation Errors table in workload-manifest-reference.md.
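
For reference, a container block in the shape the live apply accepted. Only the flat cpu/memory fields are taken from the test above; kind, type, names, and image are illustrative placeholders, not the actual test manifest:

  kind: workload
  name: nginx-demo              # placeholder name
  spec:
    type: serverless            # the live test used a serverless workload
    containers:
      - name: nginx
        image: nginx:latest
        cpu: 50m                # flat field on the container, not resources.requests.cpu
        memory: 128Mi           # flat field on the container, not resources.limits.memory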

== BUG 2: cpln secret create-opaque --payload flag does not exist ==
cli-conventions.md listed --payload as a valid flag for create-opaque. Running `cpln secret create-opaque --payload ...` returns exit 127, with the help text showing only --name and --file. The --payload name comes from the API JSON body field, which the CLI does not expose as a flag. The correct invocation is:

cpln secret create-opaque --name NAME --file /path/to/value.txt --encoding plain

Confirmed via cpln secret create-opaque --help and live create. Fix: corrected the Required Flags column in cli-conventions.md.
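
For illustration, the full round trip in that shape; the secret name, path, and value here are placeholders, not the ones used in the live test:

  printf 'example-value' > /tmp/value.txt
  cpln secret create-opaque --name demo-secret --file /tmp/value.txt --encoding plain
  # --payload exists only in the API JSON body; passing it as a CLI flag exits 127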

== BUG 3: Policy permission alphabetical sort is NOT enforced ==
access-control/SKILL.md and workload-troubleshooter/diagnostics.md both stated that permissions within a policy binding must be sorted alphabetically, and that submitting [view, reveal] would fail cpln apply with a validation error. Live test disproved this: applying a policy with unsorted permissions [view, reveal] succeeded (HTTP 201) and the API silently stored them sorted as [reveal, view]. The sort requirement does not exist at the API level — the platform handles ordering on write.

Fix: corrected access-control/SKILL.md gotchas section and the policy troubleshooting note in workload-troubleshooter/diagnostics.md to reflect actual behavior.
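
A sketch of the kind of binding that was submitted unsorted. The permissions, targetKind, and principalLink concepts come from the tests above; the exact manifest layout and the link format are assumptions, so treat all names as placeholders:

  kind: policy
  name: demo-secret-policy            # placeholder
  targetKind: secret
  bindings:
    - permissions: [view, reveal]     # submitted unsorted; accepted with HTTP 201
      principalLinks:
        - //gvc/GVC_NAME/identity/IDENTITY_NAME   # assumed link format
  # The API stores the permissions sorted: [reveal, view]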

== REVERT: exitCode-based failure detection script (cpln-guardrails.md, stateful-storage/SKILL.md) ==
A pre-existing change replaced the original message-text polling loop with a structured exitCode check using cpln workload get. Live testing revealed two problems:

  1. cpln workload get does not expose .status.versions — that path is null on the workload object. The data lives in cpln workload get-deployments under .items[].status.versions.
  2. The containers field within versions is an object keyed by container name, not an array, so array-index access fails.

The exitCode change was reverted to the original message-text grep loop in both files pending a properly verified replacement; a sketch of the corrected data path follows.
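
For the record, a sketch of where that data actually lives per the two findings above. The exitCode field name inside each container entry and the -o json output flag are assumptions that were not re-verified in this round:

  cpln workload get-deployments WORKLOAD --gvc GVC -o json \
    | jq '.items[].status.versions[]?.containers | to_entries[]
          | {container: .key, exitCode: .value.exitCode}'
  # containers is an object keyed by container name, so to_entries[] replaces array indexing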

== Full test manifest ==

HOOKS (all unit tested via simulated tool inputs):
[PASS] Block cpln secret create (generic)
[PASS] Allow cpln secret create-opaque
[PASS] Block cpln apply without --file (including BSD grep -- fix; see the sketch after this list)
[PASS] Allow cpln apply --file
[PASS] Block cpln gvc delete-all-workloads
[PASS] Block cpln volumeset shrink
[PASS] Block cpln <resource> list
[PASS] Allow cpln <resource> get
[PASS] Warning on cpln <resource> delete (exit 0, fires to stderr)
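
Sketch of the BSD-grep-safe check behind the apply guard above. Only the -- fix is taken from the test; the variable name and the blocking exit code are assumptions about how the hook is wired:

  # '--' stops BSD grep from treating the '--file' pattern as an option
  if ! printf '%s' "$CPLN_COMMAND" | grep -q -- '--file'; then
    echo 'blocked: cpln apply must be invoked with --file' >&2
    exit 2   # assumed blocking exit code
  fi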

LIVE INFRA (org: cpln-customer-demos, gvc: ai-plugin-tesr):
[PASS] cpln apply --file with flat cpu/memory — clean apply, no 400
[FAIL→FIX] cpln apply with resources.requests/limits — 400 confirmed, doc added
[FAIL→FIX] cpln secret create-opaque --payload — exit 127 confirmed, doc corrected
[PASS] cpln secret create-opaque --name --file --encoding plain — created successfully
[PASS] cpln identity create — created successfully
[PASS] cpln apply policy with identity principalLink — created successfully
[FAIL→FIX] Policy with unsorted permissions — accepted (no 400), auto-sorted by API
[PASS] Workload with identityLink + cpln://secret/NAME.payload — applied and ready
[PASS] Endpoint HTTP 200 after deploy
[PASS] cpln logs with LogQL query — streamed correctly
[PASS] cpln workload tag --tag cpln/protected=true — tagged successfully
[PASS] Delete of protected workload — 400 "untag first" confirmed
[PASS] cpln workload tag --remove-tag cpln/protected — flag confirmed correct via --help
[PASS] cpln workload update --set + jq verification — env landed, jq path correct
[PASS] cpln workload exec -- which sleep — command syntax correct, sleep present on nginx
[PASS] Base64 opaque secret encoding — payload stored as base64 string, warning accurate
[FAIL→REVERT] exitCode jq script — wrong source command and wrong JSON path, reverted

NOT TESTED (requires marketplace install or external dependencies):
[ ] .claude-plugin/plugin.json hooks field — requires Claude Code marketplace install
[ ] .codex-plugin/plugin.json rules field — requires Codex
[ ] GEMINI.md guardrails section — requires Gemini CLI
[ ] KEDA GVC prerequisite warning — requires queue + KEDA infrastructure
[ ] k8s-migrator firewall default warning — prose only, not CLI-testable

JerimiahCP and others added 2 commits April 28, 2026 10:37
Live-test plugin against cpln-customer-demos and fix three confirmed bugs
.claude/settings.json does not get merged into user settings on plugin install.
Only the agent and subagentStatusLine keys are supported from a plugin-bundled
settings.json — the hooks block is ignored by the harness. Hooks are correctly
distributed via the hooks field in .claude-plugin/plugin.json, which the harness
activates at runtime when the plugin is enabled.

Adding .claude/ to .gitignore to prevent local session data from being committed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread .claude-plugin/plugin.json
],
"mcpServers": "./.claude-mcp.json"
"mcpServers": "./.claude-mcp.json",
"hooks": "./hooks/hooks.json"
Collaborator

In the official plugin.json reference, hooks/hooks.json is documented as auto-loaded from the default location, which is exactly this path. Are you sure the hooks were ignored when you installed and used the plugin?

Collaborator Author

When I tested with Sonnet, it wasn't loading the hooks and was attempting to call the CLI directly, which caused it to ignore the explicit hooks call. I added this to the plugin.json, reloaded the plugin, and ran it again. It leveraged the new hooks after that.

Comment on lines 130 to +140
@@ -137,6 +137,8 @@ Use `cpln://secret/NAME` to reference the full secret, or `cpln://secret/NAME.KE
| nats-account | `accountId`, `privateKey` | `cpln://secret/my-nats.accountId` |
| any type | omit key for full secret as JSON | `cpln://secret/my-secret` |

**Opaque `.payload` encoding warning:** If the secret was created with base64 encoding (common when storing binary content — certs, keys, binary tokens — via the console or API), the `.payload` reference returns the base64-encoded string, not the decoded value. The workload receives it as a base64 string and typically fails with a cryptographic or parse error. To get the decoded value at runtime, the secret must have runtime decoding enabled (`encoding: base64` + runtime decode on the secret spec), or use the full secret reference (`cpln://secret/NAME`) and decode in application code. For plaintext secrets (API keys, connection strings, passwords), `.payload` works as expected. **Before injecting an opaque secret as `.payload`, ask the user: was this secret created with base64 encoding?**
Collaborator

This warning is inaccurate. Opaque .payload always delivers the original value the user stored. There is no "runtime decoding" flag. encoding: 'base64' is purely a storage setting so binary content can survive the JSON API; the backend forwards it to Kubernetes as-is and K8s decodes it back to the original bytes at injection time. Asking the user "was this created with base64 encoding?" before injecting .payload adds friction without changing what the workload sees.

Suggested replacement:

> **Opaque `.payload` reference:** `.payload` always delivers the value the user originally stored. If the secret was created with `encoding: 'base64'` (used to store binary content such as binaries, certs or keys that aren't valid UTF-8), the actuator forwards the base64 to Kubernetes as-is and Kubernetes decodes it back to the original bytes when injecting as an env var or mounting as a file, no application-side decoding required. **Caveat:** env vars on most container runtimes don't reliably carry null bytes or non-UTF-8 content, so for opaque secrets whose decoded value is binary, mount as a volume instead of injecting as an env var.

Just so you know, with encoding: 'base64':

  1. You store the base64 string in payload with encoding base64.
  2. The backend forwards that base64 string into the K8s Secret.data.payload field as-is (because K8s already requires data values to be base64-encoded, that's just the K8s format).
  3. K8s decodes it once at injection time, so the workload receives the original decoded bytes (the binary the base64 represented).

Important: encoding: 'base64' is the right choice when you have binary content that you base64-encoded just to fit it into a JSON string, and you want the workload to see the binary. If you actually want the workload to see the literal base64 string itself (e.g., it's a token that happens to look like base64 and your app expects it as text), use encoding: 'plain' and store the string as plaintext; otherwise K8s will decode it and your app will get the underlying bytes instead of the string. Maybe let's consider saying that here in the AI plugin as well.
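
For illustration, roughly what that flow looks like end to end. The --encoding base64 flag is assumed to mirror the plain invocation verified in this PR, and the file names are placeholders:

  base64 < server.key > /tmp/server.key.b64    # base64 only so the bytes survive the JSON API
  cpln secret create-opaque --name tls-key --file /tmp/server.key.b64 --encoding base64
  # K8s decodes once at injection, so the workload sees the original bytes;
  # mount binary payloads as a volume rather than injecting them as an env var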

Comment thread agents/k8s-migrator.md
> outboundAllowCIDR:
> - 0.0.0.0/0 # or restrict to specific CIDRs/hostnames
> ```

Collaborator

Small fix: the note says cpln stack opens outbound "for all services that expose ports." The port part is wrong; that rule is for inbound. Outbound is opened for every service, except when network_mode: none.

Suggested rewrite:

cpln stack defaults external outbound to open for every service it generates, except those with network_mode: none. Native Control Plane workload manifests default external outbound to blocked.
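
For reference, where this lives in a native workload manifest; the nesting under firewallConfig.external is an assumption here, since only outboundAllowCIDR itself appears in the diff above:

  spec:
    firewallConfig:
      external:
        outboundAllowCIDR:
          - 0.0.0.0/0        # or restrict to specific CIDRs/hostnames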

Everything else looks good to me.

Comment thread .codex-plugin/plugin.json
"autoscaling"
],
"skills": "./skills/",
"rules": "./rules/",
Collaborator

rules is not a valid property in Codex's plugin.json and is silently ignored.

Collaborator Author

We can remove the rules call. I was hoping it would be picked up on the initial load as an additional reference, but if it's being ignored, then we can remove it.

@JerimiahCP
Collaborator Author

I'll make these updates and push it back.


| Error | Fix |
|:---|:---|
| `spec.containers[N].resources` present | Remove it — Control Plane does not use Kubernetes-style `resources.requests/limits`. Set `cpu` and `memory` directly on the container object: `cpu: 50m`, `memory: 128Mi`. This returns a 400 with `"resources" is not allowed`. |
Collaborator

Good, let's also mention here (and on that other similar note we have in another file) that cpln workload update can also set CPU and memory, e.g. `cpln workload update my-workload --set spec.containers.<name>.cpu=25m`. You can verify this and see the list of other fields a user can change with `cpln workload update --help`.


**Constraints:**
- Each binding's permissions must be **sorted alphabetically and unique** (validation rule).
- Each binding's permissions must be **unique**. The API auto-sorts them alphabetically — you don't need to sort manually.
Collaborator

"Duplicate permissions in the same binding will cause a validation error" is wrong. The API silently de-duplicates them on write, duplicates don't trigger an error, they just get dropped. Ordering is also auto-normalized: the API sorts the array on write regardless of input order.

Suggested replacement:

Permission ordering and duplicates don't matter; the API normalizes both. You don't need to sort permissions alphabetically in your manifests; the platform sorts them on write. Duplicate permissions within the same binding are silently de-duplicated, not rejected; the request still succeeds. (Avoid duplicates anyway for manifest clarity and clean diffs.)

Or maybe we don't mention this at all; it is not that important info to point out.

## Gotchas

- **Policies fail silently when wrong.** A typo in `targetKind`, a missing principal link, or an invalid permission name produces a policy that exists but grants nothing. Always verify with `cpln policy access-report POLICY_NAME` after creation.
- **Permission ordering doesn't matter — the API auto-sorts.** You do not need to sort permissions alphabetically in your manifests; the platform sorts them on write. Duplicate permissions in the same binding will cause a validation error.

keda:
  enabled: true
```

Collaborator

This is correct. One small enhancement worth adding: most real KEDA triggers (SQS, Pub/Sub, Azure queues, etc.) need cloud credentials, so spec.keda usually also takes identityLink and/or secrets. The minimal enabled: true is fine as the prerequisite check, but the example could mention these for production use:

kind: gvc
name: my-gvc
spec:
  keda:
    enabled: true
    identityLink: //identity/keda-id   # required for cloud-resource triggers
    secrets:                            # optional, for TriggerAuthentication
      - //secret/queue-creds

### Critical Warnings

- If `sleep` is not available in **any** container, ALL containers receive SIGKILL immediately
- If `sleep` is not available in **any** container, ALL containers receive SIGKILL immediately — the entire grace period is skipped. This silently affects distroless images, scratch-based images, and some minimal Alpine builds. Verify with `cpln workload exec WORKLOAD --gvc GVC -- which sleep` before relying on the grace period. If `sleep` is absent, either add it to the image or configure an explicit preStop hook that does not depend on it.
Collaborator

Going back over this, I think part of what I originally wrote is also off and the additions stacked on top of it; sorry about that. The "ALL containers get SIGKILL immediately, the entire grace period is skipped" framing isn't accurate: if the preStop hook fails because sleep is missing, Kubernetes still delivers SIGTERM and still honors the full terminationGracePeriodSeconds. What actually gets lost is the request-draining delay, so the load balancer may still send traffic at the moment SIGTERM arrives. preStop is also per-container, and on K8s 1.33+ Control Plane uses the native lifecycle.preStop.sleep hook with no binary dependency, so distroless/scratch are fine there; the risk only applies on older clusters. Also, `which sleep` won't work on distroless (no `which`); something like `/bin/sleep 0` is a better check.

That said, I don't think this is actually worth mentioning in the AI plugin. It's a narrow edge case (distroless/scratch on K8s < 1.33) and the failure mode is subtle rather than catastrophic; let's remove it completely.

Comment thread GEMINI.md
- `cpln gvc delete-all-workloads` — destroys every workload in the GVC
- `cpln volumeset shrink` — permanent data loss on the old volume
- Any `cpln <resource> delete` — surface the org, GVC, resource name, and blast radius before proceeding

Collaborator

Turns out CLAUDE.md and GEMINI.md aren't shipped to users; they're only loaded when developing on this repo. Let's remove this change; anything user-facing belongs in a skill.

Collaborator

Quick correction: GEMINI.md actually does ship to users (Gemini CLI loads it every session via contextFileName), so it's not dev-only like CLAUDE.md. I think the right framing is to treat it as a guardrails file: short, always-on rules like destructive-op confirmations and CLI conventions.
