Skip to content

CP-28161: remove API key from webhook server#794

Draft
evan-cz wants to merge 1 commit into
developfrom
CP-28161
Draft

CP-28161: remove API key from webhook server#794
evan-cz wants to merge 1 commit into
developfrom
CP-28161

Conversation

@evan-cz
Copy link
Copy Markdown
Contributor

@evan-cz evan-cz commented Apr 28, 2026

The webhook server (cloudzero-webhook binary; previously known as the insights-controller) used to push metrics directly to the CloudZero cloud API and authenticated each request with the customer's API key. It now always pushes to the in-cluster aggregator, which is unauthenticated on the cluster-internal hop. The API key secret mount, the Authorization: Bearer header, the in-process secret monitor that polled the key file for rotation, and the legacy query-string encoding of cluster_name/cloud_account_id/region in the destination URL all became dead weight on this code path.

Functional Change:

Before: The webhook pod mounted the cloudzero-api-key secret, the rendered server-config.yaml included api_key_path:, and the pusher posted to http://aggregator/collector?cluster_name=...&cloud_account_id= ...&region=... with an Authorization: Bearer $KEY header. A secret monitor polled the API key file every 10 seconds for in-process rotation.

After: The webhook pod has no api-key volume or volumeMount, the rendered server-config.yaml has no api_key_path field, and the pusher posts to the bare collector URL with no Authorization header. The secret monitor is no longer wired into the webhook entry point. The aggregator (collector and shipper), the agent (Prometheus and validator init containers), the config-loader Job, and other components that legitimately still need the key are unchanged.

Solution:

  1. helm/templates/_cm_helpers.tpl: removed the api_key_path: line from the cloudzero-agent.insightsController.configuration helper. The aggregator's configuration helper still emits api_key_path because the shipper continues to need the key for uploads.
  2. helm/templates/webhook-deploy.yaml: removed the cloudzero-api-key volumeMount and volume blocks (both gated on existingSecretName or apiKey).
  3. app/config/webhook/settings.go: removed the unexported RemoteWrite.apiKey field, the GetAPIKey/SetAPIKey methods, the absFilePath and isValidURL helpers, and the Settings.mu mutex (it only protected the API-key read/write pair). Simplified setRemoteWriteURL() to assign Destination verbatim with a single URL validation; the aggregator's /collector handler does not parse cluster_name/cloud_account_id/region query parameters (only the shipper consumes those, on its own outbound path). The APIKeyPath field is kept as a deprecated tombstone so legacy server-config.yaml files and pod specs that still set API_KEY_PATH parse cleanly even under future strict-mode YAML/env decoders.
  4. app/domain/pusher/pusher.go: removed the GetAPIKey lookup, the empty-key error branch, the Authorization header, and the apiKey parameter from pushMetrics.
  5. app/functions/webhook/main.go: removed the monitor.NewSecretMonitor instantiation and its Run/Shutdown wiring. The monitor package is still imported for TLS reloading on SIGHUP, which is unrelated. Also updated the startup log message from "Starting CloudZero Insights Controller" to "Starting CloudZero Webhook Server" to use the current component name.
  6. app/domain/monitor/secret_monitor_test.go: replaced the test's dependency on the webhook config.Settings (which no longer implements MonitoredAPIKey) with a small in-test file-backed fake.
  7. app/config/webhook/settings_test.go and app/domain/pusher/pusher_test.go: removed API-key fixtures; the pusher test now positively asserts that no Authorization header is sent.
  8. helm/tests/webhook_no_apikey_mount_test.yaml: new helm-unittest suite asserting the rendered webhook Deployment has no cloudzero-api-key volume or volumeMount, and the rendered webhook ConfigMap contains no api_key_path field and no cluster_name/cloud_account_id query string fragments.

Validation:

  • Go unit tests pass for app/config/webhook, app/domain/pusher, and app/domain/monitor.
  • make lint-go is clean; make helm-lint is clean; make helm-test passes, including the new helm-unittest suite.
  • Generated chart manifests under tests/helm/template/ show only the intended deletions across all six variants (alloy, cert-manager, federated, istio, kubestate, manifest): one api_key_path: line in the webhook ConfigMap, one four-line volumeMount block, and one three-line volume block per variant. No other resources change.
  • KIND end-to-end: deployed the freshly built image, confirmed the webhook pod has no cloudzero-api-key volume or volumeMount, the rendered server-config.yaml has no api_key_path, and the destination URL has no query string. Exercised the admission flow with test pods and confirmed the pusher's "Pushing records to remote write endpoint" log fires with records flowing to the in-cluster aggregator.
  • Development EKS cluster end-to-end: helm upgrade rolled cleanly, webhook started with no API-key plumbing and no startup errors, admission requests exercised, "Pushing records to remote write endpoint" log confirmed the full data path through the new pusher code.

The webhook server (cloudzero-webhook binary; previously known as the
insights-controller) used to push metrics directly to the CloudZero
cloud API and authenticated each request with the customer's API key.
It now always pushes to the in-cluster aggregator, which is
unauthenticated on the cluster-internal hop. The API key secret
mount, the Authorization: Bearer header, the in-process secret monitor
that polled the key file for rotation, and the legacy query-string
encoding of cluster_name/cloud_account_id/region in the destination URL
all became dead weight on this code path.

Functional Change:

Before: The webhook pod mounted the cloudzero-api-key secret, the
rendered server-config.yaml included api_key_path:, and the pusher
posted to http://aggregator/collector?cluster_name=...&cloud_account_id=
...&region=... with an `Authorization: Bearer $KEY` header. A secret
monitor polled the API key file every 10 seconds for in-process
rotation.

After: The webhook pod has no api-key volume or volumeMount, the
rendered server-config.yaml has no api_key_path field, and the pusher
posts to the bare collector URL with no Authorization header. The
secret monitor is no longer wired into the webhook entry point. The
aggregator (collector and shipper), the agent (Prometheus and validator
init containers), the config-loader Job, and other components that
legitimately still need the key are unchanged.

Solution:

1. helm/templates/_cm_helpers.tpl: removed the api_key_path: line from
   the cloudzero-agent.insightsController.configuration helper. The
   aggregator's configuration helper still emits api_key_path because
   the shipper continues to need the key for uploads.
2. helm/templates/webhook-deploy.yaml: removed the cloudzero-api-key
   volumeMount and volume blocks (both gated on existingSecretName or
   apiKey).
3. app/config/webhook/settings.go: removed the unexported
   RemoteWrite.apiKey field, the GetAPIKey/SetAPIKey methods, the
   absFilePath and isValidURL helpers, and the Settings.mu mutex (it
   only protected the API-key read/write pair). Simplified
   setRemoteWriteURL() to assign Destination verbatim with a single URL
   validation; the aggregator's /collector handler does not parse
   cluster_name/cloud_account_id/region query parameters (only the
   shipper consumes those, on its own outbound path). The APIKeyPath
   field is kept as a deprecated tombstone so legacy server-config.yaml
   files and pod specs that still set API_KEY_PATH parse cleanly even
   under future strict-mode YAML/env decoders.
4. app/domain/pusher/pusher.go: removed the GetAPIKey lookup, the
   empty-key error branch, the Authorization header, and the apiKey
   parameter from pushMetrics.
5. app/functions/webhook/main.go: removed the monitor.NewSecretMonitor
   instantiation and its Run/Shutdown wiring. The monitor package is
   still imported for TLS reloading on SIGHUP, which is unrelated.
   Also updated the startup log message from "Starting CloudZero
   Insights Controller" to "Starting CloudZero Webhook Server" to use
   the current component name.
6. app/domain/monitor/secret_monitor_test.go: replaced the test's
   dependency on the webhook config.Settings (which no longer
   implements MonitoredAPIKey) with a small in-test file-backed fake.
7. app/config/webhook/settings_test.go and
   app/domain/pusher/pusher_test.go: removed API-key fixtures; the
   pusher test now positively asserts that no Authorization header is
   sent.
8. helm/tests/webhook_no_apikey_mount_test.yaml: new helm-unittest suite
   asserting the rendered webhook Deployment has no cloudzero-api-key
   volume or volumeMount, and the rendered webhook ConfigMap contains
   no api_key_path field and no cluster_name/cloud_account_id query
   string fragments.

Validation:

- Go unit tests pass for app/config/webhook, app/domain/pusher, and
  app/domain/monitor.
- make lint-go is clean; make helm-lint is clean; make helm-test
  passes, including the new helm-unittest suite.
- Generated chart manifests under tests/helm/template/ show only the
  intended deletions across all six variants (alloy, cert-manager,
  federated, istio, kubestate, manifest): one api_key_path: line in
  the webhook ConfigMap, one four-line volumeMount block, and one
  three-line volume block per variant. No other resources change.
- KIND end-to-end: deployed the freshly built image, confirmed the
  webhook pod has no cloudzero-api-key volume or volumeMount, the
  rendered server-config.yaml has no api_key_path, and the destination
  URL has no query string. Exercised the admission flow with test pods
  and confirmed the pusher's "Pushing records to remote write endpoint"
  log fires with records flowing to the in-cluster aggregator.
- Development EKS cluster end-to-end: helm upgrade rolled cleanly,
  webhook started with no API-key plumbing and no startup errors,
  admission requests exercised, "Pushing records to remote write
  endpoint" log confirmed the full data path through the new pusher
  code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant