dmepham approved these changes (Apr 30, 2026)
The backfill job was reusing the webhook server's in-memory SQLite store, which accumulated a ResourceTags row for every discovered resource. The pusher read them back out, sent them to the collector, and the housekeeper eventually deleted them. For the webhook server this makes sense (async HTTP arrivals, retry on failure), but for the backfill job it's pure overhead: every record was persisted, queried, updated, and deleted for no benefit. Memory grew linearly at ~0.5 KiB/resource, reaching 597 MiB at 1M resources.
Functional Change:
Before: The backfill job wrote every discovered resource to an in-memory SQLite database, then relied on the pusher (a periodic background goroutine) to drain and forward records to the collector. The pusher, housekeeper, and secret monitor all ran unnecessarily during backfill.
After: The backfill job uses a streaming store that buffers records in a slice and sends them directly to the collector in batches of 500 via remote_write. No SQLite, no pusher, no housekeeper, no secret monitor. Memory is flat at ~53 MiB regardless of cluster size.
Root Cause:
The backfill job shared the webhook server's full initialization path (SQLite store, pusher, housekeeper, secret monitor, webhook controller) even though backfill only needs the webhook controller and a way to send data. The SQLite store's FindFirstBy + Create/Update per record, plus the pusher's periodic polling, were unnecessary for a batch job that processes resources in a known order and exits.
Solution:
1. Extracted the pusher's format/send logic into public package-level functions in app/domain/pusher/sender.go (FormatMetrics, PushMetrics). The streaming store needs to serialize records and POST them to the collector, but can't use the MetricsPusher directly (it requires a store, starts a background goroutine, etc.). The existing private methods on MetricsPusher (formatMetrics, pushMetrics, createTimeseries, constructMetricTagName, maxTime) are replaced with one-line delegates to the new public functions; this is why pusher.go shows a large deletion. The logic is unchanged, just moved to sender.go.
2. Created app/storage/streaming/store.go implementing types.ResourceStore. On Create(), records are appended to a batch slice. When the batch reaches 500 records, it flushes: formats to protobuf timeseries, snappy-compresses, and POSTs to the collector endpoint. FindFirstBy/FindAllBy return not-found/empty (backfill never queries the store). Tx() is a passthrough.
3. Restructured app/functions/webhook/main.go so the backfill branch comes first and only creates what it needs: streaming store, webhook controller, K8s client, enumerator. The webhook server branch keeps the full setup (SQLite store, pusher, housekeeper, secret monitor, TLS server).
Validation:
End-to-end test in app/storage/streaming/store_test.go: creates 2 namespaces + 1 pod in a fake clientset, runs the backfiller through the streaming store, and verifies the collector sink received the expected timeseries with correct metric names, label filtering, and metadata:

- cloudzero_namespace_labels for "production" has label_environment=prod and label_team=platform, and does NOT have label_unmatched (filtered)
- cloudzero_pod_labels for "web-1" has label_app=web and label_team=platform
- 6 timeseries total (labels + annotations for 2 namespaces and 1 pod)
Out-of-process RSS comparison (streaming store vs previous SQLite store):
| Resources | Streaming store | SQLite store (before) |
|-----------|-----------------|-----------------------|
| 11k       | 53 MiB          | 60 MiB                |
| 22k       | 53 MiB          | 66 MiB                |
Memory is flat — records are sent and discarded in batches, not accumulated. The previous SQLite store grew linearly with resource count.
All existing unit tests pass (make test). All binaries build (make build).