CP-40824: Replace SQLite store with streaming store in backfill job #803

Open
evan-cz wants to merge 1 commit into develop from CP-40954

Conversation

@evan-cz (Contributor) commented on Apr 30, 2026

The backfill job was reusing the webhook server's in-memory SQLite store, which accumulated a ResourceTags row for every discovered resource. The pusher read them back out, sent them to the collector, and the housekeeper eventually deleted them. For the webhook server this makes sense (async HTTP arrivals, retry on failure), but for the backfill job it's pure overhead: every record was persisted, queried, updated, and deleted for no benefit. Memory grew linearly at ~0.5 KiB/resource, reaching 597 MiB at 1M resources.

Functional Change:

Before: The backfill job wrote every discovered resource to an in-memory SQLite database, then relied on the pusher (a periodic background goroutine) to drain and forward records to the collector. The pusher, housekeeper, and secret monitor all ran unnecessarily during backfill.

After: The backfill job uses a streaming store that buffers records in a slice and sends them directly to the collector in batches of 500 via remote_write. No SQLite, no pusher, no housekeeper, no secret monitor. Memory is flat at ~53 MiB regardless of cluster size.

Root Cause:

The backfill job shared the webhook server's full initialization path (SQLite store, pusher, housekeeper, secret monitor, webhook controller) even though backfill only needs the webhook controller and a way to send data. The SQLite store's FindFirstBy + Create/Update per record, plus the pusher's periodic polling, were unnecessary for a batch job that processes resources in a known order and exits.

Solution:

  1. Extracted the pusher's format/send logic into public package-level functions in app/domain/pusher/sender.go (FormatMetrics, PushMetrics). The streaming store needs to serialize records and POST them to the collector, but can't use the MetricsPusher directly (it requires a store, starts a background goroutine, etc.). The existing private methods on MetricsPusher (formatMetrics, pushMetrics, createTimeseries, constructMetricTagName, maxTime) are replaced with one-line delegates to the new public functions; this is why pusher.go shows a large deletion even though the logic is unchanged, just moved to sender.go. A hedged sketch of the delegate pattern follows this list.

  2. Created app/storage/streaming/store.go implementing types.ResourceStore. On Create(), records are appended to a batch slice. When the batch reaches 500 records, it flushes: formats to protobuf timeseries, snappy-compresses, and POSTs to the collector endpoint. FindFirstBy/FindAllBy return not-found/empty (backfill never queries the store). Tx() is a passthrough. The batching shape is sketched after this list.

  3. Restructured app/functions/webhook/main.go so the backfill branch comes first and only creates what it needs: streaming store, webhook controller, K8s client, enumerator. The webhook server branch keeps the full setup (SQLite store, pusher, housekeeper, secret monitor, TLS server). The control flow is sketched after this list.

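For step 1, a minimal sketch of the delegate pattern. The Record type, the endpoint field, and the HTTP handling are illustrative stand-ins, not the repo's actual types or signatures:

```go
// Sketch of the refactor in app/domain/pusher: private methods become
// one-line delegates to new public package-level functions.
package pusher

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// Record stands in for a stored ResourceTags row (illustrative).
type Record struct {
	Name   string
	Labels map[string]string
}

// FormatMetrics serializes records into a remote_write payload.
// The protobuf/snappy details are elided in this sketch.
func FormatMetrics(records []Record) ([]byte, error) {
	payload := []byte{} // build timeseries, marshal, snappy-compress (elided)
	return payload, nil
}

// PushMetrics POSTs a formatted payload to the collector endpoint.
func PushMetrics(ctx context.Context, endpoint string, payload []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("collector returned %s", resp.Status)
	}
	return nil
}

// MetricsPusher keeps its public surface; the old private methods become
// one-line delegates, so existing callers are untouched.
type MetricsPusher struct{ endpoint string }

func (p *MetricsPusher) formatMetrics(records []Record) ([]byte, error) {
	return FormatMetrics(records)
}

func (p *MetricsPusher) pushMetrics(ctx context.Context, payload []byte) error {
	return PushMetrics(ctx, p.endpoint, payload)
}
```

For step 2, a hedged sketch of the streaming store's batching. The Sender type stands in for the FormatMetrics + PushMetrics pipeline, and the real types.ResourceStore interface has more methods than shown here:

```go
// Sketch of app/storage/streaming/store.go's batch-and-flush behavior.
package streaming

import (
	"context"
	"sync"
)

const batchSize = 500 // flush threshold from the PR description

type Record struct {
	Name   string
	Labels map[string]string
}

// Sender abstracts format + POST to the collector (assumed signature).
type Sender func(ctx context.Context, records []Record) error

type Store struct {
	mu    sync.Mutex
	batch []Record
	send  Sender
}

func NewStore(send Sender) *Store {
	return &Store{batch: make([]Record, 0, batchSize), send: send}
}

// Create appends to the in-memory batch and flushes at the threshold;
// nothing is ever persisted or queried back.
func (s *Store) Create(ctx context.Context, r Record) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batch = append(s.batch, r)
	if len(s.batch) >= batchSize {
		return s.flushLocked(ctx)
	}
	return nil
}

// Flush sends any partial batch; the backfill job calls this before exiting.
func (s *Store) Flush(ctx context.Context) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.flushLocked(ctx)
}

func (s *Store) flushLocked(ctx context.Context) error {
	if len(s.batch) == 0 {
		return nil
	}
	err := s.send(ctx, s.batch)
	s.batch = s.batch[:0] // reuse the slice; this sketch drops the batch on error
	return err
}

// FindFirstBy/FindAllBy would return not-found/empty, and Tx would be a
// passthrough, per the description; they are omitted from this sketch.
```

Resetting the slice in place rather than reallocating is what keeps RSS flat: at most batchSize records are alive at any moment, regardless of cluster size.

For step 3, the control-flow change, with every name below a stand-in for the repo's actual API:

```go
// Sketch of the restructured main: backfill branch first, early return.
package main

import (
	"context"
	"os"
)

type streamingStore struct{}

func newStreamingStore() *streamingStore                  { return &streamingStore{} }
func (s *streamingStore) Flush(ctx context.Context) error { return nil }
func runBackfill(ctx context.Context, s *streamingStore)  {} // enumerate resources, Create() each
func runWebhookServer(ctx context.Context)                {} // SQLite store, pusher, housekeeper, secret monitor, TLS server

func main() {
	ctx := context.Background()
	if os.Getenv("BACKFILL") == "true" { // stand-in for the real mode switch
		store := newStreamingStore()
		runBackfill(ctx, store) // only streaming store, webhook controller, K8s client, enumerator
		_ = store.Flush(ctx)    // drain the final partial batch, then exit
		return
	}
	runWebhookServer(ctx) // the server branch keeps the full setup
}
```
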
Validation:

End-to-end test in app/storage/streaming/store_test.go: creates 2 namespaces + 1 pod in a fake clientset, runs the backfiller through the streaming store, and verifies the collector sink received the expected timeseries with correct metric names, label filtering, and metadata (the sink side is sketched after this list):

  • cloudzero_namespace_labels for "production" has label_environment=prod, label_team=platform, and does NOT have label_unmatched (filtered)
  • cloudzero_pod_labels for "web-1" has label_app=web, label_team=platform
  • 6 timeseries total (labels + annotations for 2 namespaces and 1 pod)
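
For reference, a minimal sketch of a collector sink compatible with this kind of test, assuming the usual Prometheus remote_write stack (github.com/golang/snappy, github.com/gogo/protobuf, prompb); the repo's actual test helper may differ:

```go
package streaming_test

import (
	"io"
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// newCollectorSink returns an httptest server that decodes remote_write
// payloads and appends the received timeseries to got for assertions.
func newCollectorSink(t *testing.T, got *[]prompb.TimeSeries) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			t.Error(err)
			return
		}
		raw, err := snappy.Decode(nil, body) // remote_write bodies are snappy-compressed
		if err != nil {
			t.Error(err)
			return
		}
		var req prompb.WriteRequest
		if err := proto.Unmarshal(raw, &req); err != nil {
			t.Error(err)
			return
		}
		*got = append(*got, req.Timeseries...)
	}))
}
```

Assertions like the three listed above would then run against got once the store flushes.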

Out-of-process RSS comparison (streaming store vs previous SQLite store):

| Resources | Streaming store | SQLite store (before) |
|-----------|-----------------|-----------------------|
| 11k       | 53 MiB          | 60 MiB                |
| 22k       | 53 MiB          | 66 MiB                |

Memory is flat — records are sent and discarded in batches, not accumulated. The previous SQLite store grew linearly with resource count.

All existing unit tests pass (make test). All binaries build (make build).

evan-cz requested a review from a team as a code owner on April 30, 2026 at 14:24