CP-40824: Replace SQLite store with streaming store in backfill job #803

Open
evan-cz wants to merge 1 commit into develop from CP-40954

Conversation

@evan-cz (Contributor) commented on Apr 30, 2026

The backfill job was reusing the webhook server's in-memory SQLite store, which accumulated a ResourceTags row for every discovered resource. The pusher read them back out, sent them to the collector, and the housekeeper eventually deleted them. For the webhook server this makes sense (async HTTP arrivals, retry on failure), but for the backfill job it's pure overhead: every record was persisted, queried, updated, and deleted for no benefit. Memory grew linearly at ~0.5 KiB/resource, reaching 597 MiB at 1M resources.

Functional Change:

Before: The backfill job wrote every discovered resource to an in-memory SQLite database, then relied on the pusher (a periodic background goroutine) to drain and forward records to the collector. The pusher, housekeeper, and secret monitor all ran unnecessarily during backfill.

After: The backfill job uses a streaming store that buffers records in a slice and sends them directly to the collector in batches of 500 via remote_write. No SQLite, no pusher, no housekeeper, no secret monitor. Memory is flat at ~53 MiB regardless of cluster size.

Root Cause:

The backfill job shared the webhook server's full initialization path (SQLite store, pusher, housekeeper, secret monitor, webhook controller) even though backfill only needs the webhook controller and a way to send data. The SQLite store's FindFirstBy + Create/Update per record, plus the pusher's periodic polling, were unnecessary for a batch job that processes resources in a known order and exits.

Solution:

  1. Extracted the pusher's format/send logic into public package-level functions in app/domain/pusher/sender.go (FormatMetrics, PushMetrics). The streaming store needs to serialize records and POST them to the collector, but can't use the MetricsPusher directly (it requires a store, starts a background goroutine, etc.). The existing private methods on MetricsPusher (formatMetrics, pushMetrics, createTimeseries, constructMetricTagName, maxTime) are replaced with one-line delegates to the new public functions; this is why pusher.go shows a large deletion even though the logic is unchanged, just moved to sender.go. A hedged sketch of the delegate pattern follows this list.

  2. Created app/storage/streaming/store.go implementing types.ResourceStore. On Create(), records are appended to a batch slice. When the batch reaches 500 records, it flushes: formats to protobuf timeseries, snappy-compresses, and POSTs to the collector endpoint. FindFirstBy/FindAllBy return not-found/empty (backfill never queries the store). Tx() is a passthrough. The batching shape is sketched after this list.

  3. Restructured app/functions/webhook/main.go so the backfill branch comes first and only creates what it needs: streaming store, webhook controller, K8s client, enumerator. The webhook server branch keeps the full setup (SQLite store, pusher, housekeeper, secret monitor, TLS server). The control flow is sketched after this list.

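For step 1, a minimal sketch of the delegate pattern. The Record type, the endpoint field, and the HTTP handling are illustrative stand-ins, not the repo's actual types or signatures:

```go
// Sketch of the refactor in app/domain/pusher: private methods become
// one-line delegates to new public package-level functions.
package pusher

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// Record stands in for a stored ResourceTags row (illustrative).
type Record struct {
	Name   string
	Labels map[string]string
}

// FormatMetrics serializes records into a remote_write payload.
// The protobuf/snappy details are elided in this sketch.
func FormatMetrics(records []Record) ([]byte, error) {
	payload := []byte{} // build timeseries, marshal, snappy-compress (elided)
	return payload, nil
}

// PushMetrics POSTs a formatted payload to the collector endpoint.
func PushMetrics(ctx context.Context, endpoint string, payload []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("collector returned %s", resp.Status)
	}
	return nil
}

// MetricsPusher keeps its public surface; the old private methods become
// one-line delegates, so existing callers are untouched.
type MetricsPusher struct{ endpoint string }

func (p *MetricsPusher) formatMetrics(records []Record) ([]byte, error) {
	return FormatMetrics(records)
}

func (p *MetricsPusher) pushMetrics(ctx context.Context, payload []byte) error {
	return PushMetrics(ctx, p.endpoint, payload)
}
```

For step 2, a hedged sketch of the streaming store's batching. The Sender type stands in for the FormatMetrics + PushMetrics pipeline, and the real types.ResourceStore interface has more methods than shown here:

```go
// Sketch of app/storage/streaming/store.go's batch-and-flush behavior.
package streaming

import (
	"context"
	"sync"
)

const batchSize = 500 // flush threshold from the PR description

type Record struct {
	Name   string
	Labels map[string]string
}

// Sender abstracts format + POST to the collector (assumed signature).
type Sender func(ctx context.Context, records []Record) error

type Store struct {
	mu    sync.Mutex
	batch []Record
	send  Sender
}

func NewStore(send Sender) *Store {
	return &Store{batch: make([]Record, 0, batchSize), send: send}
}

// Create appends to the in-memory batch and flushes at the threshold;
// nothing is ever persisted or queried back.
func (s *Store) Create(ctx context.Context, r Record) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batch = append(s.batch, r)
	if len(s.batch) >= batchSize {
		return s.flushLocked(ctx)
	}
	return nil
}

// Flush sends any partial batch; the backfill job calls this before exiting.
func (s *Store) Flush(ctx context.Context) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.flushLocked(ctx)
}

func (s *Store) flushLocked(ctx context.Context) error {
	if len(s.batch) == 0 {
		return nil
	}
	err := s.send(ctx, s.batch)
	s.batch = s.batch[:0] // reuse the slice; this sketch drops the batch on error
	return err
}

// FindFirstBy/FindAllBy would return not-found/empty, and Tx would be a
// passthrough, per the description; they are omitted from this sketch.
```

Resetting the slice in place rather than reallocating is what keeps RSS flat: at most batchSize records are alive at any moment, regardless of cluster size.

For step 3, the control-flow change, with every name below a stand-in for the repo's actual API:

```go
// Sketch of the restructured main: backfill branch first, early return.
package main

import (
	"context"
	"os"
)

type streamingStore struct{}

func newStreamingStore() *streamingStore                  { return &streamingStore{} }
func (s *streamingStore) Flush(ctx context.Context) error { return nil }
func runBackfill(ctx context.Context, s *streamingStore)  {} // enumerate resources, Create() each
func runWebhookServer(ctx context.Context)                {} // SQLite store, pusher, housekeeper, secret monitor, TLS server

func main() {
	ctx := context.Background()
	if os.Getenv("BACKFILL") == "true" { // stand-in for the real mode switch
		store := newStreamingStore()
		runBackfill(ctx, store) // only streaming store, webhook controller, K8s client, enumerator
		_ = store.Flush(ctx)    // drain the final partial batch, then exit
		return
	}
	runWebhookServer(ctx) // the server branch keeps the full setup
}
```
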
Validation:

End-to-end test in app/storage/streaming/store_test.go: creates 2 namespaces + 1 pod in a fake clientset, runs the backfiller through the streaming store, and verifies the collector sink received the expected timeseries with correct metric names, label filtering, and metadata (the sink side is sketched after this list):

  • cloudzero_namespace_labels for "production" has label_environment=prod, label_team=platform, and does NOT have label_unmatched (filtered)
  • cloudzero_pod_labels for "web-1" has label_app=web, label_team=platform
  • 6 timeseries total (labels + annotations for 2 namespaces and 1 pod)
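
For reference, a minimal sketch of a collector sink compatible with this kind of test, assuming the usual Prometheus remote_write stack (github.com/golang/snappy, github.com/gogo/protobuf, prompb); the repo's actual test helper may differ:

```go
package streaming_test

import (
	"io"
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// newCollectorSink returns an httptest server that decodes remote_write
// payloads and appends the received timeseries to got for assertions.
func newCollectorSink(t *testing.T, got *[]prompb.TimeSeries) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			t.Error(err)
			return
		}
		raw, err := snappy.Decode(nil, body) // remote_write bodies are snappy-compressed
		if err != nil {
			t.Error(err)
			return
		}
		var req prompb.WriteRequest
		if err := proto.Unmarshal(raw, &req); err != nil {
			t.Error(err)
			return
		}
		*got = append(*got, req.Timeseries...)
	}))
}
```

Assertions like the three listed above would then run against got once the store flushes.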

Out-of-process RSS comparison (streaming store vs previous SQLite store):

| Resources | Streaming store | SQLite store (before) |
|-----------|-----------------|-----------------------|
| 11k       | 53 MiB          | 60 MiB                |
| 22k       | 53 MiB          | 66 MiB                |

Memory is flat — records are sent and discarded in batches, not accumulated. The previous SQLite store grew linearly with resource count.

All existing unit tests pass (make test). All binaries build (make build).

evan-cz requested a review from a team as a code owner on April 30, 2026 at 14:24