Silence autoscaler empty-prom error #388

Merged
simonsmallchua merged 2 commits into main from work/autoscaler-promql-vector0
May 13, 2026

@simonsmallchua
Contributor

@simonsmallchua simonsmallchua commented May 12, 2026

Summary

  • Wrap the fly-autoscaler PromQL in both fly.autoscaler-worker.toml and fly.autoscaler-analysis.toml with or on() vector(0) so an empty result collapses to zero rather than logging metrics collection failed: empty prometheus result once per minute.
  • The broker gauges (bee_broker_stream_length, bee_broker_scheduled_zset_depth) are synchronous OTel Int64Gauges — LastValue aggregation only emits a sample when Record() is called inside a collect interval, so the series goes stale in Fly's managed Prometheus during idle. Grafana Cloud confirms the same gappy pattern across all three queried metrics.
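
The fallback behaviour of `or on() vector(0)` can be sketched with a toy model. This is a simplified illustration of the PromQL `or` operator, not the Prometheus implementation; the `Sample` type and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

# A sample is (labels, value); an instant vector is a list of samples.
Sample = Tuple[Dict[str, str], float]


def or_on_vector0(left: List[Sample]) -> List[Sample]:
    """Toy model of `<left> or on() vector(0)`.

    `or` keeps every sample from the left side, then adds right-side samples
    whose match key is absent on the left. With `on()` the match key is the
    empty label set, so any left-side sample suppresses the fallback.
    """
    fallback: List[Sample] = [({}, 0.0)]  # what vector(0) evaluates to
    return left if left else fallback
```

With a stale series the left side is empty and the query yields a single zero-valued sample; when the gauge has data, the original samples pass through unchanged.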

Trade-off (documented inline + in CHANGELOG)

A genuine Redis outage previously produced a series gap (autoscaler holds machine count). With this change the gap collapses to 0, so the autoscaler scales to MIN=1. Acceptable because idle workers can't crawl during an outage anyway and restart cleanly once Redis recovers.
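
A minimal sketch of the trade-off, assuming a simple backlog-per-machine scaling rule. The actual autoscaler expression is not shown in this PR, so `jobs_per_machine` and the function name are hypothetical; only the hold-on-gap behaviour and the floor of 1 come from the description.

```python
import math
from typing import Optional


def next_machine_count(
    current: int,
    backlog: Optional[float],
    jobs_per_machine: float,  # hypothetical tuning parameter
    min_count: int = 1,
) -> int:
    """Return the machine count the autoscaler would aim for.

    backlog=None models a series gap (metrics collection fails, count is held).
    A numeric backlog, including the 0 produced by the vector(0) fallback,
    is scaled normally and floored at min_count.
    """
    if backlog is None:
        return current
    return max(min_count, math.ceil(backlog / jobs_per_machine))
```

Before the change a Redis outage looked like `backlog=None` and the count was held; after it, the same outage looks like `backlog=0` and the count falls to `min_count`.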

The proper fix — converting the broker gauges to async Int64ObservableGauge so they always emit at every collect — will be tracked in a follow-up issue.
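
The difference between the two gauge kinds can be illustrated with a toy model. This is not the OpenTelemetry SDK: the classes below are hypothetical stand-ins that only mimic the behaviour described above, where a synchronous gauge exports nothing unless it was recorded during the interval and an observable gauge runs its callback at every collect.

```python
from typing import Callable, List, Optional


class SyncGauge:
    """Toy synchronous gauge: exports only if record() ran since the last collect."""

    def __init__(self) -> None:
        self._pending: Optional[int] = None

    def record(self, value: int) -> None:
        self._pending = value

    def collect(self) -> Optional[int]:
        # Hand out the pending value once, then clear it.
        value, self._pending = self._pending, None
        return value  # None means no data point for this interval


class ObservableGauge:
    """Toy asynchronous gauge: the callback is invoked on every collect."""

    def __init__(self, callback: Callable[[], int]) -> None:
        self._callback = callback

    def collect(self) -> Optional[int]:
        return self._callback()


def simulate_sync(recorded: List[Optional[int]]) -> List[Optional[int]]:
    """Run one collect per interval; None in `recorded` means an idle interval."""
    gauge = SyncGauge()
    exported: List[Optional[int]] = []
    for value in recorded:
        if value is not None:
            gauge.record(value)
        exported.append(gauge.collect())
    return exported
```

In this model an idle period shows up as `None` entries from `simulate_sync`, which corresponds to the stale series, while an `ObservableGauge` keeps producing a value at every collect.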

Test plan

  • CI green.
  • After deploy: flyctl logs -a hover-autoscaler-worker and -a hover-autoscaler-analysis for ~30 min during idle — empty prometheus result log lines should drop to zero.
  • Sanity check: flyctl status -a hover-worker still shows 1 started machine; -a hover-analysis still shows 1 started.
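
The log check in the test plan could be scripted along these lines. This is a sketch that assumes the log output has already been captured as text; the only detail taken from the PR is the message substring, and the function name is hypothetical.

```python
def count_empty_result_lines(log_text: str) -> int:
    """Count log lines that contain the autoscaler's empty-result message."""
    needle = "empty prometheus result"
    return sum(1 for line in log_text.splitlines() if needle in line)
```

A count of zero over the observation window would match the expected post-deploy behaviour.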

Summary by CodeRabbit

  • Bug Fixes

    • Resolved autoscaler logging errors when metrics data is unavailable. Missing broker metrics are now gracefully treated as zero values instead of generating "metrics collection failed" log entries.
  • Security

    • Updated PostgreSQL driver to address memory-safety concerns.
    • Resolved development dependency vulnerabilities in CLI tooling.

@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

Walkthrough

The PR silences the autoscaler's "empty prometheus result" errors, which stem from stale Prometheus gauges, by appending or on() vector(0) to the autoscaler queries. It also updates three security-sensitive dependencies and documents both changes in the changelog.

Changes

Security and autoscaler metrics fix

| Layer / File(s) | Summary |
| --- | --- |
| Autoscaler Prometheus empty-result handling: fly.autoscaler-analysis.toml, fly.autoscaler-worker.toml, CHANGELOG.md | FAS_PROMETHEUS_QUERY in both autoscaler configs now appends or on() vector(0) to convert empty Prometheus results to zero, eliminating "empty prometheus result" log errors. The fix is documented in the Fixed section of the changelog. |
| Security dependency version bumps: go.mod, webflow-designer-extension-cli/package.json, CHANGELOG.md | pgx/v5 is bumped from v5.7.6 to v5.9.2 for a memory-safety fix, @webflow/webflow-cli is upgraded from ^1.12.4 to ^1.21.0 for transitive vulnerability remediation, and the indirect golang.org/x/crypto v0.50.0 is removed. All changes are recorded in the new Security section of the changelog. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 Prometheus queries now return a zero,
No more empty logs to make us a hero,
Dependencies patched for safety so sound,
Staleness and vulnerabilities bound!
A quiet fix that scales our delight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Silence autoscaler empty-prom error' directly summarizes the main change: fixing the log noise from empty Prometheus results in the autoscaler by wrapping PromQL queries with 'or on() vector(0)'. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@supabase

supabase Bot commented May 12, 2026

Updates to Preview Branch (work/autoscaler-promql-vector0) ↗︎

Deployments Status Updated
Database Tue, 12 May 2026 10:40:12 UTC
Services Tue, 12 May 2026 10:40:12 UTC
APIs Tue, 12 May 2026 10:40:12 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Tue, 12 May 2026 10:40:23 UTC
Migrations Tue, 12 May 2026 10:41:13 UTC
Seeding Tue, 12 May 2026 10:41:22 UTC
Edge Functions Tue, 12 May 2026 10:41:22 UTC

@github-actions
Contributor

Release Versions

App patch: v0.34.14 → v0.34.15

Changelog

Fixed

  • fly-autoscaler no longer logs
    metrics collection failed: empty prometheus result once a minute on both
    hover-autoscaler-worker and hover-autoscaler-analysis. The broker gauges
    (bee_broker_stream_length, bee_broker_scheduled_zset_depth) are
    synchronous OTel Int64Gauges, which only emit when Record() lands inside a
    collect interval; during idle the series goes stale in Fly's managed
    Prometheus and the autoscaler's PromQL returns no result. The autoscaler
    queries now wrap with or on() vector(0) so an empty result collapses to zero
    rather than erroring. Scaling behaviour is unchanged at idle (the existing
    max(1, …) floor already kept a single machine running). Trade-off documented
    inline: a true Redis outage now reads 0 instead of producing a series gap,
    so the autoscaler scales to MIN=1 rather than holding count — acceptable
    because idle workers can't crawl during an outage anyway and restart cleanly
    once Redis recovers. The full fix (async observable gauges) is tracked in a
    follow-up issue.

Security

  • Bump github.com/jackc/pgx/v5 from v5.7.6 to v5.9.2 to resolve a
    memory-safety vulnerability (Dependabot alert #54).
  • Bump @webflow/webflow-cli from ^1.12.4 to ^1.21.0 in
    webflow-designer-extension-cli/ to clear transitive dev-dep vulnerabilities
    (axios, follow-redirects, fast-uri, babel, postcss). Webflow extension is
    dev-only tooling and does not ship to production.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
fly.autoscaler-worker.toml (1)

35-38: Add a dedicated outage signal now that gaps are masked.

Because lines 35-38 intentionally convert outage-like gaps into 0, consider alerting on a direct Redis health metric (or broker probe heartbeat) so outages remain immediately visible independent of backlog maths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fly.autoscaler-worker.toml` around lines 35 - 38, The backlog masking
currently converts outage-like gaps to 0 which hides Redis outages; add a
dedicated outage signal by instrumenting a direct Redis/broker health metric
(e.g., export a redis_up gauge or a broker_probe_heartbeat TTL counter)
alongside the existing autoscaler backlog logic, then add an alerting rule that
fires when redis_up == 0 or broker_probe_heartbeat has not been updated for X
seconds; keep the existing gap-to-0 behavior in the autoscaler (the "gap
masking" logic) but ensure monitoring/alerting uses the new
redis_up/broker_probe_heartbeat metric so outages remain immediately visible and
actionable.
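
The alert condition suggested above could look roughly like the following sketch. The metric names `redis_up` and `broker_probe_heartbeat` come from the review comment and do not exist in the codebase yet; the staleness threshold is left as a required parameter because the comment does not specify it, and treating a missing sample as unhealthy is an assumption of this sketch.

```python
from typing import Optional


def outage_alert_firing(
    redis_up: Optional[int],
    heartbeat_age_seconds: Optional[float],
    max_heartbeat_age_seconds: float,
) -> bool:
    """Decide whether the outage alert should fire.

    None stands for a missing sample. Missing data is treated as unhealthy
    so that a silent exporter still raises the alert.
    """
    if redis_up is None or redis_up == 0:
        return True
    if heartbeat_age_seconds is None:
        return True
    return heartbeat_age_seconds > max_heartbeat_age_seconds
```

Keeping this check separate from the autoscaler query means the gap-to-0 fallback can stay in place without hiding an outage from monitoring.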

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 223c7f38-d49c-4908-aa13-663fd2f24d2d

📥 Commits

Reviewing files that changed from the base of the PR and between cf08a89 and 4442ed3.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • webflow-designer-extension-cli/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (5)
  • CHANGELOG.md
  • fly.autoscaler-analysis.toml
  • fly.autoscaler-worker.toml
  • go.mod
  • webflow-designer-extension-cli/package.json

@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

@github-actions
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-388.fly.dev
Dashboard: https://hover-pr-388.fly.dev/dashboard

@simonsmallchua simonsmallchua merged commit c6e735f into main May 13, 2026
21 checks passed
@simonsmallchua simonsmallchua deleted the work/autoscaler-promql-vector0 branch May 13, 2026 02:22
simonsmallchua added a commit that referenced this pull request May 13, 2026