Skip to content

StatsD instrumentation: row-copy, backlog, lag, throttle, cut-over, query latency, sleep#1701

Open
forge33 wants to merge 9 commits into
github:masterfrom
Shopify:shopify-statsd-metrics-upstream
Open

StatsD instrumentation: row-copy, backlog, lag, throttle, cut-over, query latency, sleep#1701
forge33 wants to merge 9 commits into
github:masterfrom
Shopify:shopify-statsd-metrics-upstream

Conversation

@forge33

@forge33 forge33 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Related issue: #1672

Description

This PR implements the bulk of the StatsD instrumentation spec from #1672, building on the already-merged StatsD client (#1689) and Go runtime metrics (#1690). It adds metrics for row-copy progress, binlog backlog, replication/heartbeat lag, throttle activity, cut-over phases, query latency, and per-stage sleeps.

These changes were developed and proven out in production on Shopify's fork of gh-ost over the last few weeks. They're being upstreamed here, reorganized into focused, commit-by-commit changes for easier review:

  1. Introduce unified Emitter interface; consolidate go_runtime helpersmetrics.Emitter (Gauge / Count / Histogram) replaces the narrower MemStatsGaugeEmitter; EmitGoRuntimeGauges + StartGoRuntimeReporter move into emit.go so the metrics package has a single helpers file. Switches MigrationContext.Metrics to metrics.Emitter so consumers are testable with spies.
  2. Refactor status reporting around migrationProgressSnapshot — sample row-copy, DML, backlog, and lag once per tick into a snapshot passed to printStatus, with reportStatus as the single entry point for status output. Prerequisite for the next several commits (so metrics emit on every tick even when status printing is suppressed).
  3. Emit row-copy progress gaugesrow_copy.rows_copied, row_copy.rows_estimate, dml.events_applied.
  4. Emit binlog backlog gaugesbinlog.backlog_size, binlog.backlog_capacity, binlog.backlog_utilization (size / cap, clamped to [0, 1]).
  5. Emit replication and heartbeat lag gaugeslag.replication_seconds, lag.heartbeat_seconds, both tagged throttled:true|false.
  6. Emit throttle metricsthrottle.active gauge (debounced to 1Hz), throttle.duration_milliseconds histogram + throttle.events_total count on each throttled-interval exit, tagged reason:.
  7. Emit cut-over phase, attempt, and total duration metricscut_over.attempts_total (tagged outcome:success|retry|abort), cut_over.phase_duration_milliseconds (tagged phase: covering magic_lock, original_table_lock, magic_rename, unlock), cut_over.total_duration_milliseconds. Atomic rename phase duration is recorded after the rename completes so the histogram reflects the full operation.
  8. Emit query latency histogramquery.duration_milliseconds, tagged side:source|target, kind:chunk_copy|range_select|binlog_apply|row_count, outcome:ok|error.
  9. Emit per-stage sleep histogramssleep.duration_milliseconds + sleep.total_milliseconds, tagged stage:cut_over_postpone|chunk_throttle|retry_backoff|replica_wait. Sub-millisecond chunk-throttle samples are skipped to avoid zero-valued histogram entries from sub-microsecond nice-ratio waits.

Notes on divergences from the original spec in #1672

A few intentional differences from the initial spec, agreed in the issue thread:

  • _milliseconds units instead of _seconds for query.duration, cut_over.phase_duration, cut_over.total_duration, sleep.duration, sleep.total, and throttle.duration — sub-second waits (replica polling at 500ms, nice-ratio chunk throttling) truncate to zero in seconds. The lag metrics keep _seconds because they're whole-seconds in operator intuition and match the existing status log line.
  • lag.* as gauges, not histograms — these are point-in-time readings sampled once per status tick, not a distribution. A gauge gives "the lag right now" which matches the Lag: 2.5s status output. DogStatsD histogram aggregation exposes count/max series that don't correspond to those values.
  • range_select tagged side:sourceCalculateNextIterationRangeEndValues lives in Applier but reads chunk boundaries from the source table.
  • No kind:heartbeat_read query metric — would fire on every throttle tick and measures the act of reading lag rather than the lag itself, which is already exposed via gh_ost.lag.*.

The issue has been updated to reflect these.

How to review

The PR is intentionally structured as a commit-by-commit story. Reviewing per commit will be much easier than reading the combined diff.

  1. Start with commit 1 (metrics: introduce unified Emitter interface…). This is the foundation — confirm the Emitter contract and that go_runtime is just being moved, not behaviour-changed.

  2. Commit 2 (Refactor status reporting around migrationProgressSnapshot) is the only non-metric commit. It's a pure refactor of printStatus to sample once per tick. Worth confirming the status output is identical for all PrintStatusRule values and that the snapshot test coverage in progress_snapshot_test.go matches the previous inline logic.

  3. Commits 3–9 each add one metric (or metric group). The pattern is the same every time:

    • A small helper in go/metrics/emit.go with a nil-safe guard.
    • Unit tests for the helper in go/metrics/emit_test.go (including a _nilSafe test).
    • Wiring at the call site (migrator / throttler / applier / inspector).
    • Integration-style test using a spy Emitter to assert the wiring fires correctly.

    Each commit builds and tests independently — git checkout <sha> && go test ./... will work at every step.

  4. Cross-check the wiring against the metric table in StatsD instrumentation #1672 (updated to match). The "Notes on divergences" below capture the four deliberate deltas.

  5. Look at the cut-over commit (7) carefully — it's the most invasive in migrator.go because cut-over has three phases (cutOver, cutOverTwoStep, atomicCutOver) and cutOverOperationWithMetrics wraps the retrier. Tests cover the retry→success and immediate-abort paths.

The full file diff stat is small (~1.2K insertions, mostly tests):

go/base/context.go                 |   2 +-
go/logic/applier.go                |  12 ++
go/logic/inspect.go                |   4 +
go/logic/migrator.go               | 164 +++++++++-----
go/logic/migrator_test.go          | 197 +++++++++++++++++
go/logic/progress_snapshot.go      |  74 +++++++
go/logic/progress_snapshot_test.go |  76 +++++++
go/logic/throttler.go              |  37 +++-
go/logic/throttler_test.go         |  51 +++++
go/metrics/client.go               |  15 ++
go/metrics/emit.go                 | 197 +++++++++++++++++
go/metrics/emit_test.go            | 429 +++++++++++++++++++++++++++++++++++++
go/metrics/go_runtime.go           |  61 ------
go/metrics/go_runtime_test.go      |  67 ------

Test plan

  • go build ./... — clean.
  • go test ./... — 212 tests pass locally, including new unit coverage for every helper and every wiring point. Each helper is nil-safe and has a _nilSafe test.
  • Manually verified the metrics end-to-end against a local DogStatsD listener while running migrations on Shopify's fork over the past month.

Some screenshots from a local grafana dashboard (I didn't take full time to configure the histogram prometheus queries properly for this, just making sure that we're emitting and the numbers look correct)
Screenshot 2026-06-02 at 9 59 10 PM
Screenshot 2026-06-02 at 9 58 34 PM
Screenshot 2026-06-02 at 9 57 42 PM
Screenshot 2026-06-02 at 9 56 09 PM
Screenshot 2026-05-28 at 2 44 17 PM
Screenshot 2026-05-27 at 10 40 40 AM

Commit structure

Per the contributing guide's request for focused changes, each commit is independently buildable and tested. Commits 2–9 are stacked dependencies and grouped here so the metrics package and its consumers land together rather than as a series of half-wired PRs.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

forge33 and others added 9 commits June 8, 2026 15:57
…me helpers

Introduce a single Emitter interface (Gauge, Count, Histogram) on metrics.Client
so all metric helpers share one testable contract. The interface replaces the
narrow MemStatsGaugeEmitter type used only by the Go runtime reporter.

- Add Emitter interface and Histogram method on Client.
- Move go_runtime.go and go_runtime_test.go into emit.go and emit_test.go so the
  metrics package has a single place for helper functions and tests.
- Switch MigrationContext.Metrics from *metrics.Client to metrics.Emitter so
  consumers can be tested with spies.
Sample row-copy, DML, backlog, and lag once per tick into a snapshot
passed to printStatus, with reportStatus as the single entry point for
status output.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add gh_ost.row_copy.rows_copied, gh_ost.row_copy.rows_estimate, and
gh_ost.dml.events_applied gauges on each reportStatus tick, sampled from
migrationProgressSnapshot.
Add gh_ost.binlog.backlog_size, gh_ost.binlog.backlog_capacity, and
gh_ost.binlog.backlog_utilization gauges on each reportStatus tick from
the applyEventsQueue depth captured in migrationProgressSnapshot.
Add gh_ost.lag.replication_seconds and gh_ost.lag.heartbeat_seconds gauges
on each reportStatus tick, tagged with throttled:true|false. These are
point-in-time readings (not distributions), so gauges are used rather than
histograms — DogStatsD histogram aggregation exposes count/max series that
do not match the log line lag values in Prometheus/Grafana.
Record throttle active state at a debounced cadence (gh_ost.throttle.active)
and emit duration plus event metrics when a throttled interval completes
(gh_ost.throttle.duration_seconds histogram and gh_ost.throttle.events_total
count), each tagged with the throttling reason.
Add cut-over metric helpers and instrument cut-over attempts, phase
durations, and terminal duration.

Metrics emitted:
- gh_ost.cut_over.attempts_total tagged with outcome
- gh_ost.cut_over.phase_duration_milliseconds tagged with phase and outcome
- gh_ost.cut_over.total_duration_milliseconds tagged with outcome

Phase coverage includes the magic lock, original table lock, magic rename,
and unlock paths. Durations are reported in milliseconds to preserve
sub-second granularity. The atomic rename phase duration is recorded after
the rename completes so the histogram reflects the full operation.
Emit gh_ost.query.duration_milliseconds for representative source-side and
target-side queries (row count and binlog apply), tagged with side, kind,
and outcome (ok|error). Helper validates inputs and is nil-safe.
Add sleep metric helpers and instrument the main migration sleep/wait paths.

Metrics emitted:
- gh_ost.sleep.duration_milliseconds tagged with stage
- gh_ost.sleep.total_milliseconds tagged with stage

Stages covered:
- cut_over_postpone
- chunk_throttle
- retry_backoff
- replica_wait

Use millisecond units so sub-second waits, such as replica polling and
nice-ratio throttling, are not truncated to zero. Skip sub-millisecond
chunk-throttle samples to avoid emitting zero-valued sleeps that would
distort the histogram.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants