StatsD instrumentation: row-copy, backlog, lag, throttle, cut-over, query latency, sleep #1701
Open
forge33 wants to merge 9 commits into
…me helpers

Introduce a single Emitter interface (Gauge, Count, Histogram) on metrics.Client so all metric helpers share one testable contract. The interface replaces the narrow MemStatsGaugeEmitter type used only by the Go runtime reporter.

- Add Emitter interface and Histogram method on Client.
- Move go_runtime.go and go_runtime_test.go into emit.go and emit_test.go so the metrics package has a single place for helper functions and tests.
- Switch MigrationContext.Metrics from *metrics.Client to metrics.Emitter so consumers can be tested with spies.
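A minimal sketch of how a spy can stand in for the client once consumers depend on an interface. The method parameter lists, `SpyEmitter`, `Call`, and `EmitRowsCopied` are all assumptions made for illustration; the commit message only names the three methods.

```go
package main

import "fmt"

// Emitter lists the three methods named in the commit message. The parameter
// lists below are assumptions for this sketch, not the real signatures.
type Emitter interface {
	Gauge(name string, value float64, tags ...string)
	Count(name string, value int64, tags ...string)
	Histogram(name string, value float64, tags ...string)
}

// Call is one recorded emission.
type Call struct {
	Kind  string
	Name  string
	Value float64
	Tags  []string
}

// SpyEmitter is a hypothetical test double that records every call.
type SpyEmitter struct{ Calls []Call }

func (s *SpyEmitter) Gauge(name string, value float64, tags ...string) {
	s.Calls = append(s.Calls, Call{"gauge", name, value, tags})
}

func (s *SpyEmitter) Count(name string, value int64, tags ...string) {
	s.Calls = append(s.Calls, Call{"count", name, float64(value), tags})
}

func (s *SpyEmitter) Histogram(name string, value float64, tags ...string) {
	s.Calls = append(s.Calls, Call{"histogram", name, value, tags})
}

// EmitRowsCopied is a hypothetical nil-safe helper: a nil Emitter is a no-op.
func EmitRowsCopied(e Emitter, rows int64) {
	if e == nil {
		return
	}
	e.Gauge("gh_ost.row_copy.rows_copied", float64(rows))
}

func main() {
	spy := &SpyEmitter{}
	EmitRowsCopied(spy, 1200)
	EmitRowsCopied(nil, 1200) // metrics disabled: must not panic
	fmt.Printf("recorded %d call(s): %s %s=%v\n",
		len(spy.Calls), spy.Calls[0].Kind, spy.Calls[0].Name, spy.Calls[0].Value)
}
```

A test can then assert on `spy.Calls` instead of opening a UDP socket, which is the kind of wiring check the later commits rely on.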
Sample row-copy, DML, backlog, and lag once per tick into a snapshot passed to printStatus, with reportStatus as the single entry point for status output. Co-authored-by: Cursor <cursoragent@cursor.com>
Add gh_ost.row_copy.rows_copied, gh_ost.row_copy.rows_estimate, and gh_ost.dml.events_applied gauges on each reportStatus tick, sampled from migrationProgressSnapshot.
Add gh_ost.binlog.backlog_size, gh_ost.binlog.backlog_capacity, and gh_ost.binlog.backlog_utilization gauges on each reportStatus tick from the applyEventsQueue depth captured in migrationProgressSnapshot.
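The PR description gives utilization as size / cap clamped to [0, 1]. A small sketch of that calculation follows; the function name and the zero-capacity behaviour are assumptions, and a plain buffered channel stands in for applyEventsQueue.

```go
package main

import "fmt"

// backlogUtilization returns size/capacity clamped to [0, 1].
// Returning 0 for a non-positive capacity is an assumption of this sketch.
func backlogUtilization(size, capacity int) float64 {
	if capacity <= 0 {
		return 0
	}
	u := float64(size) / float64(capacity)
	if u < 0 {
		return 0
	}
	if u > 1 {
		return 1
	}
	return u
}

func main() {
	// A buffered channel stands in for the events queue.
	queue := make(chan struct{}, 100)
	for i := 0; i < 25; i++ {
		queue <- struct{}{}
	}
	fmt.Println(len(queue), cap(queue), backlogUtilization(len(queue), cap(queue)))
}
```

Reading `len` and `cap` once per tick keeps the three gauges consistent with each other, since the utilization is derived from the same sample as the size.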
Add gh_ost.lag.replication_seconds and gh_ost.lag.heartbeat_seconds gauges on each reportStatus tick, tagged with throttled:true|false. These are point-in-time readings (not distributions), so gauges are used rather than histograms — DogStatsD histogram aggregation exposes count/max series that do not match the log line lag values in Prometheus/Grafana.
Record throttle active state at a debounced cadence (gh_ost.throttle.active) and emit duration plus event metrics when a throttled interval completes (gh_ost.throttle.duration_seconds histogram and gh_ost.throttle.events_total count), each tagged with the throttling reason.
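One way the debounce and the interval-exit emission could fit together is sketched below. `throttleTracker`, `observe`, and `simulate` are hypothetical names, the clock is injected so the behaviour is deterministic, and the reason tag is left out for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// throttleTracker is a hypothetical helper; its names are assumptions.
type throttleTracker struct {
	debounce  time.Duration // minimum spacing between active-gauge emissions
	lastGauge time.Time
	since     time.Time // start of the current throttled interval, zero if none
}

// observe is called on every throttle check. emitGauge reports whether the
// active gauge is due; ended is true when a throttled interval just closed,
// in which case length is how long it lasted.
func (t *throttleTracker) observe(now time.Time, throttled bool) (emitGauge, ended bool, length time.Duration) {
	if t.lastGauge.IsZero() || now.Sub(t.lastGauge) >= t.debounce {
		t.lastGauge = now
		emitGauge = true
	}
	switch {
	case throttled && t.since.IsZero():
		t.since = now
	case !throttled && !t.since.IsZero():
		ended, length = true, now.Sub(t.since)
		t.since = time.Time{}
	}
	return emitGauge, ended, length
}

// simulate feeds one state per tick, stepMs apart, and returns how many gauge
// emissions were due plus the length of every completed throttled interval.
func simulate(stepMs int, states []bool) (gauges int, intervals []time.Duration) {
	t := &throttleTracker{debounce: time.Second}
	now := time.Unix(1000, 0)
	for _, throttled := range states {
		emit, ended, length := t.observe(now, throttled)
		if emit {
			gauges++
		}
		if ended {
			intervals = append(intervals, length)
		}
		now = now.Add(time.Duration(stepMs) * time.Millisecond)
	}
	return gauges, intervals
}

func main() {
	// Eight checks 250ms apart; throttled during checks 1 through 3.
	gauges, intervals := simulate(250, []bool{false, true, true, true, false, false, false, false})
	fmt.Println(gauges, intervals) // 2 [750ms]
}
```

In this sketch the gauge is due on the first check and again one second later, while the duration and event metrics would fire once, when the throttled interval closes.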
Add cut-over metric helpers and instrument cut-over attempts, phase durations, and terminal duration.

Metrics emitted:
- gh_ost.cut_over.attempts_total tagged with outcome
- gh_ost.cut_over.phase_duration_milliseconds tagged with phase and outcome
- gh_ost.cut_over.total_duration_milliseconds tagged with outcome

Phase coverage includes the magic lock, original table lock, magic rename, and unlock paths. Durations are reported in milliseconds to preserve sub-second granularity. The atomic rename phase duration is recorded after the rename completes so the histogram reflects the full operation.
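The "record after the operation completes" point can be illustrated with a small timing wrapper. This is a sketch only: `timePhase`, `phaseSample`, and the "success"/"error" outcome values are assumptions, not the helpers added in this commit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// phaseSample is what a histogram emission would carry in this sketch.
type phaseSample struct {
	Phase   string
	Outcome string // "success" or "error": assumed tag values
	Millis  float64
}

var errLockTimeout = errors.New("lock wait timeout")

// timePhase runs op and records its duration only after op has returned,
// so the sample covers the whole operation, including failures.
func timePhase(phase string, record func(phaseSample), op func() error) error {
	start := time.Now()
	err := op()
	outcome := "success"
	if err != nil {
		outcome = "error"
	}
	// Dividing as floats keeps the fractional milliseconds.
	ms := float64(time.Since(start)) / float64(time.Millisecond)
	record(phaseSample{Phase: phase, Outcome: outcome, Millis: ms})
	return err
}

func main() {
	record := func(s phaseSample) {
		fmt.Printf("phase=%s outcome=%s ms=%.3f\n", s.Phase, s.Outcome, s.Millis)
	}
	_ = timePhase("magic_rename", record, func() error {
		time.Sleep(5 * time.Millisecond) // stand-in for the rename
		return nil
	})
	_ = timePhase("magic_lock", record, func() error { return errLockTimeout })
}
```

Recording before `op` returns would undercount the slowest phases, which is the case the commit message calls out for the atomic rename.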
Emit gh_ost.query.duration_milliseconds for representative source-side and target-side queries (row count and binlog apply), tagged with side, kind, and outcome (ok|error). Helper validates inputs and is nil-safe.
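A sketch of what input validation for the tag set might look like. The allowed `side` and `kind` values are taken from the PR description; `queryTags` and its drop-on-invalid behaviour are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	validSides = map[string]bool{"source": true, "target": true}
	validKinds = map[string]bool{
		"chunk_copy": true, "range_select": true, "binlog_apply": true, "row_count": true,
	}
	errDeadlock = errors.New("deadlock") // example query error
)

// queryTags builds the tag set for gh_ost.query.duration_milliseconds.
// It returns ok=false for an unknown side or kind so the caller can drop the sample.
func queryTags(side, kind string, err error) (tags []string, ok bool) {
	if !validSides[side] || !validKinds[kind] {
		return nil, false
	}
	outcome := "ok"
	if err != nil {
		outcome = "error"
	}
	return []string{"side:" + side, "kind:" + kind, "outcome:" + outcome}, true
}

func main() {
	fmt.Println(queryTags("source", "row_count", nil))
	fmt.Println(queryTags("target", "binlog_apply", errDeadlock))
	fmt.Println(queryTags("replica", "row_count", nil)) // rejected: unknown side
}
```

Rejecting unknown values up front keeps tag cardinality bounded, which matters for any StatsD backend that creates one series per tag combination.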
Add sleep metric helpers and instrument the main migration sleep/wait paths.

Metrics emitted:
- gh_ost.sleep.duration_milliseconds tagged with stage
- gh_ost.sleep.total_milliseconds tagged with stage

Stages covered:
- cut_over_postpone
- chunk_throttle
- retry_backoff
- replica_wait

Use millisecond units so sub-second waits, such as replica polling and nice-ratio throttling, are not truncated to zero. Skip sub-millisecond chunk-throttle samples to avoid emitting zero-valued sleeps that would distort the histogram.
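To make the unit choice concrete, here is a sketch of the truncation problem and the skip rule. `sleepMillis` is a hypothetical name, and rounding down to whole milliseconds is an assumption about how the value is reported.

```go
package main

import (
	"fmt"
	"time"
)

// sleepMillis converts a sleep to whole milliseconds and reports whether the
// sample should be emitted. The name and the integer rounding are assumptions.
func sleepMillis(stage string, d time.Duration) (ms int64, emit bool) {
	ms = d.Milliseconds()
	if stage == "chunk_throttle" && ms == 0 {
		return 0, false // sub-millisecond nice-ratio wait: skip
	}
	return ms, true
}

func main() {
	poll := 500 * time.Millisecond
	fmt.Println(int64(poll.Seconds()))                               // 0: lost in whole seconds
	fmt.Println(sleepMillis("replica_wait", poll))                   // 500 true
	fmt.Println(sleepMillis("chunk_throttle", 400*time.Microsecond)) // 0 false
}
```

A 500ms replica poll reads as zero in whole seconds but as 500 in milliseconds, and the 400µs chunk-throttle wait is dropped rather than recorded as a zero.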
Related issue: #1672
Description
This PR implements the bulk of the StatsD instrumentation spec from #1672, building on the already-merged StatsD client (#1689) and Go runtime metrics (#1690). It adds metrics for row-copy progress, binlog backlog, replication/heartbeat lag, throttle activity, cut-over phases, query latency, and per-stage sleeps.
These changes were developed and proven out in production on Shopify's fork of gh-ost over the last few weeks. They're being upstreamed here, reorganized into focused, commit-by-commit changes for easier review:
1. `Emitter` interface; consolidate go_runtime helpers: `metrics.Emitter` (Gauge / Count / Histogram) replaces the narrower `MemStatsGaugeEmitter`; `EmitGoRuntimeGauges` + `StartGoRuntimeReporter` move into `emit.go` so the metrics package has a single helpers file. Switches `MigrationContext.Metrics` to `metrics.Emitter` so consumers are testable with spies.
2. `migrationProgressSnapshot`: sample row-copy, DML, backlog, and lag once per tick into a snapshot passed to `printStatus`, with `reportStatus` as the single entry point for status output. Prerequisite for the next several commits (so metrics emit on every tick even when status printing is suppressed).
3. `row_copy.rows_copied`, `row_copy.rows_estimate`, `dml.events_applied`.
4. `binlog.backlog_size`, `binlog.backlog_capacity`, `binlog.backlog_utilization` (size / cap, clamped to [0, 1]).
5. `lag.replication_seconds`, `lag.heartbeat_seconds`, both tagged `throttled:true|false`.
6. `throttle.active` gauge (debounced to 1Hz), `throttle.duration_milliseconds` histogram + `throttle.events_total` count on each throttled-interval exit, tagged `reason:`.
7. `cut_over.attempts_total` (tagged `outcome:success|retry|abort`), `cut_over.phase_duration_milliseconds` (tagged `phase:` covering magic_lock, original_table_lock, magic_rename, unlock), `cut_over.total_duration_milliseconds`. Atomic rename phase duration is recorded after the rename completes so the histogram reflects the full operation.
8. `query.duration_milliseconds`, tagged `side:source|target`, `kind:chunk_copy|range_select|binlog_apply|row_count`, `outcome:ok|error`.
9. `sleep.duration_milliseconds` + `sleep.total_milliseconds`, tagged `stage:cut_over_postpone|chunk_throttle|retry_backoff|replica_wait`. Sub-millisecond chunk-throttle samples are skipped to avoid zero-valued histogram entries from sub-microsecond nice-ratio waits.

Notes on divergences from the original spec in #1672
A few intentional differences from the initial spec, agreed in the issue thread:
- `_milliseconds` units instead of `_seconds` for `query.duration`, `cut_over.phase_duration`, `cut_over.total_duration`, `sleep.duration`, `sleep.total`, and `throttle.duration`: sub-second waits (replica polling at 500ms, nice-ratio chunk throttling) truncate to zero in seconds. The lag metrics keep `_seconds` because they're whole-seconds in operator intuition and match the existing status log line.
- `lag.*` as gauges, not histograms: these are point-in-time readings sampled once per status tick, not a distribution. A gauge gives "the lag right now", which matches the `Lag: 2.5s` status output. DogStatsD histogram aggregation exposes count/max series that don't correspond to those values.
- `range_select` tagged `side:source`: `CalculateNextIterationRangeEndValues` lives in `Applier` but reads chunk boundaries from the source table.
- No `kind:heartbeat_read` query metric: it would fire on every throttle tick and would measure the act of reading lag rather than the lag itself, which is already exposed via `gh_ost.lag.*`.

The issue has been updated to reflect these.
How to review
The PR is intentionally structured as a commit-by-commit story. Reviewing per commit will be much easier than reading the combined diff.
1. Start with commit 1 (`metrics: introduce unified Emitter interface…`). This is the foundation: confirm the `Emitter` contract and that go_runtime is just being moved, not behaviour-changed.
2. Commit 2 (`Refactor status reporting around migrationProgressSnapshot`) is the only non-metric commit. It's a pure refactor of `printStatus` to sample once per tick. Worth confirming the status output is identical for all `PrintStatusRule` values and that the snapshot test coverage in `progress_snapshot_test.go` matches the previous inline logic.
3. Commits 3–9 each add one metric (or metric group). The pattern is the same every time:
   - A helper in `go/metrics/emit.go` with a nil-safe guard.
   - Tests in `go/metrics/emit_test.go` (including a `_nilSafe` test).
   - A spy `Emitter` to assert the wiring fires correctly.

   Each commit builds and tests independently: `git checkout <sha> && go test ./...` will work at every step.
4. Cross-check the wiring against the metric table in StatsD instrumentation #1672 (updated to match). The "Notes on divergences" above capture the four deliberate deltas.
5. Look at the cut-over commit (7) carefully: it's the most invasive in `migrator.go` because cut-over has three phases (`cutOver`, `cutOverTwoStep`, `atomicCutOver`) and `cutOverOperationWithMetrics` wraps the retrier. Tests cover the retry→success and immediate-abort paths.

The full file diff stat is small (~1.2K insertions, mostly tests).
Test plan
- `go build ./...`: clean.
- `go test ./...`: 212 tests pass locally, including new unit coverage for every helper and every wiring point. Each helper is nil-safe and has a `_nilSafe` test.

Some screenshots from a local Grafana dashboard (I didn't take the time to configure the histogram Prometheus queries properly for this; I was just making sure that we're emitting and the numbers look correct):






Commit structure
Per the contributing guide's request for focused changes, each commit is independently buildable and tested. Commits 2–9 are stacked dependencies and grouped here so the metrics package and its consumers land together rather than as a series of half-wired PRs.
`script/cibuild` returns with no formatting errors, build errors or unit test errors.