feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396
feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396amir-deris wants to merge 11 commits intomainfrom
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3396 +/- ##
==========================================
+ Coverage 59.10% 59.12% +0.02%
==========================================
Files 2101 2103 +2
Lines 173195 173907 +712
==========================================
+ Hits 102360 102821 +461
- Misses 61955 62188 +233
- Partials 8880 8898 +18
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
bdchatham
left a comment
There was a problem hiding this comment.
Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):
- Counter and histogram names will double-suffix under the Prometheus exporter (
_total_total,_seconds_seconds). app_tx_count_totaldouble-counts every tx via the unlabeled + labeledAddpair.txGasUsed/txGasWantedasInt64Gaugeis last-write-wins under OCC concurrency.
Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.
Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.
| // InitAnteMetrics registers all OTel instruments for the ante package. | ||
| // Safe to call concurrently; instruments are registered exactly once. | ||
| func InitAnteMetrics() { | ||
| appAnteMetrics.mu.Lock() |
| var millisecondBuckets = metric.WithExplicitBucketBoundaries( | ||
| 0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1, | ||
| ) |
There was a problem hiding this comment.
nit: I'd add 0.25, 0.5, 1.0 — commitDuration is on this set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through BaseApp.Commit → cms.Commit(true) — the foreground write blocks during L0→L1 stalls). 100ms ceiling buckets that exactly when you most want tail visibility.
Summary
Migrates telemetry in the
appandapp/antepackages from the legacytelemetry/utilmetricshelpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 forevmrpc.app/metrics.gowith a struct-based OTel instrument set registered viaotel.Meter("app"), initialized once inNewAppafterSetupOtelMetricsProvider()app/ante/metrics.gowith a lazily-initializedanteMetricsstruct viasync.OnceValue(fires on first CheckTx, after the global MeterProvider is set)telemetry.MeasureSince/utilmetricscalls inabci.goandinvariance.gowith dual-emitted OTel instruments; legacy calls remain withTODO(PLT-327)markers until dashboards are migratedctx/ctx.Context()through all.Record()and.Add()call sites to support OTel's context-based propagationGaugeSeidVersionAndCommitcall with anapp_build_infoobservable gauge that fires on Prometheus scrape — no per-block overheadNew metrics (OTel naming convention, exported via the process-wide MeterProvider)
ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):
app_abci_begin_block_duration_secondsapp_abci_end_block_duration_secondsapp_abci_module_end_block_duration_secondsapp_abci_check_tx_duration_secondsapp_abci_deliver_tx_duration_secondsapp_abci_deliver_batch_tx_duration_secondsapp_block_process_duration_seconds— block tx processing duration by execution typeTransaction counters and gas:
app_tx_count_totalapp_tx_process_type_total— by execution type labelapp_tx_gas_total— cumulative gas by type labelapp_tx_gas_used/app_tx_gas_wanted— per-tx gaugesApp-level flow counters:
app_optimistic_processing_total— cache hit (enabled=true) vs missapp_failed_total_gas_wanted_check_totalapp_giga_fallback_to_v2_totalLight invariance:
app_lightinvariance_supply_duration_secondsapp_lightinvariance_supply_invalid_key_totalapp_lightinvariance_supply_unmarshal_failure_totalBuild info:
app_build_info— observable gauge, always 1, labels:seid_version,commitAnte:
app_pending_nonce_total— pending nonce events by type (added, expired, rejected, accepted)Migration note
Legacy metrics (
telemetry.MeasureSince,utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotatedTODO(PLT-327)and will be removed once dashboards are verified against the new OTel series.