Skip to content

feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396

Open
amir-deris wants to merge 11 commits intomainfrom
amir/plt-323-migrate-app-metrics-to-otel-meter-api
Open

feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396
amir-deris wants to merge 11 commits intomainfrom
amir/plt-323-migrate-app-metrics-to-otel-meter-api

Conversation

@amir-deris
Copy link
Copy Markdown
Contributor

@amir-deris amir-deris commented May 5, 2026

Summary

Migrates telemetry in the app and app/ante packages from the legacy telemetry/utilmetrics helpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 for evmrpc.

  • Adds app/metrics.go with a struct-based OTel instrument set registered via otel.Meter("app"), initialized once in NewApp after SetupOtelMetricsProvider()
  • Adds app/ante/metrics.go with a lazily-initialized anteMetrics struct via sync.OnceValue (fires on first CheckTx, after the global MeterProvider is set)
  • Replaces telemetry.MeasureSince / utilmetrics calls in abci.go and invariance.go with dual-emitted OTel instruments; legacy calls remain with TODO(PLT-327) markers until dashboards are migrated
  • Threads ctx/ctx.Context() through all .Record() and .Add() call sites to support OTel's context-based propagation
  • Replaces the per-BeginBlock GaugeSeidVersionAndCommit call with an app_build_info observable gauge that fires on Prometheus scrape — no per-block overhead

New metrics (OTel naming convention, exported via the process-wide MeterProvider)

ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):

  • app_abci_begin_block_duration_seconds
  • app_abci_end_block_duration_seconds
  • app_abci_module_end_block_duration_seconds
  • app_abci_check_tx_duration_seconds
  • app_abci_deliver_tx_duration_seconds
  • app_abci_deliver_batch_tx_duration_seconds
  • app_block_process_duration_seconds — block tx processing duration by execution type

Transaction counters and gas:

  • app_tx_count_total
  • app_tx_process_type_total — by execution type label
  • app_tx_gas_total — cumulative gas by type label
  • app_tx_gas_used / app_tx_gas_wanted — per-tx gauges

App-level flow counters:

  • app_optimistic_processing_total — cache hit (enabled=true) vs miss
  • app_failed_total_gas_wanted_check_total
  • app_giga_fallback_to_v2_total

Light invariance:

  • app_lightinvariance_supply_duration_seconds
  • app_lightinvariance_supply_invalid_key_total
  • app_lightinvariance_supply_unmarshal_failure_total

Build info:

  • app_build_info — observable gauge, always 1, labels: seid_version, commit

Ante:

  • app_pending_nonce_total — pending nonce events by type (added, expired, rejected, accepted)

Migration note

Legacy metrics (telemetry.MeasureSince, utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotated TODO(PLT-327) and will be removed once dashboards are verified against the new OTel series.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMay 7, 2026, 11:53 PM

@amir-deris amir-deris changed the title Added otel metrics for app and app/ante feat(app): migrate app and ante telemetry to OpenTelemetry Meter API May 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 72.32143% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.12%. Comparing base (8310d93) to head (7739026).
⚠️ Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
app/invariance.go 30.76% 18 Missing ⚠️
app/app.go 62.85% 13 Missing ⚠️
app/metrics.go 89.62% 9 Missing and 2 partials ⚠️
app/abci.go 75.00% 10 Missing ⚠️
app/ante/evm_checktx.go 0.00% 8 Missing ⚠️
app/ante/metrics.go 77.77% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
+ Coverage   59.10%   59.12%   +0.02%     
==========================================
  Files        2101     2103       +2     
  Lines      173195   173907     +712     
==========================================
+ Hits       102360   102821     +461     
- Misses      61955    62188     +233     
- Partials     8880     8898      +18     
Flag Coverage Δ
sei-chain-pr 50.91% <72.32%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
app/ante/metrics.go 77.77% <77.77%> (ø)
app/ante/evm_checktx.go 34.72% <0.00%> (-0.66%) ⬇️
app/abci.go 63.19% <75.00%> (+0.69%) ⬆️
app/metrics.go 89.62% <89.62%> (ø)
app/app.go 65.86% <62.85%> (-3.99%) ⬇️
app/invariance.go 55.81% <30.76%> (+14.77%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@bdchatham bdchatham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):

  1. Counter and histogram names will double-suffix under the Prometheus exporter (_total_total, _seconds_seconds).
  2. app_tx_count_total double-counts every tx via the unlabeled + labeled Add pair.
  3. txGasUsed / txGasWanted as Int64Gauge is last-write-wins under OCC concurrency.

Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.

Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.

Comment thread app/metrics.go Outdated
Comment thread app/metrics.go Outdated
Comment thread app/metrics.go Outdated
Comment thread app/abci.go Outdated
Comment thread app/ante/metrics.go Outdated
Comment thread app/app.go Outdated
Comment thread app/ante/metrics.go Outdated
// InitAnteMetrics registers all OTel instruments for the ante package.
// Safe to call concurrently; instruments are registered exactly once.
func InitAnteMetrics() {
appAnteMetrics.mu.Lock()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use sync.Once instead?

@amir-deris amir-deris requested a review from masih May 7, 2026 23:12
Comment thread app/metrics.go
Comment on lines +24 to 26
var millisecondBuckets = metric.WithExplicitBucketBoundaries(
0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1,
)
Copy link
Copy Markdown
Contributor

@bdchatham bdchatham May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd add 0.25, 0.5, 1.0commitDuration is on this set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through BaseApp.Commitcms.Commit(true) — the foreground write blocks during L0→L1 stalls). 100ms ceiling buckets that exactly when you most want tail visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants