
chromium-launcher: instrument restart phases to localise slow boots #230

Open
hiroTamada wants to merge 1 commit into main from hiro/chromium-restart-instrumentation

Conversation

@hiroTamada
Contributor

hiroTamada commented Apr 27, 2026

Why

When chromium is restarted (e.g. after a flag/policy change or a profile load), callers need DevTools to come back up within a tight budget. In our downstream production deployment we see that some restarts take ~6 s and exceed that budget — causing user-visible failures — while others complete in ~1 s on the very same fleet. Restarts are clearly bimodal, but right now we cannot tell which phase eats the time on the slow ones.

The plausible suspects are:

  • supervisorctl restart chromium itself (SIGKILL + spawn) being slow on a degraded host.
  • chromium-launcher's killExistingChromium (pkill -9 + pgrep wait) or waitForPort. We disable SO_REUSEADDR, so any lingering FDs / TIME_WAIT entries on the devtools port force us to wait for kernel cleanup — up to the 5 s waitForPort deadline.
  • Chromium's internal init after syscall.Exec (cache warmup, profile work) running slowly post-suspend / post-standby-restore.

This PR adds the instrumentation we need to tell those apart.

What

Two structured events. Both are correlatable downstream by host / VM identity, so they can be joined with each other and with caller-side CDP-readiness timings to get an end-to-end phase breakdown.

1. chromium_launcher_phase_complete (in chromium-launcher)

Emitted as slog JSON to stderr right before syscall.Exec. Captures:

  • Phase timings: pkill_elapsed_ms, pgrep_wait_elapsed_ms, pgrep_attempts, pgrep_timed_out, waitport_elapsed_ms, waitport_attempts, waitport_timed_out, waitport_last_err, launcher_total_ms.
  • Devtools-port socket state parsed from /proc/net/tcp: ss_estab_devtools, ss_timewait_devtools, ss_listen_devtools — directly tests the lingering-FD / TIME_WAIT hypothesis.
  • VM state: vm_uptime_ms, load_avg_1m, mem_available_kb, mem_cached_kb, mem_dirty_kb — directly tests the post-standby-restore / host-pressure hypothesis.
  • instance_name (read from INSTANCE_NAME env var if set) for downstream attribution.

killExistingChromium and waitForPort are refactored to return small phase-timing structs (`killStats`, `waitportStats`); behavior is unchanged.
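
For illustration, a minimal Go sketch of how the phase-timing structs and the pre-exec event could fit together. Only the event name and field names above come from this PR; the struct layouts, the `emitPhaseEventAndExec` helper, and its parameters are assumptions, and the /proc-derived fields are omitted for brevity.

```go
// Hypothetical sketch — not the actual launcher code. Only the event/field
// names come from the PR description; everything else is illustrative.
package main

import (
	"log/slog"
	"os"
	"syscall"
	"time"
)

// killStats is the phase-timing struct returned by killExistingChromium (shape assumed).
type killStats struct {
	PkillElapsed  time.Duration
	PgrepWait     time.Duration
	PgrepAttempts int
	PgrepTimedOut bool
}

// waitportStats is the phase-timing struct returned by waitForPort (shape assumed).
type waitportStats struct {
	Elapsed  time.Duration
	Attempts int
	TimedOut bool
	LastErr  string
}

func emitPhaseEventAndExec(ks killStats, ws waitportStats, start time.Time, chromiumPath string, args, env []string) error {
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))

	// One structured event, emitted as the last launcher output before exec.
	logger.Info("chromium_launcher_phase_complete",
		slog.Int64("pkill_elapsed_ms", ks.PkillElapsed.Milliseconds()),
		slog.Int64("pgrep_wait_elapsed_ms", ks.PgrepWait.Milliseconds()),
		slog.Int("pgrep_attempts", ks.PgrepAttempts),
		slog.Bool("pgrep_timed_out", ks.PgrepTimedOut),
		slog.Int64("waitport_elapsed_ms", ws.Elapsed.Milliseconds()),
		slog.Int("waitport_attempts", ws.Attempts),
		slog.Bool("waitport_timed_out", ws.TimedOut),
		slog.String("waitport_last_err", ws.LastErr),
		slog.Int64("launcher_total_ms", time.Since(start).Milliseconds()),
		slog.String("instance_name", os.Getenv("INSTANCE_NAME")),
	)

	// syscall.Exec never returns on success; Chromium replaces the launcher process.
	return syscall.Exec(chromiumPath, args, env)
}
```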

2. chromium_restart_complete (in `server/cmd/api/api/chromium.go`)

Emitted from `restartChromiumAndWait` after a successful restart. Decomposes the wall clock into:

  • `supervisorctl_ms` — time for `supervisorctl restart chromium` to return.
  • `post_supervisorctl_ms` — gap between supervisorctl returning and a new DevTools URL appearing in the supervisord log tail.
  • `total_ms`, `operation`, `first_devtools_url_len`.

This separates "supervisord was slow to SIGKILL+respawn" from "chromium itself was slow to initialise."
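
A rough sketch of that decomposition, with stand-in helpers for the supervisorctl call and the supervisord log-tail watch; only the event name and field names come from this PR, and the package, signatures, and helper names are assumptions.

```go
// Hypothetical sketch of the timing decomposition; not the real implementation.
package api

import (
	"context"
	"log/slog"
	"time"
)

// Assumed placeholders for the real supervisorctl call and log-tail watch.
func runSupervisorctlRestart(ctx context.Context) error         { return nil }
func waitForNewDevToolsURL(ctx context.Context) (string, error) { return "", nil }

func restartChromiumAndWait(ctx context.Context, operation string) (string, error) {
	start := time.Now()

	// Phase 1: supervisorctl restart chromium (SIGKILL + respawn).
	if err := runSupervisorctlRestart(ctx); err != nil {
		return "", err
	}
	supervisorctlMs := time.Since(start).Milliseconds()

	// Phase 2: wait for a fresh DevTools URL in the supervisord log tail.
	devtoolsURL, err := waitForNewDevToolsURL(ctx)
	if err != nil {
		return "", err
	}
	totalMs := time.Since(start).Milliseconds()

	slog.Info("chromium_restart_complete",
		"operation", operation,
		"supervisorctl_ms", supervisorctlMs,
		"post_supervisorctl_ms", totalMs-supervisorctlMs,
		"total_ms", totalMs,
		"first_devtools_url_len", len(devtoolsURL),
	)
	return devtoolsURL, nil
}
```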

Combining this with the launcher event gives: `supervisorctl_ms` → `launcher_total_ms` (with sub-phases) → chromium internal init → DevTools URL written.

`/proc` parsers

New helpers `readVMUptimeMs`, `readLoadAvg1m`, `parseMemInfo`, `countSocketStates` are all fail-soft (return zero on missing files / parse errors) and have unit tests in `main_test.go` with realistic `/proc/net/tcp` and `/proc/meminfo` fixtures (using testify `assert` / `require`).
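
To illustrate the fail-soft style, here is a sketch of what two of these helpers could look like. Only the helper names and the zero-on-failure behavior come from the description above; the exact signatures are assumptions.

```go
// Sketch of fail-soft /proc parsers: return zero values on any read/parse
// failure rather than erroring. Not the actual implementation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readLoadAvg1m returns the 1-minute load average from /proc/loadavg,
// or 0 if the file is missing or malformed.
func readLoadAvg1m() float64 {
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(data))
	if len(fields) < 1 {
		return 0
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0
	}
	return v
}

// countSocketStates counts ESTABLISHED / TIME_WAIT / LISTEN entries on the
// given local port in /proc/net/tcp. All counts are 0 if the file can't be read.
func countSocketStates(port int) (estab, timeWait, listen int) {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		return 0, 0, 0
	}
	defer f.Close()

	wantPort := fmt.Sprintf(":%04X", port) // local port is uppercase hex in /proc/net/tcp
	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 || !strings.HasSuffix(fields[1], wantPort) {
			continue
		}
		switch fields[3] { // connection state, hex-encoded
		case "01":
			estab++
		case "06":
			timeWait++
		case "0A":
			listen++
		}
	}
	return estab, timeWait, listen
}
```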

Test plan

  • `go build ./...` (server)
  • `go vet ./...`
  • `go test ./cmd/chromium-launcher/...` — passes (includes new fixture-based tests; a sketch follows the test plan)
  • `go test ./cmd/api/...` — passes
  • `go test ./lib/devtoolsproxy/... ./lib/chromiumflags/...` — passes
  • After deploy, verify both events show up downstream with expected fields.
  • Pivot `chromium_restart_complete.total_ms` by host to confirm the bimodal distribution and check correlation with `vm_uptime_ms` / `ss_timewait_devtools` from the launcher event.
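
For reference, a sketch of what one of the fixture-based parser tests could look like, assuming a hypothetical path-taking variant `countSocketStatesFromFile` and an invented `/proc/net/tcp` fixture; the real tests in `main_test.go` may differ.

```go
// Hypothetical test sketch — fixture content, port, and the path-taking
// helper variant are illustrative assumptions.
package main

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

const procNetTCPFixture = `  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:2406 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0100007F:2406 0100007F:A3D2 06 00000000:00000000 00:00000000 00000000     0        0 0 3 0000000000000000
`

func TestCountSocketStatesFromFixture(t *testing.T) {
	path := filepath.Join(t.TempDir(), "tcp")
	require.NoError(t, os.WriteFile(path, []byte(procNetTCPFixture), 0o644))

	// 0x2406 == 9222, used here as the devtools port for the fixture.
	estab, timeWait, listen := countSocketStatesFromFile(path, 9222)
	assert.Equal(t, 0, estab)
	assert.Equal(t, 1, timeWait)
	assert.Equal(t, 1, listen)
}
```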

Notes

  • Pure observability: no behavior changes to the restart path. Existing call sites of `killExistingChromium()` and `waitForPort(...)` were updated to discard the new return values where appropriate; the shape of the exec path in `main` is unchanged.
  • The new event is written to stderr where the existing diagnostic prints already go, so it'll be picked up by the same log pipeline as everything else `chromium-launcher` emits.

cc @

Made with Cursor


Note

Medium Risk
Touches the Chromium restart/wait control flow and launcher startup path; while intended to be observability-only, mistakes could change restart timing or logging behavior and affect readiness detection.

Overview
Adds structured phase instrumentation for Chromium restarts so slow boots can be attributed to specific steps.

restartChromiumAndWait now measures supervisorctl restart duration and the delay until a new DevTools URL appears, emitting chromium_restart_complete (and chromium_restart_timeout with partial timing on timeout).

chromium-launcher now records timing for pkill/pgrep and waitForPort, plus fail-soft VM/socket telemetry from /proc (uptime, load, meminfo, /proc/net/tcp state counts), emitting a single JSON chromium_launcher_phase_complete event immediately before execing Chromium; new unit tests cover the /proc parsers.

Reviewed by Cursor Bugbot for commit 67244b5. Bugbot is set up for automated code reviews on this repo.

Production telemetry (downstream consumer) shows that some chromium
restarts take ~6s and exceed callers' CDP-readiness budget, while
others complete in ~1s on the same fleet. Restarts are clearly
bimodal but we don't yet know which phase eats the time on slow ones.

This change adds two structured events that, joined by VM/host
identity downstream, let us attribute each restart to a specific
phase and VM state:

* chromium_launcher_phase_complete (slog JSON to stderr, emitted right
  before syscall.Exec): pkill / pgrep wait / waitForPort timings,
  ESTABLISHED / TIME_WAIT / LISTEN counts on the devtools port from
  /proc/net/tcp, plus VM uptime, 1m load, and memory from /proc.

* chromium_restart_complete (api log line from restartChromiumAndWait):
  decomposes wall clock into supervisorctl_ms and post_supervisorctl_ms
  so we can tell whether slow restarts are spent in supervisord
  SIGKILL / spawn vs chromium-internal init.

killExistingChromium and waitForPort are refactored to return small
phase-timing structs; behavior is unchanged. /proc parsers are
fail-soft and have unit tests with real-format fixtures.

Made-with: Cursor
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: This PR modifies chromium-launcher instrumentation and server diagnostics, but does not change kernel API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the filter.

To monitor this PR anyway, reply with @firetiger monitor this.
