chromium-launcher: instrument restart phases to localise slow boots#230
Open
hiroTamada wants to merge 1 commit intomainfrom
Open
chromium-launcher: instrument restart phases to localise slow boots#230hiroTamada wants to merge 1 commit intomainfrom
hiroTamada wants to merge 1 commit intomainfrom
Conversation
Production telemetry (downstream consumer) shows that some chromium restarts take ~6s and exceed callers' CDP-readiness budget, while others complete in ~1s on the same fleet. Restarts are clearly bimodal but we don't yet know which phase eats the time on slow ones. This change adds two structured events that, joined by VM/host identity downstream, let us attribute each restart to a specific phase and VM state: * chromium_launcher_phase_complete (slog JSON to stderr, emitted right before syscall.Exec): pkill / pgrep wait / waitForPort timings, ESTABLISHED / TIME_WAIT / LISTEN counts on the devtools port from /proc/net/tcp, plus VM uptime, 1m load, and memory from /proc. * chromium_restart_complete (api log line from restartChromiumAndWait): decomposes wall clock into supervisorctl_ms and post_supervisorctl_ms so we can tell whether slow restarts are spent in supervisord SIGKILL / spawn vs chromium-internal init. killExistingChromium and waitForPort are refactored to return small phase-timing structs; behavior is unchanged. /proc parsers are fail-soft and have unit tests with real-format fixtures. Made-with: Cursor
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: This PR modifies chromium-launcher instrumentation and server diagnostics, but does not change kernel API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the filter. To monitor this PR anyway, reply with |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
When chromium is restarted (e.g. after a flag/policy change or a profile load), callers need DevTools to come back up within a tight budget. In our downstream production deployment we see that some restarts take ~6 s and exceed that budget — causing user-visible failures — while others complete in ~1 s on the very same fleet. Restarts are clearly bimodal, but right now we cannot tell which phase eats the time on the slow ones.
The plausible suspects are:
supervisorctl restart chromiumitself (SIGKILL + spawn) being slow on a degraded host.chromium-launcher'skillExistingChromium(pkill -9+pgrepwait) orwaitForPort. We disableSO_REUSEADDR, so any lingering FDs / TIME_WAIT entries on the devtools port force us to wait for kernel cleanup — up to the 5 swaitForPortdeadline.syscall.Exec(cache warmup, profile work) running slowly post-suspend / post-standby-restore.This PR adds the instrumentation we need to tell those apart.
What
Two structured events. Both are correlatable downstream by host / VM identity, so they can be joined with each other and with caller-side CDP-readiness timings to get an end-to-end phase breakdown.
1.
chromium_launcher_phase_complete(inchromium-launcher)Emitted as slog JSON to stderr right before
syscall.Exec. Captures:pkill_elapsed_ms,pgrep_wait_elapsed_ms,pgrep_attempts,pgrep_timed_out,waitport_elapsed_ms,waitport_attempts,waitport_timed_out,waitport_last_err,launcher_total_ms./proc/net/tcp:ss_estab_devtools,ss_timewait_devtools,ss_listen_devtools— directly tests the lingering-FD / TIME_WAIT hypothesis.vm_uptime_ms,load_avg_1m,mem_available_kb,mem_cached_kb,mem_dirty_kb— directly tests the post-standby-restore / host-pressure hypothesis.instance_name(read fromINSTANCE_NAMEenv var if set) for downstream attribution.killExistingChromiumandwaitForPortare refactored to return small phase-timing structs (`killStats`, `waitportStats`); behavior is unchanged.2.
chromium_restart_complete(in `server/cmd/api/api/chromium.go`)Emitted from `restartChromiumAndWait` after a successful restart. Decomposes the wall clock into:
This separates "supervisord was slow to SIGKILL+respawn" from "chromium itself was slow to initialise."
Combined with the launcher event we get: `supervisorctl_ms` → `launcher_total_ms` (with sub-phases) → chromium internal init → DevTools URL written.
`/proc` parsers
New helpers `readVMUptimeMs`, `readLoadAvg1m`, `parseMemInfo`, `countSocketStates` are all fail-soft (return zero on missing files / parse errors) and have unit tests in `main_test.go` with realistic `/proc/net/tcp` and `/proc/meminfo` fixtures (using testify `assert` / `require`).
Test plan
Notes
cc @
Made with Cursor
Note
Medium Risk
Touches the Chromium restart/wait control flow and launcher startup path; while intended to be observability-only, mistakes could change restart timing or logging behavior and affect readiness detection.
Overview
Adds structured phase instrumentation for Chromium restarts so slow boots can be attributed to specific steps.
restartChromiumAndWaitnow measuressupervisorctl restartduration and the delay until a new DevTools URL appears, emittingchromium_restart_complete(andchromium_restart_timeoutwith partial timing on timeout).chromium-launchernow records timing forpkill/pgrepandwaitForPort, plus fail-soft VM/socket telemetry from/proc(uptime, load, meminfo,/proc/net/tcpstate counts), emitting a single JSONchromium_launcher_phase_completeevent immediately beforeexecing Chromium; new unit tests cover the/procparsers.Reviewed by Cursor Bugbot for commit 67244b5. Bugbot is set up for automated code reviews on this repo. Configure here.