Override scaleset client timeout to avoid long controller stalls #4473
antoinedeschenes wants to merge 1 commit into actions:master
Conversation
Pull request overview
Reduces Actions Scale Set client request timeout and retry count to prevent long reconciliation stalls when broker.actions.githubusercontent.com becomes intermittently unreachable.
Changes:
- Set scaleset HTTP client timeout to 30 seconds.
- Reduce scaleset HTTP client max retries to 2.
```go
options := []scaleset.HTTPOption{
	scaleset.WithTimeout(30 * time.Second),
	scaleset.WithRetryMax(2),
}
```
The timeout (30s) and retry max (2) are hard-coded here, which makes this operationally hard to tune across environments (e.g., GitHub Enterprise with higher latency vs. public GitHub) and turns a runtime behavior change into an implicit policy. Consider wiring these values from controller configuration/flags (similar to other tuning flags in main.go) and/or at least defining them as named constants with a brief rationale so operators can adjust without code changes.
Force-pushed 868ce7d to 7f31ad0


We've been observing frequent connection issues to `broker.actions.githubusercontent.com` in the past weeks, and that has caused the scaleset controller to lock up for 15-20 minute periods, up to 10 times a day during work hours. By "locking" I mean that EphemeralRunner resources stop being processed: the `status` field on them starts mismatching the actual runner pod status, and new EphemeralRunners created on the cluster receive blank `status` fields.

In the `workqueue_unfinished_work_seconds` metrics, we've noticed that the `controller="ephemeralrunner"` metric keeps rising for that whole "lock" period, and log lines from the EphemeralRunner controller stop being printed. We're also seeing `Client.Timeout exceeded` and `context canceled` errors for `broker.actions.githubusercontent.com` during that period.

It seems to us that the scaleset client's default 5-minute timeout and 4 retries are contributing to that ~20-minute stall. At the very least, they can leave any workqueue item stuck far longer than it should be.
Here we've reduced these to something more reasonable (30 seconds and 2 retries), and the "stalls" we used to see now last under 2 minutes.
I'm happy to change this to use config flags if you judge that's a better option, or please let me know if you think there's a better solution to this.
Note: I also tried increasing "runner-max-concurrent-reconciles", but all connections to the broker started failing simultaneously, leaving us with all concurrent calls stuck in the retry loop.