Override scaleset client timeout to avoid long controller stalls #4473
antoinedeschenes wants to merge 1 commit into actions:master
Conversation
Pull request overview
Reduces Actions Scale Set client request timeout and retry count to prevent long reconciliation stalls when broker.actions.githubusercontent.com becomes intermittently unreachable.
Changes:
- Set scaleset HTTP client timeout to 30 seconds.
- Reduce scaleset HTTP client max retries to 2.
```go
options := []scaleset.HTTPOption{
	scaleset.WithTimeout(30 * time.Second),
	scaleset.WithRetryMax(2),
}
```
The timeout (30s) and retry max (2) are hard-coded here, which makes this operationally hard to tune across environments (e.g., GitHub Enterprise with higher latency vs. public GitHub) and turns a runtime behavior change into an implicit policy. Consider wiring these values from controller configuration/flags (similar to other tuning flags in main.go) and/or at least defining them as named constants with a brief rationale so operators can adjust without code changes.
Force-pushed 868ce7d to 7f31ad0


We've been observing frequent connection issues to `broker.actions.githubusercontent.com` in the past weeks, and that has caused the scaleset controller to lock up for 15-20 minute periods, up to 10 times a day during work hours. By "locking" I mean that EphemeralRunner resources stop being processed: the `status` field on them starts mismatching the actual runner pod status, and new EphemeralRunners created on the cluster receive blank `status` fields.

In the `workqueue_unfinished_work_seconds` metrics, we've noticed that the `controller="ephemeralrunner"` metric keeps rising for that whole "lock" period, and log lines from the EphemeralRunner controller stop being printed. We're also seeing `Client.Timeout exceeded` and `context canceled` errors for `broker.actions.githubusercontent.com` during that period.

It seems to us that the scaleset client's default 5-minute timeout and 4 retries are contributing to that ~20-minute stall. At the very least, they can leave any workqueue item stuck far longer than it should be.
Here we've reduced these to something more reasonable (30 seconds and 2 retries), and the "stalls" we used to see now last under 2 minutes.
I'm happy to change this to use config flags if you judge that's a better option, or please let me know if you think there's a better solution to this.
Note: I also tried increasing "runner-max-concurrent-reconciles", but all connections to the broker started failing simultaneously, leaving us with all concurrent calls stuck in the retry loop.