
Override scaleset client timeout to avoid long controller stalls#4473

Open
antoinedeschenes wants to merge 1 commit into actions:master from antoinedeschenes:scaleset-client-timeout

Conversation

@antoinedeschenes

We've been observing frequent connection issues to broker.actions.githubusercontent.com over the past few weeks, and they have caused the scaleset controller to lock up for 15-20 minute periods, up to 10 times a day during work hours.

By "locking" I mean that EphemeralRunner resources stop being processed. Their status field stops matching the actual runner pod status, and new EphemeralRunners created on the cluster receive blank status fields.

In the workqueue_unfinished_work_seconds metrics, we've noticed that the controller="ephemeralrunner" series keeps rising for the whole "lock" period, and log lines for the EphemeralRunner controller stop being printed. We're also seeing Client.Timeout exceeded and context canceled errors on requests to broker.actions.githubusercontent.com during that period.

  2026-04-16 19:49:29 [DEBUG] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview                                             
  2026-04-16 19:54:29 [ERR] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview request failed: Delete                        
  "https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview": context deadline exceeded (Client.Timeout exceeded while awaiting headers)  
  2026-04-16 19:54:29 [DEBUG] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview: retrying in 1s (4 left)                    
  2026-04-16 19:59:11 [ERR] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview request failed: Delete                        
  "https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview": context canceled                                                            

It seems to us that the scaleset client's default 5 minute timeout and 4 retries are contributing to that ~20 minute stall: in the worst case, the initial attempt plus 4 retries each running to the full timeout adds up to 25 minutes for a single request. At the very least, it can leave any workqueue item stuck far longer than it should be.

Here we've reduced these to more reasonable values (30 seconds and 2 retries), and the "stalls" we used to see now last under 2 minutes:

  2026-04-24 11:46:18 [ERR] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview request failed: Delete                        
  "https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview": context deadline exceeded (Client.Timeout exceeded while awaiting headers)  
  2026-04-24 11:46:49 [ERR] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview request failed: Delete
  "https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview": context deadline exceeded (Client.Timeout exceeded while awaiting headers)  
  2026-04-24 11:47:21 [ERR] DELETE https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview request failed: Delete
  "https://broker.actions.githubusercontent.com/rest/_apis/distributedtask/pools/0/agents/REDACTED?api-version=6.0-preview": context deadline exceeded (Client.Timeout exceeded while awaiting headers)  

I'm happy to change this to use config flags if you think that's a better option, or let me know if there's a better solution to this.

Note: I also tried increasing "runner-max-concurrent-reconciles", but all connections to the broker started failing simultaneously, leaving all concurrent calls stuck in the retry loop.

Contributor

Copilot AI left a comment


Pull request overview

Reduces Actions Scale Set client request timeout and retry count to prevent long reconciliation stalls when broker.actions.githubusercontent.com becomes intermittently unreachable.

Changes:

  • Set scaleset HTTP client timeout to 30 seconds.
  • Reduce scaleset HTTP client max retries to 2.


Comment on lines +139 to +142
options := []scaleset.HTTPOption{
scaleset.WithTimeout(30 * time.Second),
scaleset.WithRetryMax(2),
}

Copilot AI Apr 24, 2026


The timeout (30s) and retry max (2) are hard-coded here, which makes this operationally hard to tune across environments (e.g., GitHub Enterprise with higher latency vs. public GitHub) and turns a runtime behavior change into an implicit policy. Consider wiring these values from controller configuration/flags (similar to other tuning flags in main.go) and/or at least defining them as named constants with a brief rationale so operators can adjust without code changes.

@antoinedeschenes antoinedeschenes force-pushed the scaleset-client-timeout branch from 868ce7d to 7f31ad0 Compare April 24, 2026 13:21
@prune998

I can second this.

We usually see metrics (screenshot omitted) showing reconciles not finishing and piling up, but no reconcile errors being added. It clearly seems that the deletes are hanging on the GitHub API, usually for 15-20 minutes, before timing out.

With this patch applied, we clearly see less unfinished work, though we do see reconcile errors increasing slowly (screenshot omitted).

