feat(e2e): isolate Kubeclient rate limiters to prevent test flakes #8677

Merged
awesomenix merged 5 commits into main from timmy/e2e-throttling on Jun 10, 2026

@timmy-wright timmy-wright commented Jun 10, 2026

Contributor

Problem

E2E tests intermittently fail with:

prepare cluster tasks: dag execution failed: waiting for proxy pod to be ready:
listing proxy pods: client rate limiter Wait returned an error: context deadline exceeded

This happens because all DAG tasks in prepareCluster share a single Kubeclient (and its rate limiter). When collectGarbageVMSS deletes many stale nodes, it consumes all rate limiter tokens, starving GetProxyURL's polling. Similarly, parallel test scenarios sharing the same cached Cluster.Kube compete for tokens during validation.

Test runs:

Fix

Three layers of Kubeclient isolation:

  1. Per-test Kubeclient — each scenario creates its own client via Cluster.NewKubeclientForTest(), so parallel validations don't compete on the same rate limiter.

  2. Per-DAG-task Kubeclients — within prepareCluster, heavy operations (collectGarbageVMSS, EnsureDebugDaemonsets/GetProxyURL, extractClusterParams, ACR setup) each get their own independent client. Kubeconfig bytes are fetched once, then each task mints its own client.

  3. Extracted NewKubeclient factory — reusable function that creates a Kubeclient from raw kubeconfig bytes with consistent config (200 QPS, 400 burst, TCP keepalive, HTTP/2 pings).
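
The ownership pattern behind these three layers might look roughly like the following. This is a stdlib-only sketch under stated assumptions: the `limiter` type is a stand-in with a fixed budget and no refill, the `Kubeclient` and `Cluster` structs are reduced to the one field each that matters here, and the signatures of `NewKubeclient` and `NewKubeclientForTest` are guesses rather than the ones in e2e/kube.go and e2e/cluster.go.

```go
package main

import (
	"errors"
	"fmt"
)

// limiter is a stand-in for a client-side rate limiter: a fixed token
// budget with no refill, which is enough to show who owns what.
type limiter struct{ tokens int }

// tryAcquire consumes one token if any are left.
func (l *limiter) tryAcquire() bool {
	if l.tokens == 0 {
		return false
	}
	l.tokens--
	return true
}

// Kubeclient is a hypothetical, stripped-down client that owns its limiter.
type Kubeclient struct{ limiter *limiter }

// NewKubeclient models the factory: identical settings on every call,
// but a fresh limiter each time. The signature is an assumption.
func NewKubeclient(kubeconfigBytes []byte) (*Kubeclient, error) {
	if len(kubeconfigBytes) == 0 {
		return nil, errors.New("empty kubeconfig")
	}
	return &Kubeclient{limiter: &limiter{tokens: 400}}, nil
}

// Cluster keeps the kubeconfig bytes so clients can be minted later
// without fetching the kubeconfig again.
type Cluster struct{ KubeConfig []byte }

// NewKubeclientForTest mints an independent client from the cached bytes.
func (c *Cluster) NewKubeclientForTest() (*Kubeclient, error) {
	return NewKubeclient(c.KubeConfig)
}

func main() {
	cluster := &Cluster{KubeConfig: []byte("placeholder kubeconfig")}

	gc, err := cluster.NewKubeclientForTest()
	if err != nil {
		panic(err)
	}
	proxy, err := cluster.NewKubeclientForTest()
	if err != nil {
		panic(err)
	}

	// Drain the GC client's budget completely.
	for gc.limiter.tryAcquire() {
	}

	fmt.Println("gc can send:", gc.limiter.tryAcquire())
	fmt.Println("proxy can send:", proxy.limiter.tryAcquire())
}
```

The point of the design is that the expensive step (fetching the kubeconfig) happens once, while the cheap step (building a client with its own limiter) is repeated per task or per test.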

Changes

  • e2e/kube.go — Extract NewKubeclient(kubeconfigBytes) factory
  • e2e/cluster.go — Add KubeConfig field, NewKubeclientForTest() method, per-task DAG clients
  • e2e/types.go — Add Kube field to ScenarioRuntime
  • e2e/test_helpers.go — Create per-test client, use it for WaitUntilNodeReady
  • e2e/validators.go, validation.go, exec.go, etc. — Replace s.Runtime.Cluster.Kube with s.Runtime.Kube

Testing

  • go build ./...
  • go vet ./...
  • No behavioral change to test logic — same operations, just independent rate limiters

Each parallel E2E scenario now gets its own Kubeclient with an independent
rate limiter, preventing validation operations from starving each other.

Within prepareCluster's DAG, heavy operations (collectGarbageVMSS,
EnsureDebugDaemonsets/GetProxyURL, extractClusterParams, ACR setup) each
use their own dedicated Kubeclient so that e.g. bulk node deletion in GC
cannot starve the proxy pod readiness polling.

Changes:
- Extract NewKubeclient(kubeconfigBytes) factory from getClusterKubeClient
- Add Cluster.KubeConfig field and NewKubeclientForTest() method
- Add ScenarioRuntime.Kube for per-test client isolation
- Replace single kube DAG node with per-task clients in prepareCluster
- Update all s.Runtime.Cluster.Kube references to s.Runtime.Kube

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

This PR reduces e2e test flakes caused by shared Kubernetes client-go rate limiter contention by ensuring concurrent operations (across parallel scenarios and within prepareCluster DAG tasks) use independent Kubeclient instances.

Changes:

  • Added a reusable NewKubeclient([]byte) factory that creates a Kubeclient from raw kubeconfig bytes with consistent REST config tuning.
  • Introduced a per-test ScenarioRuntime.Kube client and updated validators/helpers to use it instead of the shared Cluster.Kube.
  • Updated prepareCluster to fetch kubeconfig bytes once and mint separate Kubeclients for heavy DAG tasks to prevent rate limiter starvation.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| e2e/validators.go | Switched validation paths to use per-test s.Runtime.Kube rather than the shared cluster client. |
| e2e/validation.go | Updated validation helpers and pod exec paths to use s.Runtime.Kube. |
| e2e/validate_localdns_exporter_metrics.go | Updated node lookup to use per-test s.Runtime.Kube. |
| e2e/types.go | Added ScenarioRuntime.Kube field to hold a per-test Kubeclient. |
| e2e/test_helpers.go | Created and stored a per-test Kubeclient; used it for node readiness polling. |
| e2e/scenario_gpu_daemonset_test.go | Updated daemonset operations to use s.Runtime.Kube. |
| e2e/kube.go | Extracted NewKubeclient(kubeconfigBytes) factory and refactored cluster client creation to use it. |
| e2e/exec.go | Updated unprivileged pod exec helper to use s.Runtime.Kube. |
| e2e/cluster.go | Added kubeconfig byte retention and per-DAG-task Kubeclient minting; added NewKubeclientForTest(). |

Copilot AI review requested due to automatic review settings June 10, 2026 00:19

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread e2e/cluster.go
timmy-wright and others added 2 commits June 10, 2026 12:38
Proxy pod readiness can be delayed by slow MCR image pulls or
transient node pressure on the system pool. 5 minutes is insufficient
in CI environments under load.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@awesomenix awesomenix merged commit bc2bba8 into main Jun 10, 2026
19 of 22 checks passed
@awesomenix awesomenix deleted the timmy/e2e-throttling branch June 10, 2026 00:42