feat(e2e): isolate Kubeclient rate limiters to prevent test flakes #8677
Merged
Each parallel E2E scenario now gets its own Kubeclient with an independent rate limiter, preventing validation operations from starving each other.
Within prepareCluster's DAG, heavy operations (collectGarbageVMSS, EnsureDebugDaemonsets/GetProxyURL, extractClusterParams, ACR setup) each use their own dedicated Kubeclient so that, for example, bulk node deletion in GC cannot starve the proxy pod readiness polling.
Changes:
- Extract NewKubeclient(kubeconfigBytes) factory from getClusterKubeClient
- Add Cluster.KubeConfig field and NewKubeclientForTest() method
- Add ScenarioRuntime.Kube for per-test client isolation
- Replace single kube DAG node with per-task clients in prepareCluster
- Update all s.Runtime.Cluster.Kube references to s.Runtime.Kube
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR reduces e2e test flakes caused by shared Kubernetes client-go rate limiter contention by ensuring concurrent operations (across parallel scenarios and within prepareCluster DAG tasks) use independent Kubeclient instances.
Changes:
- Added a reusable NewKubeclient([]byte) factory that creates a Kubeclient from raw kubeconfig bytes with consistent REST config tuning.
- Introduced a per-test ScenarioRuntime.Kube client and updated validators/helpers to use it instead of the shared Cluster.Kube.
- Updated prepareCluster to fetch kubeconfig bytes once and mint separate Kubeclients for heavy DAG tasks to prevent rate limiter starvation.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| e2e/validators.go | Switched validation paths to use per-test s.Runtime.Kube rather than the shared cluster client. |
| e2e/validation.go | Updated validation helpers and pod exec paths to use s.Runtime.Kube. |
| e2e/validate_localdns_exporter_metrics.go | Updated node lookup to use per-test s.Runtime.Kube. |
| e2e/types.go | Added ScenarioRuntime.Kube field to hold a per-test Kubeclient. |
| e2e/test_helpers.go | Created and stored a per-test Kubeclient; used it for node readiness polling. |
| e2e/scenario_gpu_daemonset_test.go | Updated daemonset operations to use s.Runtime.Kube. |
| e2e/kube.go | Extracted NewKubeclient(kubeconfigBytes) factory and refactored cluster client creation to use it. |
| e2e/exec.go | Updated unprivileged pod exec helper to use s.Runtime.Kube. |
| e2e/cluster.go | Added kubeconfig byte retention and per-DAG-task Kubeclient minting; added NewKubeclientForTest(). |
awesomenix
approved these changes
Jun 10, 2026
Proxy pod readiness can be delayed by slow MCR image pulls or transient node pressure on the system pool. 5 minutes is insufficient in CI environments under load.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
awesomenix
approved these changes
Jun 10, 2026
Problem
E2E tests intermittently fail with:
This happens because all DAG tasks in prepareCluster share a single Kubeclient (and its rate limiter). When collectGarbageVMSS deletes many stale nodes, it consumes all rate limiter tokens, starving GetProxyURL's polling. Similarly, parallel test scenarios sharing the same cached Cluster.Kube compete for tokens during validation.
Test runs:
Fix
Three layers of Kubeclient isolation:
- Per-test Kubeclient: each scenario creates its own client via Cluster.NewKubeclientForTest(), so parallel validations don't compete on the same rate limiter.
- Per-DAG-task Kubeclients: within prepareCluster, heavy operations (collectGarbageVMSS, EnsureDebugDaemonsets/GetProxyURL, extractClusterParams, ACR setup) each get their own independent client. Kubeconfig bytes are fetched once, then each task mints its own client.
- Extracted NewKubeclient factory: a reusable function that creates a Kubeclient from raw kubeconfig bytes with consistent config (200 QPS, 400 burst, TCP keepalive, HTTP/2 pings).
Changes
- e2e/kube.go: Extract NewKubeclient(kubeconfigBytes) factory
- e2e/cluster.go: Add KubeConfig field, NewKubeclientForTest() method, per-task DAG clients
- e2e/types.go: Add Kube field to ScenarioRuntime
- e2e/test_helpers.go: Create per-test client, use it for WaitUntilNodeReady
- e2e/validators.go, validation.go, exec.go, etc.: Replace s.Runtime.Cluster.Kube with s.Runtime.Kube
Testing
- go build ./... ✅
- go vet ./... ✅