Skip to content

fix(e2e): retry localdns restart in cold-start validator#8649

Open
jingwenw15 wants to merge 2 commits into
mainfrom
jingwenwu/localdns-coldstart-restart-retry-pr
Open

fix(e2e): retry localdns restart in cold-start validator#8649
jingwenw15 wants to merge 2 commits into
mainfrom
jingwenwu/localdns-coldstart-restart-retry-pr

Conversation

@jingwenw15

@jingwenw15 jingwenw15 commented Jun 5, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:

This PR hardens ValidateLocalDNSHostsPluginColdStart against a false-negative restart failure in the E2E itself.

The cold-start validator truncates /etc/localdns/hosts and restarts localdns to validate that lookups fall through to upstream DNS until the hosts file is repopulated. On the scriptless_nbc path, that forced restart is intermittently racy because localdns shutdown/startup includes asynchronous cleanup and reconfiguration. The first systemctl restart localdns can fail transiently even though an immediate retry succeeds with the same empty-hosts-file state, so the test flakes without a product regression.

This change replaces the one-shot restart calls in the validator with a bounded retry helper that:

  • retries systemctl restart localdns up to 3 times with a 5s delay
  • resets the failed state before retrying
  • captures systemctl status and recent journalctl output on failed attempts

This PR only changes the E2E validator. It does not change LocalDNS product behavior, bootstrap order, or scriptless CSE timing.

Which issue(s) this PR fixes:

N/A (test-only flake)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a bounded-retry wrapper around systemctl restart localdns in the localdns cold-start validator to reduce test flakiness from transient restart failures during async systemd/network cleanup.

Changes:

  • Introduce restart_localdns_with_retry helper with logging and bounded retries.
  • Replace direct systemctl restart localdns calls with the retry helper across the cold-start validation flow.

Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go
Comment thread e2e/validators.go
echo "$1" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Helper: restart localdns with bounded retries.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of failures we are getting would be best to be investigated instead of adding a retry... this is just hiding the root cause. if you need to wait, best to wait for stability ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants