
feat: retry + independent cleanup in stop (Phase 5) #23

Merged
kurok merged 1 commit into feat/al2023-support from feat/11-retries-timeouts on Apr 21, 2026

Conversation


@kurok kurok commented Apr 21, 2026

Closes #11. Part of plan #15. Builds on #22's structured logging.

Scope

Adds a small retry helper, wraps the two cleanup paths in stop() with it, and makes stop() attempt both cleanups independently.

src/retry.js

withRetry(step, fn, { attempts = 3, baseMs = 2000, maxMs = 10000 })

Exponential backoff, worst-case total wait 14 s (2 + 4 + 8). Every retry emits log.warn with the step name suffixed _retry; final failure emits log.error with exhausted: true. Callers pass only idempotent operations (DELETE /runners/{id} and TerminateInstances).
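
A minimal sketch of the helper's shape, reconstructed from this description; the logger stub, the interpretation of attempts as retries after the first try, and the log payload field names are assumptions, not the merged code:

```js
// src/retry.js — sketch only, reconstructed from the PR description.

// Stand-in for the Phase 7 structured logger (#22); its real API is not
// shown in this PR, so this stub only illustrates the call shape.
const log = {
  warn: (step, fields) => console.warn(JSON.stringify({ level: 'warn', step, ...fields })),
  error: (step, fields) => console.error(JSON.stringify({ level: 'error', step, ...fields })),
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(step, fn, { attempts = 3, baseMs = 2000, maxMs = 10000 } = {}) {
  let lastErr;
  // One initial try plus up to `attempts` retries, matching the stated
  // worst case of 2 s + 4 s + 8 s = 14 s of waiting before giving up.
  for (let retry = 0; retry <= attempts; retry += 1) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (retry === attempts) break; // out of retries
      const delayMs = Math.min(baseMs * 2 ** retry, maxMs); // 2 s, 4 s, 8 s, ...
      log.warn(`${step}_retry`, { attempt: retry + 1, delayMs, error: err.message });
      await sleep(delayMs);
    }
  }
  log.error(step, { exhausted: true, error: lastErr.message });
  throw lastErr;
}

module.exports = { withRetry };
```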

Call sites wrapped

  • gh.removeRunner() — DELETE request.
  • aws.terminateEc2Instance() — TerminateInstances command.
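
Both wraps are one-liners at the call site. An illustrative fragment (argument lists assumed, not the merged code):

```js
// Inside stop(), each cleanup call becomes:
await withRetry('remove_runner', () => gh.removeRunner(runner.id));
await withRetry('terminate_instance', () => aws.terminateEc2Instance(instanceId));
```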

Call sites deliberately not wrapped

  • aws.startEc2Instance() / RunInstances — not idempotent without a ClientToken, so retrying on its own has billing implications (see the sketch after this list).
  • aws.waitUntilInstanceRunning — already has maxWaitTime: 300.
  • gh.getRegistrationToken — transient network failure here is rare, and retrying burns registration tokens.
  • gh.waitForRunnerRegistered — already has a 5-minute timeout + 10-second polling loop.
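
For contrast, making RunInstances safely retryable would mean pinning an idempotency token across attempts. A sketch of that, not part of this PR and with the parameter set abbreviated:

```js
const { EC2Client, RunInstancesCommand } = require('@aws-sdk/client-ec2');
const { randomUUID } = require('node:crypto');

async function launchWithFixedToken(params) {
  const ec2 = new EC2Client({});
  // Generated once per logical launch: EC2 treats repeated RunInstances
  // calls carrying the same ClientToken as one request, so a retry
  // returns the original instance instead of billing for a second one.
  const clientToken = randomUUID();
  return withRetry('run_instances', () =>
    ec2.send(new RunInstancesCommand({ ...params, ClientToken: clientToken })),
  );
}
```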

Independent cleanup in stop()

Before: await A; await B — if A throws, B never runs.

After: both wrapped in try/catch, failures accumulated, summary Error thrown at the end. The EC2 instance gets terminated even when GitHub's API is temporarily flaky, and vice versa. This is the most important bit for production — a GitHub outage can no longer leave orphan EC2 instances billing against us.
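
A sketch of that accumulate-and-rethrow shape (identifier names assumed, not the merged stop()):

```js
// Illustrative only: `gh`, `aws`, `runner`, and `instanceId` stand in for
// whatever the real stop() has in scope.
async function stop() {
  const failures = [];
  try {
    await withRetry('remove_runner', () => gh.removeRunner(runner.id));
  } catch (err) {
    failures.push(`remove_runner: ${err.message}`);
  }
  try {
    await withRetry('terminate_instance', () => aws.terminateEc2Instance(instanceId));
  } catch (err) {
    failures.push(`terminate_instance: ${err.message}`);
  }
  if (failures.length > 0) {
    // One summary error so the Actions job still fails loudly.
    throw new Error(`stop() cleanup failed: ${failures.join('; ')}`);
  }
}
```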

Tests

4 new cases in tests/retry.test.js:

  • Resolves on first-try success.
  • Retries on rejection, resolves on later success.
  • Exhausts attempts, re-throws, emits correct warn + error logs.
  • Backoff caps at maxMs (inspected via delay values in warn payloads).

Total: 30 → 34 tests.
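
For instance, the second case might look like this under node:test (the real suite's framework and the helper's import path are assumptions):

```js
const test = require('node:test');
const assert = require('node:assert');
const { withRetry } = require('../src/retry');

test('retries on rejection and resolves on later success', async () => {
  let calls = 0;
  const flaky = async () => {
    calls += 1;
    if (calls < 3) throw new Error('transient');
    return 'ok';
  };
  // Millisecond-scale delays so the test doesn't actually wait seconds.
  assert.strictEqual(await withRetry('flaky_step', flaky, { baseMs: 1, maxMs: 4 }), 'ok');
  assert.strictEqual(calls, 3);
});
```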

Consumer impact

Default behavior matches pre-change for the happy path. Retry + independence only affect transient-failure flows. Log traffic on a successful run is unchanged from Phase 7.

Dogfood

Low risk — changes only error-recovery paths. Provider rotation follows after this merges, same pattern as #22 (which just went green on machulav#185).

Closes #11. Part of plan #15.

## Scope

Add a small retry helper and wrap the two cleanup paths in stop()
with it. Make stop() attempt both cleanups independently so a
GitHub-side failure doesn't prevent EC2 termination and vice versa.

## New module: src/retry.js

  withRetry(step, fn, { attempts, baseMs, maxMs })

- Default: 3 attempts, 2s base, doubled per retry, capped at 10s.
  Worst-case total wait: 2s + 4s + 8s = 14s before giving up.
- Logs each retry via log.warn() with the step name suffixed
  '_retry' so the Actions run log shows attempt sequence.
- On final failure, emits log.error with exhausted: true and
  re-throws the last error.

## Call sites

- gh.removeRunner() — the DELETE /runners/{id} request wrapped in
  withRetry('remove_runner', ...). Idempotent: deleting a runner
  that's already gone returns 404, which is a permanent failure
  (not transient), but the existing get-runner-first path filters
  that out before the DELETE.
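
  Sketch of that guard (getRunnerByName here is a hypothetical
  lookup, not necessarily the repo's real helper):

    async function removeRunnerIfPresent(name) {
      // Look the runner up first; only DELETE when it still exists.
      const runner = await gh.getRunnerByName(name);
      if (!runner) return; // already gone; skip the DELETE that would 404
      await withRetry('remove_runner', () => gh.removeRunner(runner.id));
    }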

- aws.terminateEc2Instance() — the TerminateInstances command
  wrapped in withRetry('terminate_instance', ...). Idempotent:
  terminating an already-terminated instance is a no-op per EC2.

- start() path deliberately not wrapped. RunInstances is not
  idempotent without a ClientToken (billing implication), and
  retrying the runner-version-specific AMI lookup makes no sense.
  If start fails, the whole action fails; the consumer's workflow
  fallback handles retry at the workflow level.

## stop() independence

Before: 'await A; await B'. If A throws, B never runs.

After: both wrapped in try/catch, failures accumulated into a list,
and at the end — if any failed — one Error is thrown summarizing
both. So the EC2 instance gets terminated even when GitHub's API
is temporarily flaky, and vice versa.

## Tests

tests/retry.test.js — 4 cases:
- Resolves on first-try success.
- Retries on rejection and returns the eventual success value.
- Exhausts attempts and re-throws after emitting warn + error logs.
- Backoff caps at maxMs (verified by inspecting the delay values
  emitted in warn logs).

Total: 30 -> 34 tests.

## Consumer impact

Transparent. Default behavior matches what the old code did for
the happy path. The retry / independent-cleanup changes only affect
transient-failure flows that previously surfaced as 'stop-runner
failed' jobs.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
@kurok kurok merged commit 46cf1d0 into feat/al2023-support Apr 21, 2026
4 checks passed
@kurok kurok deleted the feat/11-retries-timeouts branch April 21, 2026 10:08
kurok added a commit to namecheap/terraform-provider-namecheap that referenced this pull request Apr 21, 2026
…cleanup) (#186)

namecheap/ec2-github-runner#23 merged. Phase 5 adds exponential-backoff
retry on removeRunner() + terminateEc2Instance(), and makes stop()
attempt both cleanups independently (a GitHub-side failure no longer
prevents EC2 termination). Default behavior on happy path is unchanged.

Rotation: b1b8d6d (Phase 7, logging) -> 46cf1d0 (Phase 5).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
