
feat: retry + independent cleanup in stop (Phase 5) #23

Merged
kurok merged 1 commit into feat/al2023-support from feat/11-retries-timeouts on Apr 21, 2026

Conversation


@kurok kurok commented Apr 21, 2026

Closes #11. Part of plan #15. Builds on #22's structured logging.

Scope

Adds a small retry helper, wraps the two cleanup paths in stop() with it, and makes stop() attempt both cleanups independently.

src/retry.js

withRetry(step, fn, { attempts = 3, baseMs = 2000, maxMs = 10000 })

Exponential backoff, worst-case total wait 14 s (2 + 4 + 8). Every retry emits log.warn with the step name suffixed _retry; final failure emits log.error with exhausted: true. Callers pass only idempotent operations (DELETE /runners/{id} and TerminateInstances).
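
A minimal sketch of the helper's shape, reconstructed from this description; the logger stub, the interpretation of attempts as retries after the first try, and the log payload field names are assumptions, not the merged code:

```js
// src/retry.js — sketch only, reconstructed from the PR description.

// Stand-in for the Phase 7 structured logger (#22); its real API is not
// shown in this PR, so this stub only illustrates the call shape.
const log = {
  warn: (step, fields) => console.warn(JSON.stringify({ level: 'warn', step, ...fields })),
  error: (step, fields) => console.error(JSON.stringify({ level: 'error', step, ...fields })),
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(step, fn, { attempts = 3, baseMs = 2000, maxMs = 10000 } = {}) {
  let lastErr;
  // One initial try plus up to `attempts` retries, matching the stated
  // worst case of 2 s + 4 s + 8 s = 14 s of waiting before giving up.
  for (let retry = 0; retry <= attempts; retry += 1) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (retry === attempts) break; // out of retries
      const delayMs = Math.min(baseMs * 2 ** retry, maxMs); // 2 s, 4 s, 8 s, ...
      log.warn(`${step}_retry`, { attempt: retry + 1, delayMs, error: err.message });
      await sleep(delayMs);
    }
  }
  log.error(step, { exhausted: true, error: lastErr.message });
  throw lastErr;
}

module.exports = { withRetry };
```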

Call sites wrapped

  • gh.removeRunner() — DELETE request.
  • aws.terminateEc2Instance() — TerminateInstances command.
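
Both wraps are one-liners at the call site. An illustrative fragment (argument lists assumed, not the merged code):

```js
// Inside stop(), each cleanup call becomes:
await withRetry('remove_runner', () => gh.removeRunner(runner.id));
await withRetry('terminate_instance', () => aws.terminateEc2Instance(instanceId));
```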

Call sites deliberately not wrapped

  • aws.startEc2Instance() / RunInstances — not idempotent without a ClientToken, so retrying on its own has billing implications (see the sketch after this list).
  • aws.waitUntilInstanceRunning — already has maxWaitTime: 300.
  • gh.getRegistrationToken — transient network failure here is rare, and retrying burns registration tokens.
  • gh.waitForRunnerRegistered — already has a 5-minute timeout + 10-second polling loop.
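
For contrast, making RunInstances safely retryable would mean pinning an idempotency token across attempts. A sketch of that, not part of this PR and with the parameter set abbreviated:

```js
const { EC2Client, RunInstancesCommand } = require('@aws-sdk/client-ec2');
const { randomUUID } = require('node:crypto');

async function launchWithFixedToken(params) {
  const ec2 = new EC2Client({});
  // Generated once per logical launch: EC2 treats repeated RunInstances
  // calls carrying the same ClientToken as one request, so a retry
  // returns the original instance instead of billing for a second one.
  const clientToken = randomUUID();
  return withRetry('run_instances', () =>
    ec2.send(new RunInstancesCommand({ ...params, ClientToken: clientToken })),
  );
}
```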

Independent cleanup in stop()

Before: await A; await B — if A throws, B never runs.

After: both wrapped in try/catch, failures accumulated, summary Error thrown at the end. The EC2 instance gets terminated even when GitHub's API is temporarily flaky, and vice versa. This is the most important bit for production — a GitHub outage can no longer leave orphan EC2 instances billing against us.
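
A sketch of that accumulate-and-rethrow shape (identifier names assumed, not the merged stop()):

```js
// Illustrative only: `gh`, `aws`, `runner`, and `instanceId` stand in for
// whatever the real stop() has in scope.
async function stop() {
  const failures = [];
  try {
    await withRetry('remove_runner', () => gh.removeRunner(runner.id));
  } catch (err) {
    failures.push(`remove_runner: ${err.message}`);
  }
  try {
    await withRetry('terminate_instance', () => aws.terminateEc2Instance(instanceId));
  } catch (err) {
    failures.push(`terminate_instance: ${err.message}`);
  }
  if (failures.length > 0) {
    // One summary error so the Actions job still fails loudly.
    throw new Error(`stop() cleanup failed: ${failures.join('; ')}`);
  }
}
```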

Tests

4 new cases in tests/retry.test.js:

  • Resolves on first-try success.
  • Retries on rejection, resolves on later success.
  • Exhausts attempts, re-throws, emits correct warn + error logs.
  • Backoff caps at maxMs (inspected via delay values in warn payloads).

Total: 30 → 34 tests.
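
For instance, the second case might look like this under node:test (the real suite's framework and the helper's import path are assumptions):

```js
const test = require('node:test');
const assert = require('node:assert');
const { withRetry } = require('../src/retry');

test('retries on rejection and resolves on later success', async () => {
  let calls = 0;
  const flaky = async () => {
    calls += 1;
    if (calls < 3) throw new Error('transient');
    return 'ok';
  };
  // Millisecond-scale delays so the test doesn't actually wait seconds.
  assert.strictEqual(await withRetry('flaky_step', flaky, { baseMs: 1, maxMs: 4 }), 'ok');
  assert.strictEqual(calls, 3);
});
```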

Consumer impact

Default behavior matches pre-change for the happy path. Retry + independence only affect transient-failure flows. Log traffic on a successful run is unchanged from Phase 7.

Dogfood

Low risk — changes only error-recovery paths. Provider rotation follows after this merges, same pattern as #22 (which just went green on machulav#185).

Closes #11. Part of plan #15.

## Scope

Add a small retry helper and wrap the two cleanup paths in stop()
with it. Make stop() attempt both cleanups independently so a
GitHub-side failure doesn't prevent EC2 termination and vice versa.

## New module: src/retry.js

  withRetry(step, fn, { attempts, baseMs, maxMs })

- Default: 3 attempts, 2s base, doubled per retry, capped at 10s.
  Worst-case total wait: 2s + 4s + 8s = 14s before giving up.
- Logs each retry via log.warn() with the step name suffixed
  '_retry' so the Actions run log shows attempt sequence.
- On final failure, emits log.error with exhausted: true and
  re-throws the last error.

## Call sites

- gh.removeRunner() — the DELETE /runners/{id} request wrapped in
  withRetry('remove_runner', ...). Idempotent: deleting a runner
  that's already gone returns 404, which is a permanent failure
  (not transient), but the existing get-runner-first path filters
  that out before the DELETE.
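
  Sketch of that guard (getRunnerByName here is a hypothetical
  lookup, not necessarily the repo's real helper):

    async function removeRunnerIfPresent(name) {
      // Look the runner up first; only DELETE when it still exists.
      const runner = await gh.getRunnerByName(name);
      if (!runner) return; // already gone; skip the DELETE that would 404
      await withRetry('remove_runner', () => gh.removeRunner(runner.id));
    }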

- aws.terminateEc2Instance() — the TerminateInstances command
  wrapped in withRetry('terminate_instance', ...). Idempotent:
  terminating an already-terminated instance is a no-op per EC2.

- start() path deliberately not wrapped. RunInstances is not
  idempotent without a ClientToken (billing implication), and
  retrying the runner-version-specific AMI lookup makes no sense.
  If start fails, the whole action fails; the consumer's workflow
  fallback handles retry at the workflow level.

## stop() independence

Before: 'await A; await B'. If A throws, B never runs.

After: both wrapped in try/catch, failures accumulated into a list,
and at the end — if any failed — one Error is thrown summarizing
both. So the EC2 instance gets terminated even when GitHub's API
is temporarily flaky, and vice versa.

## Tests

tests/retry.test.js — 4 cases:
- Resolves on first-try success.
- Retries on rejection and returns the eventual success value.
- Exhausts attempts and re-throws after emitting warn + error logs.
- Backoff caps at maxMs (verified by inspecting the delay values
  emitted in warn logs).

Total: 30 -> 34 tests.

## Consumer impact

Transparent. Default behavior matches what the old code did for
the happy path. The retry / independent-cleanup changes only affect
transient-failure flows that previously surfaced as 'stop-runner
failed' jobs.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
@kurok kurok merged commit 46cf1d0 into feat/al2023-support Apr 21, 2026
4 checks passed
@kurok kurok deleted the feat/11-retries-timeouts branch April 21, 2026 10:08
kurok added a commit to namecheap/terraform-provider-namecheap that referenced this pull request Apr 21, 2026
…cleanup) (#186)

namecheap/ec2-github-runner#23 merged. Phase 5 adds exponential-backoff
retry on removeRunner() + terminateEc2Instance(), and makes stop()
attempt both cleanups independently (a GitHub-side failure no longer
prevents EC2 termination). Default behavior on happy path is unchanged.

Rotation: b1b8d6d (Phase 7, logging) -> 46cf1d0 (Phase 5).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
