Part of plan #15. Phase 5 — Lifecycle & Cleanup Reliability.
Problem
The current `stop-runner` path is best-effort:
- `removeRunner()` calls the GitHub API once. If it 500s or times out, the runner stays registered in GitHub (visible in Settings → Actions → Runners indefinitely).
- `terminateEc2Instance()` calls EC2 `TerminateInstances` once. If the AWS call times out, the instance keeps running (and keeps billing).
- No explicit timeout on `waitForInstanceRunning` / `waitForRunnerRegistered`; a stuck call could pin the job.

Phase 4's `--ephemeral` flag mitigates the "stale runner" case via GitHub-side auto-deregistration, but that covers only one of the two cleanup paths. Explicit retries on both paths are still the defense-in-depth answer.
Target
- Retry `removeRunner()` with exponential backoff (3 attempts, base 2 s, max 10 s).
- Retry `terminateEc2Instance()` with the same policy.
- Bounded timeout on `waitForRunnerRegistered` (default 5 min; input-overridable); see the sketch after this list.
- Bounded timeout on `waitForInstanceRunning` (default 5 min).
- On `mode: stop`, attempt both cleanups even if one throws; do not let a GitHub API failure prevent EC2 termination, or vice versa.
- Structured log line on every attempt so the Actions run summary shows what was tried.
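
Both bounded-timeout items could be served by one small wrapper. A minimal sketch in TypeScript; `withTimeout`, its signature, and the usage line are assumptions, not the action's current API:

```ts
// Illustrative sketch: reject if the wrapped operation exceeds `ms`.
// Helper name and signature are assumptions, not existing code in the action.
async function withTimeout<T>(fn: () => Promise<T>, ms: number, label: string): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever settles first wins; on timeout the job unblocks even though
    // the underlying call may still be in flight.
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// e.g. await withTimeout(() => waitForRunnerRegistered(), 5 * 60 * 1000, 'waitForRunnerRegistered');
```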
Pseudocode shape
```ts
async function stop() {
  const errors: Array<[string, Error]> = [];
  // Attempt the GitHub-side cleanup; record the failure instead of throwing.
  try {
    await withRetry(() => gh.removeRunner(), { attempts: 3, backoff: 2000 });
  } catch (e) {
    errors.push(['gh.removeRunner', e as Error]);
  }
  // Attempt the AWS-side cleanup regardless of the GitHub outcome.
  try {
    await withRetry(() => aws.terminateEc2Instance(), { attempts: 3, backoff: 2000 });
  } catch (e) {
    errors.push(['aws.terminateEc2Instance', e as Error]);
  }
  if (errors.length) {
    for (const [where, err] of errors) core.error(`${where}: ${err.message}`);
    core.setFailed(`stop mode completed with ${errors.length} cleanup failure(s)`);
  }
}
```
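
The `withRetry` helper the pseudocode assumes is not defined above. A minimal sketch; the option names `attempts` and `backoff` come from the pseudocode, `maxBackoff` and everything else are illustrative:

```ts
import * as core from '@actions/core';

// Illustrative retry helper: exponential backoff, base delay doubling per
// attempt and capped (2 s, 4 s, 8 s, capped at 10 s per the target above).
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { attempts: number; backoff: number; maxBackoff?: number },
): Promise<T> {
  const cap = opts.maxBackoff ?? 10_000;
  let lastErr: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastErr = e;
      if (attempt === opts.attempts) break;
      const delay = Math.min(opts.backoff * 2 ** (attempt - 1), cap);
      // Structured per-attempt log line, per the target above.
      core.info(`cleanup attempt ${attempt}/${opts.attempts} failed; retrying in ${delay} ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```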
Compatibility with consumers
Fully transparent improvement. Consumers today already guard `stop-runner` with `if: always() && ...` so the step runs on acceptance-test failure; the retry + bounded timeout make that guard more reliable.
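
For reference, a consumer-side guard of that shape; the job names, the exact condition, and the action reference are illustrative, not taken from a real consumer workflow:

```yaml
stop-runner:
  needs: [start-runner, acceptance-tests]
  # Run even when acceptance-tests failed, but only if a runner was started.
  if: always() && needs.start-runner.result == 'success'
  runs-on: ubuntu-latest
  steps:
    - uses: our-org/ec2-runner-action@v1   # hypothetical action reference
      with:
        mode: stop
        # Other required inputs (token, instance id, ...) omitted for brevity.
```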
Acceptance criteria
- `stop()` attempts both cleanups independently; neither short-circuits the other.
- `aws-timeout-seconds` and `github-timeout-seconds` inputs (optional, defaults sane); one possible shape is sketched below.
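
A possible declaration of those two inputs in action.yml; the input names come from the criteria above, while the descriptions and the 300 s defaults (matching the 5-minute target) are assumptions:

```yaml
inputs:
  aws-timeout-seconds:
    description: 'Upper bound on waitForInstanceRunning, in seconds.'   # assumed wording
    required: false
    default: '300'   # 5 min, per the target above
  github-timeout-seconds:
    description: 'Upper bound on waitForRunnerRegistered, in seconds.'  # assumed wording
    required: false
    default: '300'
```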