Part of plan #15. Phase 5 — Lifecycle & Cleanup Reliability.
Problem
The current `stop-runner` path is best-effort:
- `removeRunner()` calls the GitHub API once. If it 500s or times out, the runner stays registered in GitHub (visible in Settings → Actions → Runners indefinitely).
- `terminateEc2Instance()` calls EC2 `TerminateInstances` once. If the AWS call times out, the instance keeps running (and keeps billing).
- No explicit timeout on `waitForInstanceRunning` / `waitForRunnerRegistered`; a stuck call could pin the job.

Phase 4's `--ephemeral` flag mitigates the "stale runner" case via GitHub-side auto-deregistration, but that covers only one of the two cleanup paths. Explicit retries on both paths are still the defense-in-depth answer.
Target
- Retry `removeRunner()` with exponential backoff (3 attempts, base 2 s, max 10 s).
- Retry `terminateEc2Instance()` with the same policy.
- Bounded timeout on `waitForRunnerRegistered` (default 5 min; input-overridable); see the sketch after this list.
- Bounded timeout on `waitForInstanceRunning` (default 5 min).
- On `mode: stop`, attempt both cleanups even if one throws; do not let a GitHub API failure prevent EC2 termination, or vice versa.
- Structured log line on every attempt so the Actions run summary shows what was tried.
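
Both bounded-timeout items could be served by one small wrapper. A minimal sketch in TypeScript; `withTimeout`, its signature, and the usage line are assumptions, not the action's current API:

```ts
// Illustrative sketch: reject if the wrapped operation exceeds `ms`.
// Helper name and signature are assumptions, not existing code in the action.
async function withTimeout<T>(fn: () => Promise<T>, ms: number, label: string): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever settles first wins; on timeout the job unblocks even though
    // the underlying call may still be in flight.
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// e.g. await withTimeout(() => waitForRunnerRegistered(), 5 * 60 * 1000, 'waitForRunnerRegistered');
```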
Pseudocode shape
```ts
async function stop() {
  const errors: Array<[string, Error]> = [];
  // Attempt the GitHub-side cleanup; record the failure instead of throwing.
  try {
    await withRetry(() => gh.removeRunner(), { attempts: 3, backoff: 2000 });
  } catch (e) {
    errors.push(['gh.removeRunner', e as Error]);
  }
  // Attempt the AWS-side cleanup regardless of the GitHub outcome.
  try {
    await withRetry(() => aws.terminateEc2Instance(), { attempts: 3, backoff: 2000 });
  } catch (e) {
    errors.push(['aws.terminateEc2Instance', e as Error]);
  }
  if (errors.length) {
    for (const [where, err] of errors) core.error(`${where}: ${err.message}`);
    core.setFailed(`stop mode completed with ${errors.length} cleanup failure(s)`);
  }
}
```
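
The `withRetry` helper the pseudocode assumes is not defined above. A minimal sketch; the option names `attempts` and `backoff` come from the pseudocode, `maxBackoff` and everything else are illustrative:

```ts
import * as core from '@actions/core';

// Illustrative retry helper: exponential backoff, base delay doubling per
// attempt and capped (2 s, 4 s, 8 s, capped at 10 s per the target above).
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { attempts: number; backoff: number; maxBackoff?: number },
): Promise<T> {
  const cap = opts.maxBackoff ?? 10_000;
  let lastErr: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastErr = e;
      if (attempt === opts.attempts) break;
      const delay = Math.min(opts.backoff * 2 ** (attempt - 1), cap);
      // Structured per-attempt log line, per the target above.
      core.info(`cleanup attempt ${attempt}/${opts.attempts} failed; retrying in ${delay} ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```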
Compatibility with consumers
Fully transparent improvement. Consumers today already guard `stop-runner` with `if: always() && ...` so the step runs on acceptance-test failure; the retry + bounded timeout make that guard more reliable.
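
For reference, a consumer-side guard of that shape; the job names, the exact condition, and the action reference are illustrative, not taken from a real consumer workflow:

```yaml
stop-runner:
  needs: [start-runner, acceptance-tests]
  # Run even when acceptance-tests failed, but only if a runner was started.
  if: always() && needs.start-runner.result == 'success'
  runs-on: ubuntu-latest
  steps:
    - uses: our-org/ec2-runner-action@v1   # hypothetical action reference
      with:
        mode: stop
        # Other required inputs (token, instance id, ...) omitted for brevity.
```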
Acceptance criteria
- `stop()` attempts both cleanups independently; neither short-circuits the other.
- `aws-timeout-seconds` and `github-timeout-seconds` inputs (optional, defaults sane); one possible shape is sketched below.
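
A possible declaration of those two inputs in action.yml; the input names come from the criteria above, while the descriptions and the 300 s defaults (matching the 5-minute target) are assumptions:

```yaml
inputs:
  aws-timeout-seconds:
    description: 'Upper bound on waitForInstanceRunning, in seconds.'   # assumed wording
    required: false
    default: '300'   # 5 min, per the target above
  github-timeout-seconds:
    description: 'Upper bound on waitForRunnerRegistered, in seconds.'  # assumed wording
    required: false
    default: '300'
```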