fix: revert non-root runner bootstrap, keep the rest of Phase 4#19
Merged
Conversation
Phase 4 (#18) landed three independent hardenings in one PR: - New configurable runner-version input (no runtime impact) - Ephemeral + checksum + set -euo pipefail (additive safety) - Root to non-root runner user via sudo-heredoc (behavioral change) The dogfood rotation on terraform-provider-namecheap#182 failed — 'Start self-hosted EC2 runner' timed out at 6m15s waiting for runner registration. EC2 instance booted fine, but whatever the user-data did inside the instance, it didn't end at './run.sh' polling GitHub. We can't post-mortem directly because the instance is ephemeral and already terminated. Fix-forward strategy: revert ONLY the non-root transition (highest-probability culprit among the three axes), keep everything else from Phase 4. If the Phase 4 dogfood rotation passes after this revert, the root-to-runner sudo-heredoc is the breaker and can be investigated as an isolated follow-up (likely candidates: sudoers config on the hardened AMI, SELinux context, config.sh writing outside its own directory, or my heredoc quoting). Landing the safer pieces now unblocks Phases 5/6/7. Kept: - runner-version input (Phase 4's main feature) - set -euo pipefail - --ephemeral + --unattended + --disableupdate on config.sh - SHA-256 verification of the runner tarball - Clearer bash syntax ('case "$(uname -m)"', double-quoted vars) Reverted: - useradd + sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc - RUNNER_ALLOW_RUNASROOT=1 restored (runner executes as root again) The non-root goal isn't lost — a follow-up issue will propose it again, this time with better instrumentation so we can see what failed inside the instance. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
This was referenced Apr 21, 2026
Closed
kurok
added a commit
that referenced
this pull request
Apr 21, 2026
* revert: full rollback of Phase 4 bootstrap changes Phase 4 attempts #18 (with non-root) and #19 (without non-root but keeping --ephemeral + checksum + set -euo pipefail + runner-version input) BOTH failed the provider dogfood with the same 6m15s runner registration timeout (terraform-provider-namecheap#182 and machulav#183). The fix-forward in #19 narrowed the suspect set from 'all Phase 4 changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate flag, checksum verify, parameterized bash vars'. Still not isolated. Full rollback here restores the known-good Phase 1 bootstrap exactly. Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38, jest tests, .gitattributes). Phase 4 work is NOT abandoned — it moves to follow-up issues where each change lands on its own with its own dogfood, so the next failure isolates itself to a single axis instead of requiring bisection across five simultaneous changes. Files reverted to match a1bd2f9 (Phase 1 tip): - action.yml (drops runner-version input) - src/aws.js (original 12-line bash array, yum install libicu make, RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify) - src/config.js (drops runnerVersion field) - tests/config.test.js (drops runner-version test block, 23 -> 21 tests) Dist rebuilt against the reverted src (verify-dist will confirm). Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com> * ci: revert verify-runner-url extractor to grep src/aws.js Paired with the full Phase 4 revert — now that action.yml no longer has a runner-version default, the Phase 4 version of verify-runner-url that reads action.yml can't find the version. Restore the original extractor that greps the literal URL out of src/aws.js. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com> --------- Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
This was referenced Apr 21, 2026
kurok
added a commit
that referenced
this pull request
Apr 21, 2026
…cksum table (#26) Closes #20. Supersedes the reverted #18 / #19 / #21. Implements the full Phase 4 bootstrap hardening from issue #10, with the root-cause fix from #20 baked in. Key differences from the earlier failed attempts: ## The fix for the actual failure Previous attempts died at: curl -fsSL <tarball>.sha256 | awk '{print }' with a 404 (actions/runner doesn't publish per-tarball sidecar files, empirically confirmed via aws ec2 get-console-output on a probe instance — see #20). This PR replaces that with a hardcoded table of expected hashes in src/runner-checksums.js, keyed by 'arch-version'. Two x86_64 / arm64 entries for the currently-pinned v2.333.1, sourced from the release body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces table-vs-upstream consistency on every PR (see pr.yml). ## Everything else from Phase 4 - Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc). RUNNER_ALLOW_RUNASROOT=1 escape hatch removed. - New 'runner-version' input in action.yml (default '2.333.1'). To override, add matching x64+arm64 SHAs to runner-checksums.js in the same PR — verify-runner-url CI will reject the change if the hashes don't match upstream. - --ephemeral --unattended --disableupdate on config.sh. GitHub auto-deregisters the runner after its job; disableupdate keeps the binary stable during the short ephemeral session. - set -euo pipefail on both the outer and inner (runner-user) shells. The earlier fatal failure under set -e was the .sha256 404, which no longer exists. - Paramaterized RUNNER_VERSION / TARBALL / BASE bash vars. ## Tests tests/runner-checksums.test.js — 6 new cases covering the table shape, hex format, x64+arm64 parity per version, lookup returns for known/unknown keys. tests/config.test.js — 2 new cases for the runner-version input (default fallback + override). Total: 36 -> 44 tests. ## CI: verify-runner-url overhaul The job now parses the runner-version from action.yml, then: 1. HEADs the Linux x64 release asset (unchanged). 2. Fetches the release body via 'gh api'. 3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments. 4. Cross-checks against the values lookup() returns from src/runner-checksums.js. Drift between the hardcoded table and upstream fails CI at code- review time, not at runtime. ## Dogfood plan (MUCH more careful this time) Provider SHA-pin rotation after merge, same pattern as prior phases. This time I have full EC2 console-output diagnostic capability via the recipe saved in my notes — any new bootstrap failure should be trivially diagnosable rather than opaque. Closing #20 on merge. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Part of #10's fallout — see also provider-side dogfood failure at terraform-provider-namecheap#182.
What happened
#18 (Phase 4) landed three independent hardenings in one PR:
runner-versioninput (parametric, no runtime impact).set -euo pipefail(additive safety).sudo -u runner -H bash <<'RUNNER_BOOTSTRAP'heredoc (behavioral change).Dogfood rotation on the provider (machulav#182) then failed:
Start self-hosted EC2 runnertimed out at 6m15s waiting for runner registration. The instance booted, but the user-data never got to./run.shpolling GitHub. Can't post-mortem — the instance is ephemeral and already terminated.What this PR does
Revert only item (3) — the sudo-to-runner wrapping, because that's the highest-probability culprit among the three axes. Keep (1) + (2).
runner-versioninputuseradd runnerset -euo pipefailsudo -u runner -H bash <<'RUNNER_BOOTSTRAP'heredoc--ephemeral --unattended --disableupdateRUNNER_ALLOW_RUNASROOT=1escape hatch restoredRUNNER_VERSION/TARBALL/BASEbash varscase "$(uname -m)"syntax, double-quoted varsFollow-up for the non-root goal
Not abandoning the hardening. A new issue will propose the non-root move with better instrumentation — e.g., output the user-data script to CloudWatch or a bucket before it executes, so the EC2 console log shows progress and we can see where it breaks. Likely candidates to investigate when we return to this:
DEVOPS/hardened-amazon-linux2023AMI (requiretty, defaultDefaults env_reset, etc.).runneruser's home.config.shwriting state files outside its own directory.Verification
namecheap/terraform-provider-namecheapci.yml pin → new feat/al2023-support tip). If it goes green, (1)+(2) are vindicated and Phases 5/6/7 can stack on top. If it still fails, the regression is in items (1) or (2) and I'll investigate from there.