feat: harden add-node workflow for data safety #385
Conversation
- Add `PeerCatchupResource` to block add-node progress until the new node's replication slot catches up via `spock.progress.remote_lsn`, preventing data loss during failover after node addition
- Advance replication origin alongside slot during add-node to ensure consistent origin tracking on the subscriber
- Treat disabled/down subscriptions as transient in `wait_for_sync_event` to avoid false failures mid-add-node
- Bump Postgres image tags to Spock 5.0.8
- Add E2E tests covering add-node replication data safety
📝 Walkthrough

This PR adds replication-origin advancement for safely adding new subscriber nodes to multi-node PostgreSQL/Spock clusters. The changes introduce two new resources (PeerCatchupResource and ReplicationOriginAdvanceResource), modify existing resources to output and handle advancement state, integrate them into the node-population orchestration, validate via updated tests, and bump the Spock container image to 5.0.8.

Changes: Replication Origin Advancement for Multi-Node Topologies
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Warning: Review ran into problems. Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than the 30000ms fetch window, so the review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer.
Up to standards ✅

🟢 Issues
| Category | Results |
|---|---|
| Complexity | 1 medium |

🟢 Metrics
| Metric | Results |
|---|---|
| Complexity | 36 |
| Duplication | 0 |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/database/replication_slot_advance_from_cts_resource.go`:
- Around line 151-153: Reset r.AdvancedToLSN to its zero value at the start of
the Create method to avoid leaking a previous run's LSN when early returns
occur; locate the Create function on the ReplicationSlotAdvanceFromCTSResource
receiver and add a single line that clears r.AdvancedToLSN before any
processing/early-return logic so only a successful path sets r.AdvancedToLSN =
targetLSN.
In `@server/internal/database/wait_for_sync_event_resource.go`:
- Around line 104-107: The transient-status branch sleeps with
time.Sleep(pollInterval), which blocks cancellation; change it to a
context-aware wait by replacing the sleep with a select that waits on
time.After(pollInterval) and ctx.Done() (e.g., select { case
<-time.After(pollInterval): continue; case <-ctx.Done(): return ctx.Err() }),
referencing the branch handling postgres.SubStatusDisabled and
postgres.SubStatusDown and using the existing pollInterval and the function's
ctx to abort promptly on cancellation.
In `@server/internal/orchestrator/swarm/images.go`:
- Line 39: The Docker image tags referencing Spock 5.0.8 are invalid and will
break deployments; update the usages of PgEdgeImage and any other imageTag(cfg,
"...-spock5.0.8-standard-1") calls to either revert to the known-working tags
(e.g., the previous non-spock tag variants) or ensure you build and push the
three images (docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dca34b4e-bbb1-4f28-871f-b276c368fcf5
📒 Files selected for processing (11)
- e2e/add_node_data_safety_test.go
- server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
- server/internal/database/operations/populate_nodes.go
- server/internal/database/operations/populate_nodes_test.go
- server/internal/database/peer_catchup_resource.go
- server/internal/database/replication_origin_advance_resource.go
- server/internal/database/replication_slot_advance_from_cts_resource.go
- server/internal/database/resources.go
- server/internal/database/wait_for_sync_event_resource.go
- server/internal/orchestrator/swarm/images.go
- server/internal/postgres/create_db.go
```diff
  versions.addImage(ds.MustPgEdgeVersion("16.13", "5"), &Images{
-     PgEdgeImage: imageTag(cfg, "16.13-spock5.0.7-standard-1"),
+     PgEdgeImage: imageTag(cfg, "16.13-spock5.0.8-standard-1"),
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify Docker images exist for the new Spock 5.0.8 tags

# Get the image repository host from config or use default
REPO_HOST="${DOCKER_SWARM_IMAGE_REPOSITORY_HOST:-docker.io/pgedge}"

# Check each new image tag
for TAG in "16.13-spock5.0.8-standard-1" "17.9-spock5.0.8-standard-1" "18.3-spock5.0.8-standard-1"; do
  echo "Checking ${REPO_HOST}/pgedge-postgres:${TAG}"
  # Try to fetch the image manifest
  if docker manifest inspect "${REPO_HOST}/pgedge-postgres:${TAG}" > /dev/null 2>&1; then
    echo "✓ Image exists: ${TAG}"
  else
    echo "✗ Image NOT found: ${TAG}"
  fi
done
```

Repository: pgEdge/control-plane
Length of output: 411
Docker images for Spock 5.0.8 do not exist—deployments will fail.
The code references three Docker images that do not exist in the repository:
- docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1
- docker.io/pgedge/pgedge-postgres:17.9-spock5.0.8-standard-1
- docker.io/pgedge/pgedge-postgres:18.3-spock5.0.8-standard-1
Either build and push these images before merging, or revert the tag updates to existing images.
Images are pending postgres-images PR #20 — will not merge until those are published.
@tsivaprasad, understood — thanks for the context. As long as postgres-images PR #20 is merged and the images are published before this PR lands, the deployment concern is addressed. You may want to add a note or a PR dependency link to help reviewers track that gate.
🧠 Learnings used
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 266
File: server/internal/workflows/update_database.go:122-153
Timestamp: 2026-02-09T21:47:46.904Z
Learning: In server/internal/workflows/update_database.go and related Go workflow files (e.g., provision_services.go), implement idempotent provisioning semantics: re-running ProvisionServices on already-provisioned service instances should be safe and non-destructive. Upsert operations (e.g., StoreServiceInstance) may temporarily set state to 'creating' but should recover back to 'running' via monitoring. Use SQL-level idempotence for user/service roles (e.g., CREATE ROLE IF NOT EXISTS / ALTER ROLE) and ensure deployment diffs compare desired vs. current state. These patterns support recovery when a user retries with corrected specs after a failure, and should be considered in similar Go workflow files across the repository.
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config_resource.go:90-94
Timestamp: 2026-03-02T20:15:14.804Z
Learning: In MCP config resources (e.g., server/internal/orchestrator/swarm/mcp_config_resource.go and similar tokens.yaml/users.yaml), Refresh should only verify the existence of config.yaml. Tokens.yaml and users.yaml are owned and managed by the MCP server at runtime and should not be rewritten on Update. Do not trigger recreation that would overwrite runtime state. This ownership model is consistent with the comment indicating Do NOT touch tokens.yaml or users.yaml. Apply this guidance to all MCP config resource files in the swarm directory.
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config.go:95-97
Timestamp: 2026-03-02T20:17:36.231Z
Learning: In the pgEdge/control-plane repository, internal (unexported) functions that are called through controlled paths by the resource framework should assume validated input from their invocation boundaries. Do not add defensive nil guards inside such internal helpers; rely on validation at the resource construction/invocation boundaries (e.g., MCPConfigResource.Create/Update) to guarantee data integrity. This guideline applies to all Go files under server/internal where internal helpers serve as implementation details behind controlled APIs.
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 287
File: server/internal/orchestrator/swarm/service_user_role.go:59-60
Timestamp: 2026-03-11T16:05:27.239Z
Learning: In server/internal/orchestrator/swarm/service_user_role.go, treat a ResourceVersion bump (e.g., 2 -> 3) as a documentation/safety marker only. When the resource Identifier key changes (e.g., ServiceInstanceID to ServiceID), the system will automatically recreate resources due to the Type/ID key mismatch. Do not flag a ResourceVersion bump as a missing recreate trigger if an Identifier change is already causing a full recreate for that resource. This guidance is file-specific and should be applied when reviewing changes to service_user_role.go.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 293
File: server/internal/orchestrator/common/paths.go:79-157
Timestamp: 2026-03-24T13:24:39.229Z
Learning: In pgEdge/control-plane, when reviewing Go packages under server/ that use the samber/do injector, apply the `Provide()` requirement only to packages that expose singleton components which are consumed as dependencies by other singleton components via the injector. Do NOT require `Provide()` for pure utility/value-type packages (e.g., path helpers, data structures) that are not injected as singleton dependencies.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 312
File: server/internal/workflows/backend/etcd/store.go:52-58
Timestamp: 2026-03-24T19:00:21.971Z
Learning: In pgEdge/control-plane, treat errors returned from application startup/initialization functions as fatal: a startup failure should trigger the normal application shutdown path, which then calls each component’s `Shutdown`/`Stop` for cleanup. Therefore, when reviewing code in startup/initialization paths (e.g., cache/etcd initialization like `StartCaches`), do not require explicit rollback/cleanup of partially-initialized resources inside the startup function based on its error return. Avoid review suggestions that push intermediate rollback patterns for these fatal startup errors, since the existing shutdown sequence is responsible for teardown.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 353
File: server/internal/orchestrator/common/pgbackrest_stanza.go:26-29
Timestamp: 2026-04-20T13:57:55.700Z
Learning: In the pgEdge control-plane codebase, treat `ResourceVersion()` as a cache-busting/diff-forcing mechanism for resources. Increment `ResourceVersion()` only when making a behavioral change that requires consumers to re-fetch/re-evaluate the resource, but where no serialized fields are changed. Do not bump `ResourceVersion()` solely because a field was removed—removing serialized fields should already produce a diff on its own.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 353
File: server/internal/orchestrator/common/pgbackrest_stanza.go:26-29
Timestamp: 2026-04-20T13:57:55.700Z
Learning: In the pgEdge control-plane codebase, avoid using `DiffIgnore()` with JSON paths for fields that have been removed. `DiffIgnore()` suppresses diffs for the specified JSON field, which prevents the resource from being re-evaluated after a field is removed. For removed fields, do not add them to `DiffIgnore()`; instead let the diff occur naturally, relying on the idempotency of Update methods to handle the change safely.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 377
File: server/internal/monitor/provide.go:60-61
Timestamp: 2026-05-05T12:30:06.386Z
Learning: In pgEdge/control-plane, the storage layer (e.g., WatchOp) uses deferred execution: etcd connections are not created in constructors/initializers. When reviewing Go code, do not report “etcd resource leak” for code paths that merely construct candidate/watch/election objects (e.g., via electionSvc.NewCandidate) as long as the relevant execution method (Exec/Start or equivalent) is not called afterward. Only consider it a potential leak if the code actually starts/executes the watcher/candidate and fails to shut it down (if the project has a corresponding stop/close path).
Summary
This PR hardens the add-node (ZODAN) workflow to prevent data loss during
failover after a node is added, and aligns slot creation with Spock
5.0.8 behavior.
Changes
- Add `PeerCatchupResource` to block add-node progress until the new node's replication slot has caught up via `spock.progress.remote_lsn`
- Advance the replication origin alongside the slot during add-node to ensure consistent origin tracking on the subscriber
- Treat disabled/down subscriptions as transient in `wait_for_sync_event` to avoid false failures mid-add-node

Testing
Verification:
Checklist