feat: harden add-node workflow for data safety #385
Conversation
- Add `PeerCatchupResource` to block add-node progress until the new node's replication slot catches up via `spock.progress.remote_lsn`, preventing data loss during failover after node addition
- Advance replication origin alongside slot during add-node to ensure consistent origin tracking on the subscriber
- Treat disabled/down subscriptions as transient in `wait_for_sync_event` to avoid false failures mid-add-node
- Bump Postgres image tags to Spock 5.0.8
- Add E2E tests covering add-node replication data safety
📝 Walkthrough

This PR adds replication-origin advancement for safely adding new subscriber nodes to multi-node PostgreSQL/Spock clusters. The changes introduce two new resources (PeerCatchupResource and ReplicationOriginAdvanceResource), modify existing resources to output and handle advancement state, integrate them into the node-population orchestration, validate via updated tests, and bump the Spock container image to 5.0.8.

Changes: Replication Origin Advancement for Multi-Node Topologies
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Warning: Review ran into problems. Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than the 30000ms fetch window, so the review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer.
Up to standards ✅

🟢 Issues
| Category | Results |
|---|---|
| Complexity | 1 medium |

🟢 Metrics
| Metric | Results |
|---|---|
| Complexity | 36 |
| Duplication | 0 |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/database/replication_slot_advance_from_cts_resource.go`:
- Around line 151-153: Reset r.AdvancedToLSN to its zero value at the start of
the Create method to avoid leaking a previous run's LSN when early returns
occur; locate the Create function on the ReplicationSlotAdvanceFromCTSResource
receiver and add a single line that clears r.AdvancedToLSN before any
processing/early-return logic so only a successful path sets r.AdvancedToLSN =
targetLSN.
In `@server/internal/database/wait_for_sync_event_resource.go`:
- Around line 104-107: The transient-status branch sleeps with
time.Sleep(pollInterval), which blocks cancellation; change it to a
context-aware wait by replacing the sleep with a select that waits on
time.After(pollInterval) and ctx.Done() (e.g., select { case
<-time.After(pollInterval): continue; case <-ctx.Done(): return ctx.Err() }),
referencing the branch handling postgres.SubStatusDisabled and
postgres.SubStatusDown and using the existing pollInterval and the function's
ctx to abort promptly on cancellation.
In `@server/internal/orchestrator/swarm/images.go`:
- Line 39: The Docker image tags referencing Spock 5.0.8 are invalid and will
break deployments; update the usages of PgEdgeImage and any other imageTag(cfg,
"...-spock5.0.8-standard-1") calls to either revert to the known-working tags
(e.g., the previous non-spock tag variants) or ensure you build and push the
three images (docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dca34b4e-bbb1-4f28-871f-b276c368fcf5
📒 Files selected for processing (11)
- e2e/add_node_data_safety_test.go
- server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
- server/internal/database/operations/populate_nodes.go
- server/internal/database/operations/populate_nodes_test.go
- server/internal/database/peer_catchup_resource.go
- server/internal/database/replication_origin_advance_resource.go
- server/internal/database/replication_slot_advance_from_cts_resource.go
- server/internal/database/resources.go
- server/internal/database/wait_for_sync_event_resource.go
- server/internal/orchestrator/swarm/images.go
- server/internal/postgres/create_db.go
```diff
  versions.addImage(ds.MustPgEdgeVersion("16.13", "5"), &Images{
-     PgEdgeImage: imageTag(cfg, "16.13-spock5.0.7-standard-1"),
+     PgEdgeImage: imageTag(cfg, "16.13-spock5.0.8-standard-1"),
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify Docker images exist for the new Spock 5.0.8 tags

# Get the image repository host from config or use default
REPO_HOST="${DOCKER_SWARM_IMAGE_REPOSITORY_HOST:-docker.io/pgedge}"

# Check each new image tag
for TAG in "16.13-spock5.0.8-standard-1" "17.9-spock5.0.8-standard-1" "18.3-spock5.0.8-standard-1"; do
  echo "Checking ${REPO_HOST}/pgedge-postgres:${TAG}"
  # Try to fetch the image manifest
  if docker manifest inspect "${REPO_HOST}/pgedge-postgres:${TAG}" > /dev/null 2>&1; then
    echo "✓ Image exists: ${TAG}"
  else
    echo "✗ Image NOT found: ${TAG}"
  fi
done
```

Repository: pgEdge/control-plane
Length of output: 411
Docker images for Spock 5.0.8 do not exist—deployments will fail.
The code references three Docker images that do not exist in the repository:
- docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1
- docker.io/pgedge/pgedge-postgres:17.9-spock5.0.8-standard-1
- docker.io/pgedge/pgedge-postgres:18.3-spock5.0.8-standard-1
Either build and push these images before merging, or revert the tag updates to existing images.
Images are pending postgres-images PR #20 — will not merge until those are published.
@tsivaprasad, understood — thanks for the context. As long as postgres-images PR #20 is merged and the images are published before this PR lands, the deployment concern is addressed. You may want to add a note or a PR dependency link to help reviewers track that gate.
🧠 Learnings used
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 266
File: server/internal/workflows/update_database.go:122-153
Timestamp: 2026-02-09T21:47:46.904Z
Learning: In server/internal/workflows/update_database.go and related Go workflow files (e.g., provision_services.go), implement idempotent provisioning semantics: re-running ProvisionServices on already-provisioned service instances should be safe and non-destructive. Upsert operations (e.g., StoreServiceInstance) may temporarily set state to 'creating' but should recover back to 'running' via monitoring. Use SQL-level idempotence for user/service roles (e.g., CREATE ROLE IF NOT EXISTS / ALTER ROLE) and ensure deployment diffs compare desired vs. current state. These patterns support recovery when a user retries with corrected specs after a failure, and should be considered in similar Go workflow files across the repository.
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config_resource.go:90-94
Timestamp: 2026-03-02T20:15:14.804Z
Learning: In MCP config resources (e.g., server/internal/orchestrator/swarm/mcp_config_resource.go and similar tokens.yaml/users.yaml), Refresh should only verify the existence of config.yaml. Tokens.yaml and users.yaml are owned and managed by the MCP server at runtime and should not be rewritten on Update. Do not trigger recreation that would overwrite runtime state. This ownership model is consistent with the comment indicating Do NOT touch tokens.yaml or users.yaml. Apply this guidance to all MCP config resource files in the swarm directory.
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config.go:95-97
Timestamp: 2026-03-02T20:17:36.231Z
Learning: In the pgEdge/control-plane repository, internal (unexported) functions that are called through controlled paths by the resource framework should assume validated input from their invocation boundaries. Do not add defensive nil guards inside such internal helpers; rely on validation at the resource construction/invocation boundaries (e.g., MCPConfigResource.Create/Update) to guarantee data integrity. This guideline applies to all Go files under server/internal where internal helpers serve as implementation details behind controlled APIs.
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 287
File: server/internal/orchestrator/swarm/service_user_role.go:59-60
Timestamp: 2026-03-11T16:05:27.239Z
Learning: In server/internal/orchestrator/swarm/service_user_role.go, treat a ResourceVersion bump (e.g., 2 -> 3) as a documentation/safety marker only. When the resource Identifier key changes (e.g., ServiceInstanceID to ServiceID), the system will automatically recreate resources due to the Type/ID key mismatch. Do not flag a ResourceVersion bump as a missing recreate trigger if an Identifier change is already causing a full recreate for that resource. This guidance is file-specific and should be applied when reviewing changes to service_user_role.go.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 293
File: server/internal/orchestrator/common/paths.go:79-157
Timestamp: 2026-03-24T13:24:39.229Z
Learning: In pgEdge/control-plane, when reviewing Go packages under server/ that use the samber/do injector, apply the `Provide()` requirement only to packages that expose singleton components which are consumed as dependencies by other singleton components via the injector. Do NOT require `Provide()` for pure utility/value-type packages (e.g., path helpers, data structures) that are not injected as singleton dependencies.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 312
File: server/internal/workflows/backend/etcd/store.go:52-58
Timestamp: 2026-03-24T19:00:21.971Z
Learning: In pgEdge/control-plane, treat errors returned from application startup/initialization functions as fatal: a startup failure should trigger the normal application shutdown path, which then calls each component’s `Shutdown`/`Stop` for cleanup. Therefore, when reviewing code in startup/initialization paths (e.g., cache/etcd initialization like `StartCaches`), do not require explicit rollback/cleanup of partially-initialized resources inside the startup function based on its error return. Avoid review suggestions that push intermediate rollback patterns for these fatal startup errors, since the existing shutdown sequence is responsible for teardown.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 353
File: server/internal/orchestrator/common/pgbackrest_stanza.go:26-29
Timestamp: 2026-04-20T13:57:55.700Z
Learning: In the pgEdge control-plane codebase, treat `ResourceVersion()` as a cache-busting/diff-forcing mechanism for resources. Increment `ResourceVersion()` only when making a behavioral change that requires consumers to re-fetch/re-evaluate the resource, but where no serialized fields are changed. Do not bump `ResourceVersion()` solely because a field was removed—removing serialized fields should already produce a diff on its own.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 353
File: server/internal/orchestrator/common/pgbackrest_stanza.go:26-29
Timestamp: 2026-04-20T13:57:55.700Z
Learning: In the pgEdge control-plane codebase, avoid using `DiffIgnore()` with JSON paths for fields that have been removed. `DiffIgnore()` suppresses diffs for the specified JSON field, which prevents the resource from being re-evaluated after a field is removed. For removed fields, do not add them to `DiffIgnore()`; instead let the diff occur naturally, relying on the idempotency of Update methods to handle the change safely.
Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 377
File: server/internal/monitor/provide.go:60-61
Timestamp: 2026-05-05T12:30:06.386Z
Learning: In pgEdge/control-plane, the storage layer (e.g., WatchOp) uses deferred execution: etcd connections are not created in constructors/initializers. When reviewing Go code, do not report “etcd resource leak” for code paths that merely construct candidate/watch/election objects (e.g., via electionSvc.NewCandidate) as long as the relevant execution method (Exec/Start or equivalent) is not called afterward. Only consider it a potential leak if the code actually starts/executes the watcher/candidate and fails to shut it down (if the project has a corresponding stop/close path).
Summary
This PR hardens the add-node (ZODAN) workflow to prevent data loss during
failover after a node is added, and aligns slot creation with Spock
5.0.8 behavior.
Changes
- Add `PeerCatchupResource` to block add-node progress until the new node's replication slot has caught up via `spock.progress.remote_lsn`
- Advance the replication origin alongside the slot during add-node to ensure consistent origin tracking on the subscriber
- Treat disabled/down subscriptions as transient in `wait_for_sync_event` to avoid false failures mid-add-node

Testing
Verification:
Checklist