feat: harden add-node workflow for data safety #385

Draft

tsivaprasad wants to merge 2 commits into main from
PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes

Conversation

@tsivaprasad (Contributor) commented May 14, 2026

Summary

This PR hardens the add-node (ZODAN) workflow to prevent data loss during
failover after a node is added, and aligns slot creation with Spock
5.0.8 behavior.

Changes

  • Add PeerCatchupResource to block add-node progress until the new
    node's replication slot catches up via spock.progress.remote_lsn,
    preventing data loss during failover after node addition
  • Advance the replication origin alongside the slot during add-node to
    ensure consistent origin tracking on the subscriber
  • Treat disabled/down subscriptions as transient in
    wait_for_sync_event to avoid false failures mid-add-node
  • Bump Postgres image tags to Spock 5.0.8
  • Add E2E tests covering add-node replication data safety

Testing

Verification:

Checklist

  • Tests added and updated
coderabbitai (Bot) commented May 14, 2026

📝 Walkthrough

This PR adds replication-origin advancement for safely adding new subscriber nodes to multi-node PostgreSQL/Spock clusters. The changes introduce two new resources (PeerCatchupResource and ReplicationOriginAdvanceResource), modify existing resources to output and handle advancement state, integrate them into the node-population orchestration, validate via updated tests, and bump the Spock container image to 5.0.8.

Changes

Replication Origin Advancement for Multi-Node Topologies

Layer / File(s) / Summary

  • PostgreSQL replication origin helpers
    server/internal/postgres/create_db.go
    New functions EnsureReplicationOriginExists, AdvanceReplicationOrigin, and SpockProgressReachedLSN provide replication origin management and peer progress tracking on the provider side.
  • PeerCatchupResource implementation
    server/internal/database/peer_catchup_resource.go
    PeerCatchupResource polls until a peer node's apply progress reaches a sync event LSN using SpockProgressReachedLSN, with resource metadata, identifier helpers, and resource lifecycle methods.
  • ReplicationOriginAdvanceResource implementation
    server/internal/database/replication_origin_advance_resource.go
    ReplicationOriginAdvanceResource advances the replication origin on the subscriber node to the LSN computed by ReplicationSlotAdvanceFromCTSResource, ensuring apply resumes from the correct position.
  • Existing resource output and status handling
    server/internal/database/replication_slot_advance_from_cts_resource.go, server/internal/database/wait_for_sync_event_resource.go
    ReplicationSlotAdvanceFromCTSResource outputs AdvancedToLSN for subscriber consumption and ignores it during diffing. WaitForSyncEventResource treats disabled/down subscription states as transient and retryable.
  • Resource orchestration in PopulateNode
    server/internal/database/operations/populate_nodes.go, server/internal/database/resources.go
    PopulateNode wires PeerCatchupResourceIdentifier into sync dependencies. addPeerResources creates both PeerCatchupResource and ReplicationOriginAdvanceResource. RegisterResourceTypes registers both new resource types.
  • Unit and golden test expectations
    server/internal/database/operations/populate_nodes_test.go, server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
    Test expectations updated with PeerCatchupResource entries, PeerCatchupResourceIdentifier dependencies, and ReplicationOriginAdvanceResource replacing ReplicationSlotAdvanceFromCTSResource in multi-node scenarios.
  • End-to-end replication origin test
    e2e/add_node_data_safety_test.go
    New TestAddNodeOriginAdvanced verifies replication origin advancement after adding a subscriber node by provisioning a 3-node topology, writing data for WAL advancement, and asserting the new subscriber's origin progresses past zero-LSN.
  • Container image version update
    server/internal/orchestrator/swarm/images.go
    Spock container image tag updated to 5.0.8-standard-1 for PgEdge versions 16.13/5, 17.9/5, and 18.3/5.
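
For context on the helpers above: PostgreSQL exposes the built-ins pg_replication_origin_create(name) and pg_replication_origin_advance(name, lsn) for moving an origin, and pg_lsn values compare as 64-bit positions whose textual form is two hex words separated by a slash. Below is a minimal, database-free sketch of the LSN comparison that a SpockProgressReachedLSN-style check implies; parseLSN and lsnReached are hypothetical helper names, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts PostgreSQL's textual pg_lsn form ("16/B374D848") into a
// uint64, where the part before the slash is the high 32 bits.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("malformed LSN %q", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, err
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, err
	}
	return h<<32 | l, nil
}

// lsnReached reports whether remote has caught up to target, the comparison a
// SpockProgressReachedLSN-style check would make after reading
// spock.progress.remote_lsn. Server-side, the same test can be done directly
// with pg_lsn operators, e.g.:
//
//	SELECT remote_lsn >= $1::pg_lsn FROM spock.progress WHERE ...
func lsnReached(remote, target string) (bool, error) {
	r, err := parseLSN(remote)
	if err != nil {
		return false, err
	}
	t, err := parseLSN(target)
	if err != nil {
		return false, err
	}
	return r >= t, nil
}

func main() {
	ok, _ := lsnReached("16/B374D848", "16/B374D000")
	fmt.Println(ok) // remote is past the target
	ok, _ = lsnReached("15/FFFFFFFF", "16/0")
	fmt.Println(ok) // remote still behind: high word dominates
}
```

Once the target position is known, advancing the origin on the subscriber is a single call to the pg_replication_origin_advance built-in with that pg_lsn value.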

🐰 New resources hop into place,
Peers catch up with grace and pace,
Origins advance with care,
Multi-node replication, fair!
Spock 5.0.8 leads the way,
Safer clusters every day. 🌟

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 36.36%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: The PR description covers the main changes and includes a Testing section, but that section is largely empty, listing only 'Verification:' with no actual test commands or steps. Resolution: expand the Testing section with specific commands (e.g., 'go test ./...') and any manual E2E verification steps performed.

✅ Passed checks (3 passed)

  • Title check — ✅ Passed: The title 'feat: harden add-node workflow for data safety' clearly and concisely summarizes the main objective.
  • Linked Issues check — ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Skipped because no linked issues were found for this pull request.


Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


codacy-production (Bot) commented May 14, 2026

Up to standards ✅

🟢 Issues: 1 medium

Results: 1 new issue

  • Complexity: 1 medium

View in Codacy

🟢 Metrics: 36 complexity · 0 duplication

  • Complexity: 36
  • Duplication: 0

View in Codacy


coderabbitai (Bot) left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/database/replication_slot_advance_from_cts_resource.go`:
- Around line 151-153: Reset r.AdvancedToLSN to its zero value at the start of
the Create method to avoid leaking a previous run's LSN when early returns
occur; locate the Create function on the ReplicationSlotAdvanceFromCTSResource
receiver and add a single line that clears r.AdvancedToLSN before any
processing/early-return logic so only a successful path sets r.AdvancedToLSN =
targetLSN.
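
The suggested fix amounts to clearing the output field before any early return, so a failed re-run cannot report a stale value. A minimal sketch of the pattern, with hypothetical names (slotAdvancer stands in for ReplicationSlotAdvanceFromCTSResource):

```go
package main

import (
	"errors"
	"fmt"
)

// slotAdvancer mimics the shape of a resource whose Create publishes an
// output field (AdvancedToLSN). Names are illustrative, not the PR's code.
type slotAdvancer struct {
	AdvancedToLSN string
}

// Create clears the output first, so an early error cannot leak a previous
// run's LSN; only the success path repopulates it.
func (r *slotAdvancer) Create(targetLSN string, fail bool) error {
	r.AdvancedToLSN = "" // reset before any processing or early return
	if fail {
		return errors.New("advance failed")
	}
	r.AdvancedToLSN = targetLSN
	return nil
}

func main() {
	r := &slotAdvancer{}
	_ = r.Create("16/B374D848", false)
	fmt.Println(r.AdvancedToLSN) // 16/B374D848

	// A failed re-run must not report the stale LSN from the first run.
	_ = r.Create("17/0", true)
	fmt.Println(r.AdvancedToLSN == "") // true
}
```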

In `@server/internal/database/wait_for_sync_event_resource.go`:
- Around line 104-107: The transient-status branch sleeps with
time.Sleep(pollInterval), which blocks cancellation; change it to a
context-aware wait by replacing the sleep with a select that waits on
time.After(pollInterval) and ctx.Done() (e.g., select { case
<-time.After(pollInterval): continue; case <-ctx.Done(): return ctx.Err() }),
referencing the branch handling postgres.SubStatusDisabled and
postgres.SubStatusDown and using the existing pollInterval and the function's
ctx to abort promptly on cancellation.

In `@server/internal/orchestrator/swarm/images.go`:
- Line 39: The Docker image tags referencing Spock 5.0.8 are invalid and will
break deployments; update the usages of PgEdgeImage and any other imageTag(cfg,
"...-spock5.0.8-standard-1") calls to either revert to the known-working tags
(e.g., the previous non-spock tag variants) or ensure you build and push the
three images (docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dca34b4e-bbb1-4f28-871f-b276c368fcf5

📥 Commits

Reviewing files that changed from the base of the PR and between 5dc438d and ca0900c.

📒 Files selected for processing (11)
  • e2e/add_node_data_safety_test.go
  • server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
  • server/internal/database/operations/populate_nodes.go
  • server/internal/database/operations/populate_nodes_test.go
  • server/internal/database/peer_catchup_resource.go
  • server/internal/database/replication_origin_advance_resource.go
  • server/internal/database/replication_slot_advance_from_cts_resource.go
  • server/internal/database/resources.go
  • server/internal/database/wait_for_sync_event_resource.go
  • server/internal/orchestrator/swarm/images.go
  • server/internal/postgres/create_db.go

Comment thread: server/internal/orchestrator/swarm/images.go

  versions.addImage(ds.MustPgEdgeVersion("16.13", "5"), &Images{
- 	PgEdgeImage: imageTag(cfg, "16.13-spock5.0.7-standard-1"),
+ 	PgEdgeImage: imageTag(cfg, "16.13-spock5.0.8-standard-1"),
coderabbitai (Bot) commented May 14, 2026
⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify Docker images exist for the new Spock 5.0.8 tags

# Get the image repository host from config or use default
REPO_HOST="${DOCKER_SWARM_IMAGE_REPOSITORY_HOST:-docker.io/pgedge}"

# Check each new image tag
for TAG in "16.13-spock5.0.8-standard-1" "17.9-spock5.0.8-standard-1" "18.3-spock5.0.8-standard-1"; do
  echo "Checking ${REPO_HOST}/pgedge-postgres:${TAG}"
  
  # Try to fetch image manifest
  docker manifest inspect "${REPO_HOST}/pgedge-postgres:${TAG}" > /dev/null 2>&1
  
  if [ $? -eq 0 ]; then
    echo "✓ Image exists: ${TAG}"
  else
    echo "✗ Image NOT found: ${TAG}"
  fi
done

Repository: pgEdge/control-plane

Length of output: 411


Docker images for Spock 5.0.8 do not exist—deployments will fail.

The code references three Docker images that do not exist in the repository:

  • docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1
  • docker.io/pgedge/pgedge-postgres:17.9-spock5.0.8-standard-1
  • docker.io/pgedge/pgedge-postgres:18.3-spock5.0.8-standard-1

Either build and push these images before merging, or revert the tag updates to existing images.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/orchestrator/swarm/images.go` at line 39, The Docker image
tags referencing Spock 5.0.8 are invalid and will break deployments; update the
usages of PgEdgeImage and any other imageTag(cfg, "...-spock5.0.8-standard-1")
calls to either revert to the known-working tags (e.g., the previous non-spock
tag variants) or ensure you build and push the three images
(docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.

tsivaprasad (Contributor, Author) replied:

Images are pending postgres-images PR #20 — will not merge until those are published.

coderabbitai (Bot) replied:

@tsivaprasad, understood — thanks for the context. As long as postgres-images PR #20 is merged and the images are published before this PR lands, the deployment concern is addressed. You may want to add a note or a PR dependency link to help reviewers track that gate.


🧠 Learnings used
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 266
File: server/internal/workflows/update_database.go:122-153
Timestamp: 2026-02-09T21:47:46.904Z
Learning: In server/internal/workflows/update_database.go and related Go workflow files (e.g., provision_services.go), implement idempotent provisioning semantics: re-running ProvisionServices on already-provisioned service instances should be safe and non-destructive. Upsert operations (e.g., StoreServiceInstance) may temporarily set state to 'creating' but should recover back to 'running' via monitoring. Use SQL-level idempotence for user/service roles (e.g., CREATE ROLE IF NOT EXISTS / ALTER ROLE) and ensure deployment diffs compare desired vs. current state. These patterns support recovery when a user retries with corrected specs after a failure, and should be considered in similar Go workflow files across the repository.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config_resource.go:90-94
Timestamp: 2026-03-02T20:15:14.804Z
Learning: In MCP config resources (e.g., server/internal/orchestrator/swarm/mcp_config_resource.go and similar tokens.yaml/users.yaml), Refresh should only verify the existence of config.yaml. Tokens.yaml and users.yaml are owned and managed by the MCP server at runtime and should not be rewritten on Update. Do not trigger recreation that would overwrite runtime state. This ownership model is consistent with the comment indicating Do NOT touch tokens.yaml or users.yaml. Apply this guidance to all MCP config resource files in the swarm directory.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config.go:95-97
Timestamp: 2026-03-02T20:17:36.231Z
Learning: In the pgEdge/control-plane repository, internal (unexported) functions that are called through controlled paths by the resource framework should assume validated input from their invocation boundaries. Do not add defensive nil guards inside such internal helpers; rely on validation at the resource construction/invocation boundaries (e.g., MCPConfigResource.Create/Update) to guarantee data integrity. This guideline applies to all Go files under server/internal where internal helpers serve as implementation details behind controlled APIs.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 287
File: server/internal/orchestrator/swarm/service_user_role.go:59-60
Timestamp: 2026-03-11T16:05:27.239Z
Learning: In server/internal/orchestrator/swarm/service_user_role.go, treat a ResourceVersion bump (e.g., 2 -> 3) as a documentation/safety marker only. When the resource Identifier key changes (e.g., ServiceInstanceID to ServiceID), the system will automatically recreate resources due to the Type/ID key mismatch. Do not flag a ResourceVersion bump as a missing recreate trigger if an Identifier change is already causing a full recreate for that resource. This guidance is file-specific and should be applied when reviewing changes to service_user_role.go.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 293
File: server/internal/orchestrator/common/paths.go:79-157
Timestamp: 2026-03-24T13:24:39.229Z
Learning: In pgEdge/control-plane, when reviewing Go packages under server/ that use the samber/do injector, apply the `Provide()` requirement only to packages that expose singleton components which are consumed as dependencies by other singleton components via the injector. Do NOT require `Provide()` for pure utility/value-type packages (e.g., path helpers, data structures) that are not injected as singleton dependencies.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 312
File: server/internal/workflows/backend/etcd/store.go:52-58
Timestamp: 2026-03-24T19:00:21.971Z
Learning: In pgEdge/control-plane, treat errors returned from application startup/initialization functions as fatal: a startup failure should trigger the normal application shutdown path, which then calls each component’s `Shutdown`/`Stop` for cleanup. Therefore, when reviewing code in startup/initialization paths (e.g., cache/etcd initialization like `StartCaches`), do not require explicit rollback/cleanup of partially-initialized resources inside the startup function based on its error return. Avoid review suggestions that push intermediate rollback patterns for these fatal startup errors, since the existing shutdown sequence is responsible for teardown.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 353
File: server/internal/orchestrator/common/pgbackrest_stanza.go:26-29
Timestamp: 2026-04-20T13:57:55.700Z
Learning: In the pgEdge control-plane codebase, treat `ResourceVersion()` as a cache-busting/diff-forcing mechanism for resources. Increment `ResourceVersion()` only when making a behavioral change that requires consumers to re-fetch/re-evaluate the resource, but where no serialized fields are changed. Do not bump `ResourceVersion()` solely because a field was removed—removing serialized fields should already produce a diff on its own.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 353
File: server/internal/orchestrator/common/pgbackrest_stanza.go:26-29
Timestamp: 2026-04-20T13:57:55.700Z
Learning: In the pgEdge control-plane codebase, avoid using `DiffIgnore()` with JSON paths for fields that have been removed. `DiffIgnore()` suppresses diffs for the specified JSON field, which prevents the resource from being re-evaluated after a field is removed. For removed fields, do not add them to `DiffIgnore()`; instead let the diff occur naturally, relying on the idempotency of Update methods to handle the change safely.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 377
File: server/internal/monitor/provide.go:60-61
Timestamp: 2026-05-05T12:30:06.386Z
Learning: In pgEdge/control-plane, the storage layer (e.g., WatchOp) uses deferred execution: etcd connections are not created in constructors/initializers. When reviewing Go code, do not report “etcd resource leak” for code paths that merely construct candidate/watch/election objects (e.g., via electionSvc.NewCandidate) as long as the relevant execution method (Exec/Start or equivalent) is not called afterward. Only consider it a potential leak if the code actually starts/executes the watcher/candidate and fails to shut it down (if the project has a corresponding stop/close path).

@tsivaprasad tsivaprasad marked this pull request as draft May 14, 2026 13:28