Reconciliation plan#13830
Conversation
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
String() is designed to make it easy to compare coomputed plan vs expectations Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
- Remove the `convergence` struct and `newConvergence` constructor - Extract `resolveServiceReferences` as a standalone function taking `map[string]Containers` instead of a method on convergence - Add `getContainersByService` helper on composeService - Update run.go and executor.go to use the new standalone function - Remove dead code: `getObservedState`, `setObservedState` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
…remove - Fix ContainerReplaceLabel detection: use op.Inherited != nil (not op.Container) as signal for recreate in execCreateContainer - Use observed network name (not desired) for DisconnectNetwork and RemoveNetwork operations, in case the name changed - Use observed volume name (not desired) for RemoveVolume operations - Update reconciliation.md with 3 new lessons learned (7.8, 7.9, 7.10) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
- plan.go: pin OperationType constants to explicit values so adding an op in the middle doesn't shift the others. - executor.go: remove the meaningless `var _ = getContainerProgressName` line — same-package functions are always accessible. - reconcile.go: fix the stale switch-default comment that contradicted the case clause above it. - reconcile.go: drop the local `serviceLabel` const that shadowed `api.ServiceLabel` and use the shared constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
ensureProjectVolumes already prompts the user when a volume's config hash diverges from the compose file (create.go:1626) and recreates it on confirm. The reconciler ran after ensureProjectVolumes and prompted again with the exact same message — so a user who declined the first prompt was asked the same question a second time. Drop the prompt + recreate call from reconcileVolumes(). Recreation of diverged volumes stays owned by ensureProjectVolumes; the reconciler only plans the creation of missing volumes. If the user declined recreation, the existing container's mounts still match the existing volume name and hasVolumeMismatch correctly returns false, so containers are not falsely flagged as obsolete. Keep the supporting infrastructure available for future use, when divergence detection migrates fully into the reconciler: - reconciler.prompt field - prompt parameter on reconcile() - planRecreateVolume function (//nolint:unused) - servicesUsingVolume function (//nolint:unused) - noPrompt test helper The reframed test (TestReconcileVolumes_DivergedIsIgnored) asserts the new contract: a diverged volume produces no plan operations from the reconciler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Two related changes to the executor, plus the small cleanups they attracted in review: * Rename node consults a planner-set CreateNodeID instead of walking ancestors. The old execRenameContainer searched node.DependsOn for a CreateContainer result and fell back to a recursive walk through the group chain. That worked only by convention; a future op with cross- node data needs would have to rediscover or copy the pattern. Now the rename op carries an explicit CreateNodeID int set at plan time and the executor reads pctx[CreateNodeID].ContainerID directly. The recursive findCreatedIDInChain is gone. * Stop re-listing containers on every create. execCreateContainer used to call getContainersByService(ctx, projectName) — a fresh ContainerList per create — to resolve service references at execute time. The executor now holds a live containersByService view seeded from ObservedState (via observed.containersByService()) and grown as OpCreateContainer nodes complete, so service references resolve from memory. On OpRemoveContainer the removed container is dropped from the view via slices.DeleteFunc, so a dependent's create that resolves network_mode: service:x against the just-removed container cannot pick up a stale ID (Containers.sorted() orders by canonical name and would otherwise return the removed container). * Defensive slices.Clone of op.Service.VolumesFrom in execCreateContainer. resolveServiceReferences mutates VolumesFrom in place, and the shallow struct copy of *op.Service still shares the backing array. Single-execution-per-node makes it safe today, but the clone removes the trap for any future parallel-execution mode. * Operation gains a CreateNodeID int (not a *PlanNode pointer) to avoid a structural cycle between Operation and PlanNode. OperationType values are pinned to explicit integers so adding an op in the middle cannot silently shift the others. * execRenameContainer carries two checks so a missing CreateNodeID and an empty produced ID are distinguishable in logs. Both are programmer invariants (prefixed "internal:"). * containersByServiceFromObserved moved from a package-level helper to a method on *ObservedState. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Three improvements identified in the Principal Engineer pass but deliberately deferred: 1. Test fidelity. Split executePlan into newPlanExecutor (constructs the executor seeded from observed state) and (*planExecutor).run (walks the DAG). Production callers go through executePlan unchanged. TestExecutePlanRemoveContainerDropsFromCache now uses newPlanExecutor + run, exercising the same errgroup, done-channel and group-tracker wiring as production instead of a hand-rolled loop over executeNode. 2. //nolint:unused chain. The three preserved helpers (reconciler.prompt, planRecreateVolume, servicesUsingVolume) each carried a separate "kept for future" comment. Consolidate the rationale on the reconciler.prompt field doc and point the helper nolint directives there, so a future cleanup is a single grep. 3. Concurrency test. Add TestExecutePlanConcurrentRemovesCacheCoherence which builds N independent Stop→Remove chains in one plan; the errgroup fans them out across goroutines that all hit containersByService under the mutex. Passes under -race. Failure would expose a missing or incorrect lock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
executor.go had grown to ~480 lines mixing three concerns: - DAG orchestration (run, executeNode, planExecutor struct) - per-OpType implementations (execCreateNetwork, …, execRenameContainer) - event tracking (groupTracker + per-OpType emit helpers) Split into three files of ~170 lines each, one concern per file: - executor.go — planExecutor, reconciliationContext, run, executeNode dispatch - executor_ops.go — all execXxx methods - executor_events.go — groupTracker + emitStartEvent / emitDoneEvent / emitErrorEvent Pure refactor: no functional change. Tests (incl. -race) and lint both pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Three fixes surfaced by an external code review of the new reconciler: 1. Scale-down now propagates through serviceNodes. When a service is scaled down (all containers in excess), reconcileService used to continue without assigning lastNode, leaving r.serviceNodes[svc] unset. Dependent services then declared no edge on the scale-down ops and could start before the cleanup finished. Track the RemoveContainer node as lastNode so depends_on chains pick it up. 2. mustRecreate errors are no longer silently ignored by sortContainers. The comparator used `obsi, _ := r.mustRecreate(...)`, falling back to false on any hashing error. Pre-compute obsolescence into a map keyed by container ID before sorting and propagate the error to reconcileService. 3. A container is no longer Stopped twice when its network and its config both diverge. planRecreateNetwork already stops the affected container as part of the disconnect/remove/recreate dance; the subsequent planRecreateContainer (triggered via hasNetworkMismatch) used to add another OpStopContainer against the now-stopped target. Track stops in r.stoppedByPlan; planRecreateContainer reuses an existing Stop node when present, and chains its Remove on both that Stop and the replacement Create. Two golden tests (TestReconcileNetworks_Diverged*) are updated to reflect the new, dedupe'd plan shape (one Stop instead of two per recreated container). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Two coverage gaps surfaced by an external review: - TestReconcileContainers_DependsOnChain: asserts that a service B with depends_on: [A] produces a CreateContainer for B that depends on A's last plan node (the serviceNodes mechanism in infrastructureDeps). This was the only depends_on-via-plan-DAG behavior untested before. - TestReconcileContainers_DependsOnScaleDown: companion test that exercises the scale-down → dependent path specifically, verifying that the previous commit's lastNode-on-scale-down fix actually wires the dependency through. - TestOperationTypeString: adds OpRunProvider to the table; all other OperationType values were already covered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
External reviewer noted that the alreadyStopped branch adds createNode to removeDeps while the !alreadyStopped branch does not — semantically correct but fragile, since it relies on the implicit invariant that stopNode.DependsOn contains createNode in the !alreadyStopped path. Spell out the invariant in a comment so a future maintainer who edits the stop → create edge in the normal path knows they must also add createNode unconditionally in the remove deps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new “reconciliation plan” architecture for Compose create/up-like behavior by snapshotting the current project state, computing a DAG plan of atomic operations (networks/volumes/containers/providers), and executing that plan via a dedicated executor. This is positioned as an alternative approach to PR #13641 to better guide a significant refactor and enable more deterministic, focused tests.
Changes:
- Added a reconciler that compares desired vs observed state and produces a deterministic operation DAG (
Plan). - Added a plan executor that runs the DAG with dependency ordering, concurrency, and grouped progress events (e.g., recreate).
- Updated create/run flows to use observed-state collection and service-reference resolution without the previous convergence implementation; added extensive unit tests for plan/reconcile/executor behavior.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/compose/run.go | Switches run service-reference resolution to use a containers-by-service snapshot helper. |
| pkg/compose/reconcile.go | Implements the reconciler that builds a plan (DAG) from desired vs observed state. |
| pkg/compose/reconcile_test.go | Adds unit tests covering network/volume/container/orphan reconciliation scenarios and deterministic plan output. |
| pkg/compose/plan.go | Introduces Plan/PlanNode and Operation types (the reconciliation DAG model). |
| pkg/compose/plan_test.go | Adds tests for operation stringification and plan rendering/dependency formatting. |
| pkg/compose/observed_state.go | Adds ObservedState snapshot collection and helpers used by reconciliation/execution. |
| pkg/compose/observed_state_test.go | Adds tests validating observed-state extraction/classification from mocked API responses. |
| pkg/compose/executor.go | Introduces the DAG executor that schedules node execution and coordinates group events. |
| pkg/compose/executor_test.go | Adds executor tests, including cache-coherence under concurrent remove operations. |
| pkg/compose/executor_ops.go | Implements concrete Docker API calls for each operation type and maintains the live service-reference cache. |
| pkg/compose/executor_events.go | Adds group-based progress event aggregation (e.g., “Recreate” Working/Done). |
| pkg/compose/create.go | Replaces convergence-based apply with observed snapshot + reconcile(plan) + executePlan, and preserves “running” events. |
| pkg/compose/convergence.go | Removes the convergence orchestration and refactors service-reference resolution into standalone helpers. |
| pkg/compose/containers.go | Adds getContainersByService helper; removes an unused Containers.names() helper. |
| err := exec.executeNode(ctx, node) | ||
|
|
||
| if err == nil { | ||
| // Emit group done event if this is the last node of a group | ||
| groups.onNodeDone(node, events) | ||
| } else if ctx.Err() == nil { | ||
| groups.onNodeError(node, events, err) | ||
| } | ||
|
|
||
| close(done[node.ID]) | ||
| return err | ||
| }) |
|
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs. |
Branch protection rule check failed
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
What I did
this is an alternative for #13641
based on previous work, I first generated an analysis an plan document, so AI work on this significant refactoring is better guided and impact is well understood
(not mandatory) A picture of a cute animal, if possible in relation to what you did