Fix GAgent durable terminal completion#657
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 274df47079
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId)) | ||
| snapshot = await _queryPort.GetBySessionIdAsync(receipt.ActorId, receipt.SessionId, ct); |
There was a problem hiding this comment.
Guard session fallback with the active correlation
When correlation lookup misses, this resolver accepts the first snapshot for actorId + sessionId without checking that the returned snapshot belongs to the current receipt correlation. If the same session has an older terminal document (common when users reuse a session/thread), a transient miss on the new correlation document will cause this command to be marked terminal from stale data. This can produce incorrect completion status for the wrong run; the same fallback pattern is also present in GAgentApprovalDurableCompletionResolver.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in aee3b13. The durable resolvers now only accept correlation or session fallback snapshots when the snapshot matches the active receipt actor/correlation and the expected interaction kind (draft-run vs approval), so reused sessions cannot resolve from stale terminal documents. Added unit coverage for stale correlation, wrong interaction kind, and valid fallback for both draft-run and approval.
Only accept session fallback snapshots when they match the active receipt correlation and interaction kind, preventing reused sessions from resolving with stale terminal state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| LastEventId = eventId, | ||
| SessionId = completed.SessionId, | ||
| CorrelationId = correlationId, | ||
| InteractionKind = (int)ResolveInteractionKind(completed.SessionId, correlationId), |
There was a problem hiding this comment.
[Consensus: 3 models] severity=major, category=bug
ResolveInteractionKind infers Approval whenever sessionId != correlationId, but GAgentDraftRunCommandTargetBinder.ResolveSessionId returns command.SessionId.Trim() when the caller supplies one — so any draft-run with an explicit SessionId distinct from CorrelationId will be written into the readmodel as Approval. This makes InteractionKind unreliable for any downstream consumer (filtering, dashboards, future typed terminal-event mapping).
Fix: carry the interaction kind through the projection scope/context (e.g. add InteractionKind to GAgentRunTerminalProjectionContext from the activation site at GAgentDraftRunInteraction.cs / GAgentApprovalInteraction.cs) and read it from the context here, instead of inferring from session/correlation equality.
Per-model verbatim
- codex:
ResolveInteractionKindinfers approval wheneverSessionId != CorrelationId, but draft-run supports an explicitSessionId, so those draft-run terminal snapshots will be mislabeled asApproval. Carry the interaction kind through the activation context/port or store separate projection kinds instead of deriving it from the session/correlation equality. - gemini: The
ResolveInteractionKindlogic assumesDraftRunalways hasSessionId == CorrelationId. If a user provides a customSessionIdin theGAgentDraftRunCommand, this will incorrectly label the interaction asApproval. While non-critical for completion logic, it makes theInteractionKindfield in the read model potentially misleading. - kimi:
ResolveInteractionKinduses the heuristicsessionId == correlationIdto distinguish DraftRun from Approval. A draft-run with an explicitSessionId(different fromCorrelationId) would be misclassified as Approval. Consider documenting this assumption or deriving the kind from explicit event metadata when a typed terminal event is introduced.
| CancellationToken ct = default); | ||
| } | ||
|
|
||
| public interface IGAgentRunTerminalProjectionPort |
There was a problem hiding this comment.
[Consensus: 2 models] severity=major, category=arch
IGAgentRunTerminalProjectionPort only exposes EnsureProjectionAsync (correlation-scoped) with no corresponding ReleaseProjectionAsync / lease return. Each draft-run / approval command currently activates a new correlation-keyed durable materialization scope (GAgentApprovalInteraction.cs:181, GAgentDraftRunInteraction.cs:236) and never releases it. Since correlation is per-command, these scopes accumulate until Orleans deactivation — defeating the point of the new IProjectionSessionScopedMaterializationContext release-aware design described in the PR summary.
Fix: return a lease (analogous to IGAgentDraftRunProjectionLease) from EnsureProjectionAsync, and have the interaction cleanup path (GAgentApprovalCommandTarget.ReleaseAsync / draft-run equivalent) release it. Alternatively, make this an actor-scoped (not correlation-scoped) materialization.
Per-model verbatim
- codex (GAgentRunTerminalProjectionPort.cs:33): This activates a durable materialization scope keyed by correlation id, then discards the lease and exposes no release path. Because draft-run/approval create a new correlation per command, these projection actors/subscriptions can accumulate without cleanup; return a terminal projection lease from the port and release it from the interaction cleanup path, or make this an actor-scoped materialization instead of a per-correlation scope.
- gemini (GAgentRunTerminalModels.cs:43): The interface is missing a corresponding
ReleaseProjectionAsyncorReleaseIfIdleAsyncmethod. Given the PR summary explicitly mentions making durable materialization 'release session-aware', the port should expose this capability to allow the interaction service to clean up command-scoped materializers. - gemini (GAgentApprovalInteraction.cs:157): The terminal projection is ensured here but never released. Since these materialization actors are now partitioned by
CorrelationId(one per command), they will leak until Orleans deactivates them. TheGAgentApprovalCommandTargetshould hold a reference to theIGAgentRunTerminalProjectionPortand call a (yet to be exposed) release method in itsReleaseAsyncimplementation, similar to how the live_projectionPortis handled.
| { | ||
| var snapshot = await _queryPort.GetByCorrelationIdAsync(receipt.ActorId, receipt.CorrelationId, ct); | ||
| if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId)) | ||
| snapshot = await _queryPort.GetBySessionIdAsync(receipt.ActorId, receipt.SessionId, ct); |
There was a problem hiding this comment.
[codex] severity=major, category=bug
The if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId)) fallback accepts whatever GetBySessionIdAsync returns without checking that the returned snapshot's CorrelationId matches receipt.CorrelationId. If a session/thread is reused (which is the whole point of command.SessionId) and the correlation document is temporarily missing or evicted, this can complete the current command from an older terminal fact — silently delivering stale completion status.
Same pattern exists in GAgentApprovalDurableCompletionResolver (GAgentApprovalInteraction.cs:343-345).
Fix: only accept the session-fallback snapshot when its CorrelationId == receipt.CorrelationId, otherwise return Incomplete:
if (snapshot != null &&
!string.Equals(snapshot.CorrelationId, receipt.CorrelationId, StringComparison.Ordinal))
snapshot = null;Apply in both resolvers.
| @@ -151,6 +159,12 @@ public async Task HandleToolApprovalDecision(ToolApprovalDecisionEvent evt) | |||
| else | |||
There was a problem hiding this comment.
[kimi] severity=major, category=bug
The approval-denied else branch calls PersistApprovalTerminalFailureAsync without a try / catch. If this throws after ClearPendingApprovalEvent is already persisted, the actor event fails and retries; on retry State.PendingApproval is already null so the method returns early, permanently losing the terminal failure fact — exactly what the durable readmodel needs.
The same gap exists in two more call sites in this file:
HandleToolApprovalTimeoutno-remote-handler branch (around line 195)HandleToolApprovalTimeoutremote-failed/denied/timed-out branch (around line 249)
Only the exception path at lines 143-150 wraps it in try { … } catch { /* best effort */ }.
Fix: either wrap each call site in best-effort try/catch (matching the pattern at lines 143-150), or move the try/catch inside PersistApprovalTerminalFailureAsync itself so every caller is safe.
| @@ -0,0 +1,227 @@ | |||
| using Aevatar.AI.Abstractions; | |||
There was a problem hiding this comment.
[mimo-v2.5-pro] severity=major, category=test
Missing test for monotonic version guard. CLAUDE.md projection rules require: "旧 StateVersion 不覆盖新 readmodel". Add a test that sends two committed events with the same actorId + correlationId but a descending stateVersion, then asserts the readmodel retains the higher version (or that the projector / IProjectionWriteDispatcher rejects the stale write).
Without this test, a future change to _writeDispatcher.UpsertAsync semantics (or a parallel write path) could regress the monotonic guarantee without anyone noticing — the rest of the test suite only exercises forward-version cases.
| try { await PersistDomainEventAsync(new ClearPendingApprovalEvent { RequestId = pending.RequestId }); } | ||
| catch { /* best effort */ } | ||
| try | ||
| { |
There was a problem hiding this comment.
[gemini] severity=nit, category=other
This try { … } catch { /* best effort */ } swallows persistence failures silently. If PersistApprovalTerminalFailureAsync fails here, the durable resolver will stay Incomplete indefinitely with no operator signal. Consider logging at warning level inside the catch (include pending.SessionId, pending.RequestId, the reason code, and ex.Message) so this is observable.
Same applies to the symmetric catch the kimi comment proposes adding at the denied / timeout sites — when those wraps are added, give them logging too.
Summary
Fixes #370 by replacing the placeholder durable completion resolvers for GAgent draft-run and approval interactions with a committed-fact -> projection readmodel -> query-port path.
This PR adds a GAgent run terminal readmodel and durable materialization scope that observes committed
RoleChatSessionCompletedEventfacts, then maps the resulting terminal snapshot back into draft-run / approval durable completion. It also fixes the approval live observation session key soCommandId != CorrelationIdstill receives the continuation terminal event.What changed
GAgentRunTerminalSnapshot,GAgentRunTerminalStatus,GAgentRunTerminalInteractionKind,IGAgentRunTerminalQueryPort, andIGAgentRunTerminalProjectionPort.GAgentRunTerminalReadModel, metadata provider, projector, query reader, and durable projection activation port.actorId + correlationId, then fallback toactorId + sessionIdwhere applicable.SessionIdfor fallback lookup.IProjectionSessionScopedMaterializationContext, so session/correlation-scoped terminal materialization can be released with the same key used for activation.docs/history/2026-05/documenting the architecture and follow-up decisions.Explicitly deferred
Two review points are intentionally not included in this PR:
Replacing legacy approval / LLM failure markers with a typed terminal event
RoleChatSessionCompletedEvent.Contentmarkers as a legacy fallback only.status/reason_code/reason_messageproto fields would change the committed event contract and should be designed/migrated separately.Changing
ReasonCodefrom string to enumReasonCodeis currently used for durable readmodel reason recording and completion mapping, not as a stable cross-module control-flow or filtering contract.Validation
Previously run on this branch:
Also started the GAgent service locally with in-memory projection/document/graph providers and auth disabled, then exercised related scope service endpoints. The RoleGAgent draft-run failure path committed
RoleChatSessionCompletedEventand wroteGAgentRunTerminalReadModelsuccessfully.Notes
Full build succeeds with existing warnings from NuGet source configuration, vulnerability advisories, obsolete APIs, complexity, and nullability warnings.