Skip to content

Fix GAgent durable terminal completion#657

Open
louis4li wants to merge 3 commits into
devfrom
fix/2026-05-14_gagent-durable-terminal-completion
Open

Fix GAgent durable terminal completion#657
louis4li wants to merge 3 commits into
devfrom
fix/2026-05-14_gagent-durable-terminal-completion

Conversation

@louis4li
Copy link
Copy Markdown
Contributor

Summary

Fixes #370 by replacing the placeholder durable completion resolvers for GAgent draft-run and approval interactions with a committed-fact -> projection readmodel -> query-port path.

This PR adds a GAgent run terminal readmodel and durable materialization scope that observes committed RoleChatSessionCompletedEvent facts, then maps the resulting terminal snapshot back into draft-run / approval durable completion. It also fixes the approval live observation session key so CommandId != CorrelationId still receives the continuation terminal event.

What changed

  • Added GAgentRunTerminalSnapshot, GAgentRunTerminalStatus, GAgentRunTerminalInteractionKind, IGAgentRunTerminalQueryPort, and IGAgentRunTerminalProjectionPort.
  • Added GAgentRunTerminalReadModel, metadata provider, projector, query reader, and durable projection activation port.
  • Activated terminal durable materialization before draft-run / approval dispatch.
  • Updated draft-run and approval durable resolvers to query by actorId + correlationId, then fallback to actorId + sessionId where applicable.
  • Extended approval receipts with SessionId for fallback lookup.
  • Persisted approval failure terminal facts for denied / timeout / remote failure paths so missed-live-event recovery has a committed source.
  • Made durable materialization release session-aware through IProjectionSessionScopedMaterializationContext, so session/correlation-scoped terminal materialization can be released with the same key used for activation.
  • Added design notes under docs/history/2026-05/ documenting the architecture and follow-up decisions.

Explicitly deferred

Two review points are intentionally not included in this PR:

  1. Replacing legacy approval / LLM failure markers with a typed terminal event

    • Current implementation treats RoleChatSessionCompletedEvent.Content markers as a legacy fallback only.
    • A proper typed terminal event or typed status/reason_code/reason_message proto fields would change the committed event contract and should be designed/migrated separately.
    • The PR documents that live-only terminal events and marker strings are not the long-term durable protocol.
  2. Changing ReasonCode from string to enum

    • ReasonCode is currently used for durable readmodel reason recording and completion mapping, not as a stable cross-module control-flow or filtering contract.
    • Keeping it as string avoids freezing an enum while legacy marker compatibility still exists.
    • If terminal reason later becomes API/SDK/filter/control-flow surface, it should move together with the typed terminal event work to proto enum / typed code.

Validation

Previously run on this branch:

dotnet test test/Aevatar.CQRS.Projection.Core.Tests/Aevatar.CQRS.Projection.Core.Tests.csproj --nologo --filter "ProjectionRuntimeRegistrationTests"
dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --nologo --filter "GAgentDraftRunInteraction|GAgentApprovalInteraction|GAgentRunTerminal"
dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "ScopeServiceEndpointsStreamTests|GAgentServiceHostingServiceCollectionExtensionsTests"
dotnet test test/Aevatar.AI.Tests/Aevatar.AI.Tests.csproj --nologo --filter "RoleGAgentStateCoverageTests"
bash tools/ci/test_stability_guards.sh
bash tools/ci/query_projection_priming_guard.sh
bash tools/ci/projection_state_version_guard.sh
bash tools/ci/projection_state_mirror_current_state_guard.sh
bash tools/ci/projection_route_mapping_guard.sh
git diff --check
dotnet build aevatar.slnx --nologo

Also started the GAgent service locally with in-memory projection/document/graph providers and auth disabled, then exercised related scope service endpoints. The RoleGAgent draft-run failure path committed RoleChatSessionCompletedEvent and wrote GAgentRunTerminalReadModel successfully.

Notes

Full build succeeds with existing warnings from NuGet source configuration, vulnerability advisories, obsolete APIs, complexity, and nullability warnings.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 76.68539% with 83 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.14%. Comparing base (e77e3e8) to head (aee3b13).

Files with missing lines Patch % Lines
...n/Orchestration/GAgentRunTerminalProjectionPort.cs 0.00% 17 Missing ⚠️
...rojection/Projectors/GAgentRunTerminalProjector.cs 80.23% 7 Missing and 10 partials ⚠️
...lication/ScopeGAgents/GAgentApprovalInteraction.cs 76.08% 7 Missing and 4 partials ⚠️
...lication/ScopeGAgents/GAgentDraftRunInteraction.cs 77.08% 7 Missing and 4 partials ⚠️
...Projection/Queries/GAgentRunTerminalQueryReader.cs 86.36% 4 Missing and 5 partials ⚠️
...DependencyInjection/ServiceCollectionExtensions.cs 57.14% 6 Missing ⚠️
src/Aevatar.AI.Core/RoleGAgent.cs 88.09% 2 Missing and 3 partials ⚠️
...stractions/ScopeGAgents/GAgentRunTerminalModels.cs 63.63% 4 Missing ⚠️
.../ReadModels/ServiceProjectionReadModels.Partial.cs 50.00% 2 Missing ⚠️
...ion/Contexts/GAgentRunTerminalProjectionContext.cs 75.00% 1 Missing ⚠️
@@            Coverage Diff             @@
##              dev     #657      +/-   ##
==========================================
+ Coverage   72.13%   72.14%   +0.01%     
==========================================
  Files        1261     1267       +6     
  Lines       91065    91405     +340     
  Branches    11924    11964      +40     
==========================================
+ Hits        65690    65945     +255     
- Misses      20675    20733      +58     
- Partials     4700     4727      +27     
Flag Coverage Δ
ci 72.14% <76.68%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...on/ProjectionMaterializationRuntimeRegistration.cs 100.00% <100.00%> (ø)
....Abstractions/ScopeGAgents/GAgentDraftRunModels.cs 92.50% <100.00%> (+0.39%) ⬆️
...DependencyInjection/ServiceCollectionExtensions.cs 96.95% <100.00%> (+0.05%) ⬆️
...data/GAgentRunTerminalReadModelMetadataProvider.cs 100.00% <100.00%> (ø)
...ion/Contexts/GAgentRunTerminalProjectionContext.cs 75.00% <75.00%> (ø)
.../ReadModels/ServiceProjectionReadModels.Partial.cs 97.18% <50.00%> (-2.82%) ⬇️
...stractions/ScopeGAgents/GAgentRunTerminalModels.cs 63.63% <63.63%> (ø)
src/Aevatar.AI.Core/RoleGAgent.cs 81.10% <88.09%> (+0.36%) ⬆️
...DependencyInjection/ServiceCollectionExtensions.cs 67.78% <57.14%> (-1.11%) ⬇️
...Projection/Queries/GAgentRunTerminalQueryReader.cs 86.36% <86.36%> (ø)
... and 4 more

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@louis4li louis4li marked this pull request as ready for review May 15, 2026 03:11
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 274df47079

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +455 to +456
if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId))
snapshot = await _queryPort.GetBySessionIdAsync(receipt.ActorId, receipt.SessionId, ct);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard session fallback with the active correlation

When correlation lookup misses, this resolver accepts the first snapshot for actorId + sessionId without checking that the returned snapshot belongs to the current receipt correlation. If the same session has an older terminal document (common when users reuse a session/thread), a transient miss on the new correlation document will cause this command to be marked terminal from stale data. This can produce incorrect completion status for the wrong run; the same fallback pattern is also present in GAgentApprovalDurableCompletionResolver.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in aee3b13. The durable resolvers now only accept correlation or session fallback snapshots when the snapshot matches the active receipt actor/correlation and the expected interaction kind (draft-run vs approval), so reused sessions cannot resolve from stale terminal documents. Added unit coverage for stale correlation, wrong interaction kind, and valid fallback for both draft-run and approval.

Only accept session fallback snapshots when they match the active receipt correlation and interaction kind, preventing reused sessions from resolving with stale terminal state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LastEventId = eventId,
SessionId = completed.SessionId,
CorrelationId = correlationId,
InteractionKind = (int)ResolveInteractionKind(completed.SessionId, correlationId),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Consensus: 3 models] severity=major, category=bug

ResolveInteractionKind infers Approval whenever sessionId != correlationId, but GAgentDraftRunCommandTargetBinder.ResolveSessionId returns command.SessionId.Trim() when the caller supplies one — so any draft-run with an explicit SessionId distinct from CorrelationId will be written into the readmodel as Approval. This makes InteractionKind unreliable for any downstream consumer (filtering, dashboards, future typed terminal-event mapping).

Fix: carry the interaction kind through the projection scope/context (e.g. add InteractionKind to GAgentRunTerminalProjectionContext from the activation site at GAgentDraftRunInteraction.cs / GAgentApprovalInteraction.cs) and read it from the context here, instead of inferring from session/correlation equality.

Per-model verbatim
  • codex: ResolveInteractionKind infers approval whenever SessionId != CorrelationId, but draft-run supports an explicit SessionId, so those draft-run terminal snapshots will be mislabeled as Approval. Carry the interaction kind through the activation context/port or store separate projection kinds instead of deriving it from the session/correlation equality.
  • gemini: The ResolveInteractionKind logic assumes DraftRun always has SessionId == CorrelationId. If a user provides a custom SessionId in the GAgentDraftRunCommand, this will incorrectly label the interaction as Approval. While non-critical for completion logic, it makes the InteractionKind field in the read model potentially misleading.
  • kimi: ResolveInteractionKind uses the heuristic sessionId == correlationId to distinguish DraftRun from Approval. A draft-run with an explicit SessionId (different from CorrelationId) would be misclassified as Approval. Consider documenting this assumption or deriving the kind from explicit event metadata when a typed terminal event is introduced.

CancellationToken ct = default);
}

public interface IGAgentRunTerminalProjectionPort
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Consensus: 2 models] severity=major, category=arch

IGAgentRunTerminalProjectionPort only exposes EnsureProjectionAsync (correlation-scoped) with no corresponding ReleaseProjectionAsync / lease return. Each draft-run / approval command currently activates a new correlation-keyed durable materialization scope (GAgentApprovalInteraction.cs:181, GAgentDraftRunInteraction.cs:236) and never releases it. Since correlation is per-command, these scopes accumulate until Orleans deactivation — defeating the point of the new IProjectionSessionScopedMaterializationContext release-aware design described in the PR summary.

Fix: return a lease (analogous to IGAgentDraftRunProjectionLease) from EnsureProjectionAsync, and have the interaction cleanup path (GAgentApprovalCommandTarget.ReleaseAsync / draft-run equivalent) release it. Alternatively, make this an actor-scoped (not correlation-scoped) materialization.

Per-model verbatim
  • codex (GAgentRunTerminalProjectionPort.cs:33): This activates a durable materialization scope keyed by correlation id, then discards the lease and exposes no release path. Because draft-run/approval create a new correlation per command, these projection actors/subscriptions can accumulate without cleanup; return a terminal projection lease from the port and release it from the interaction cleanup path, or make this an actor-scoped materialization instead of a per-correlation scope.
  • gemini (GAgentRunTerminalModels.cs:43): The interface is missing a corresponding ReleaseProjectionAsync or ReleaseIfIdleAsync method. Given the PR summary explicitly mentions making durable materialization 'release session-aware', the port should expose this capability to allow the interaction service to clean up command-scoped materializers.
  • gemini (GAgentApprovalInteraction.cs:157): The terminal projection is ensured here but never released. Since these materialization actors are now partitioned by CorrelationId (one per command), they will leak until Orleans deactivates them. The GAgentApprovalCommandTarget should hold a reference to the IGAgentRunTerminalProjectionPort and call a (yet to be exposed) release method in its ReleaseAsync implementation, similar to how the live _projectionPort is handled.

{
var snapshot = await _queryPort.GetByCorrelationIdAsync(receipt.ActorId, receipt.CorrelationId, ct);
if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId))
snapshot = await _queryPort.GetBySessionIdAsync(receipt.ActorId, receipt.SessionId, ct);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[codex] severity=major, category=bug

The if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId)) fallback accepts whatever GetBySessionIdAsync returns without checking that the returned snapshot's CorrelationId matches receipt.CorrelationId. If a session/thread is reused (which is the whole point of command.SessionId) and the correlation document is temporarily missing or evicted, this can complete the current command from an older terminal fact — silently delivering stale completion status.

Same pattern exists in GAgentApprovalDurableCompletionResolver (GAgentApprovalInteraction.cs:343-345).

Fix: only accept the session-fallback snapshot when its CorrelationId == receipt.CorrelationId, otherwise return Incomplete:

if (snapshot != null &&
    !string.Equals(snapshot.CorrelationId, receipt.CorrelationId, StringComparison.Ordinal))
    snapshot = null;

Apply in both resolvers.

@@ -151,6 +159,12 @@ public async Task HandleToolApprovalDecision(ToolApprovalDecisionEvent evt)
else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[kimi] severity=major, category=bug

The approval-denied else branch calls PersistApprovalTerminalFailureAsync without a try / catch. If this throws after ClearPendingApprovalEvent is already persisted, the actor event fails and retries; on retry State.PendingApproval is already null so the method returns early, permanently losing the terminal failure fact — exactly what the durable readmodel needs.

The same gap exists in two more call sites in this file:

  • HandleToolApprovalTimeout no-remote-handler branch (around line 195)
  • HandleToolApprovalTimeout remote-failed/denied/timed-out branch (around line 249)

Only the exception path at lines 143-150 wraps it in try { … } catch { /* best effort */ }.

Fix: either wrap each call site in best-effort try/catch (matching the pattern at lines 143-150), or move the try/catch inside PersistApprovalTerminalFailureAsync itself so every caller is safe.

@@ -0,0 +1,227 @@
using Aevatar.AI.Abstractions;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[mimo-v2.5-pro] severity=major, category=test

Missing test for monotonic version guard. CLAUDE.md projection rules require: "旧 StateVersion 不覆盖新 readmodel". Add a test that sends two committed events with the same actorId + correlationId but a descending stateVersion, then asserts the readmodel retains the higher version (or that the projector / IProjectionWriteDispatcher rejects the stale write).

Without this test, a future change to _writeDispatcher.UpsertAsync semantics (or a parallel write path) could regress the monotonic guarantee without anyone noticing — the rest of the test suite only exercises forward-version cases.

try { await PersistDomainEventAsync(new ClearPendingApprovalEvent { RequestId = pending.RequestId }); }
catch { /* best effort */ }
try
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[gemini] severity=nit, category=other

This try { … } catch { /* best effort */ } swallows persistence failures silently. If PersistApprovalTerminalFailureAsync fails here, the durable resolver will stay Incomplete indefinitely with no operator signal. Consider logging at warning level inside the catch (include pending.SessionId, pending.RequestId, the reason code, and ex.Message) so this is observable.

Same applies to the symmetric catch the kimi comment proposes adding at the denied / timeout sites — when those wraps are added, give them logging too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement durable terminal completion resolution for draft-run and approval interactions

2 participants