Virtual Context bug on parent id and FAIL checkpoint emitting #364

Merged: yaythomas merged 2 commits into main from refactor/virtual-context on Apr 30, 2026

yaythomas commented Apr 29, 2026

Description of changes:

Fixes two functional bugs in virtual child contexts and refactors the supporting mechanism.

#363: run_in_child_context(is_virtual=True) sends an invalid parent_id to the backend. On main, inner operations inside a virtual run_in_child_context stamp the branch's own operation id as their parent_id. The branch itself writes no START/SUCCEED under is_virtual=True, so that id does not exist in the execution history. The backend correctly rejects the checkpoint with InvalidParameterValueException(CHECKPOINT_INVALID_PARENT_OPERATION_ID), and the user's execution fails at the first inner checkpoint. Nesting virtual scopes (and FLAT map/parallel inside a virtual scope) compounds the failure. After this change, inner operations stamp the enclosing non-virtual ancestor, which is a real checkpointed parent.
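
The failure mode in #363 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyHistory`, `InvalidParentError`, and `run_virtual_branch` are hypothetical stand-ins, not SDK or backend code, and the id format is invented for illustration. It shows why stamping an unrecorded branch id as parent is rejected, while stamping the enclosing recorded operation is accepted.

```python
from dataclasses import dataclass, field


class InvalidParentError(Exception):
    """Hypothetical stand-in for the backend's invalid-parent rejection."""


@dataclass
class ToyHistory:
    """Hypothetical execution history that validates parent ids on write."""

    known_ids: set = field(default_factory=lambda: {"root"})

    def checkpoint(self, op_id: str, parent_id: str) -> None:
        # Reject any checkpoint whose parent was never recorded.
        if parent_id not in self.known_ids:
            raise InvalidParentError(f"unknown parent {parent_id!r} for {op_id!r}")
        self.known_ids.add(op_id)


def run_virtual_branch(
    history: ToyHistory,
    branch_id: str,
    enclosing_id: str,
    stamp_branch_as_parent: bool,
) -> None:
    # A virtual branch writes no START/SUCCEED, so branch_id is never recorded.
    parent_for_inner = branch_id if stamp_branch_as_parent else enclosing_id
    history.checkpoint(op_id=f"{branch_id}-step-1", parent_id=parent_for_inner)
```

With `stamp_branch_as_parent=True` (the behaviour described for main) the first inner checkpoint raises; with `False` (the behaviour after this change) it is recorded under the enclosing operation.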

#362: virtual branches wrote a FAIL checkpoint on exceptions. On main, an exception inside a virtual branch caused ChildOperationExecutor.execute to unconditionally write a FAIL checkpoint for the branch, producing a phantom FAILED CONTEXT entry in the execution history with no matching START. On the FLAT map/parallel path this was cosmetic (the FAIL's parent id was the valid map/parallel op, and the backend accepts a CONTEXT FAIL without a prior START); on the run_in_child_context(is_virtual=True) path it compounded #363 (the FAIL's parent id was the same orphan branch id, which the backend rejects). Either way it is incorrect: virtual branches should emit no lifecycle entries at all. After this change, virtual branches emit no lifecycle checkpoints; failures still propagate to the concurrency executor's BatchResult (or to the caller of run_in_child_context), and completion-tolerance logic still applies.
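
The intended lifecycle rule can be sketched as follows. This is a simplified illustration, not the SDK's ChildOperationExecutor: `execute_child` and the `log` list (standing in for the checkpoint stream) are hypothetical. The point is that a virtual branch records nothing, yet the exception still reaches the caller.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def execute_child(
    func: Callable[[], T],
    op_id: str,
    is_virtual: bool,
    log: List[Tuple[str, str]],
) -> T:
    """Hypothetical child executor; `log` plays the role of the checkpoint stream."""
    if not is_virtual:
        log.append(("START", op_id))
    try:
        result = func()
    except Exception:
        # A real branch records FAIL; a virtual branch records nothing.
        if not is_virtual:
            log.append(("FAIL", op_id))
        raise  # the failure still propagates to the caller in both modes
    if not is_virtual:
        log.append(("SUCCEED", op_id))
    return result
```

Under this sketch, a failing virtual branch leaves the log empty, while a failing real branch leaves a matched START/FAIL pair.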

Mechanism changes (applied alongside the fixes):

  • Single source of truth for the virtual-vs-real decision. create_child_context computes two fields (_parent_id, _step_id_prefix) at construction; no per-operation-method knowledge is required. New operations just read self._parent_id and work correctly under both modes.
  • Field names match their roles. _parent_id is "the id my inner operations stamp as their parent"; _step_id_prefix is "how I prefix step ids". Each field has one job.
  • is_virtual is encapsulated in the context as a cached property. Callers opt in with create_child_context(..., is_virtual=True); the property makes the state inspectable.
  • Nested virtual-in-virtual matches the JS reference. A virtual child of a virtual parent inherits its parent's reporting ancestor, so chained FLAT layers collapse to the outermost non-virtual context without dangling parent-id references.
  • ChildConfig.is_virtual drives lifecycle-checkpoint suppression in ChildOperationExecutor (START, SUCCEED, FAIL) and, via run_in_child_context, the child context's own virtual-ness. The field remains public, matching the JS SDK's ChildConfig.virtualContext.
  • Fewer parameters threaded through. operation_identifier is gone from ConcurrentExecutor, MapExecutor, and ParallelExecutor constructors and from_items/from_callables; the concurrency layer no longer needs an OperationIdentifier to figure out the reporting parent.
  • Tests exercise the invariants directly with a real DurableContext and assert wire-format decisions, including nested-virtual-in-virtual coverage.
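
The construction-time decision described in the list above might look roughly like this. It is a minimal sketch, not the SDK's DurableContext: `ToyContext` and `stamp` are hypothetical, `is_virtual` is a plain field rather than a cached property, and the assumption that a virtual child still prefixes step ids with its own operation id is mine, not confirmed by the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyContext:
    """Hypothetical stand-in for a durable context; not the SDK's class."""

    _parent_id: Optional[str]  # id that inner operations stamp as their parent
    _step_id_prefix: str       # prefix for step ids created in this context
    is_virtual: bool = False

    def create_child_context(
        self, operation_id: str, is_virtual: bool = False
    ) -> "ToyContext":
        # Virtual child: inherit the parent's reporting ancestor.
        # Real child: its own operation id becomes the reporting parent.
        parent_id = self._parent_id if is_virtual else operation_id
        return ToyContext(parent_id, operation_id, is_virtual)

    def stamp(self, n: int) -> dict:
        """Build the ids for the n-th inner operation (illustrative format)."""
        return {"id": f"{self._step_id_prefix}-{n}", "parent_id": self._parent_id}


root = ToyContext(_parent_id=None, _step_id_prefix="root")
real = root.create_child_context("op-1")
outer_virtual = real.create_child_context("op-1-1", is_virtual=True)
inner_virtual = outer_virtual.create_child_context("op-1-1-1", is_virtual=True)
```

Here `inner_virtual.stamp(1)` reports `op-1` as its parent, so the chained virtual layers collapse to the outermost non-virtual context, and no operation method needs to know which mode it is running in.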

Impact:

  • Users can call ctx.run_in_child_context(config=ChildConfig(is_virtual=True)) without their execution being rejected at checkpoint time.
  • FLAT map/parallel no longer emits phantom FAILED CONTEXT entries on branch failure.
  • Python SDK virtual-context semantics now match the JS reference SDK exactly (same parent-id propagation rule; nested virtuals collapse identically).
  • Cost: eliminates billable FAILED CONTEXT entries that were previously emitted for failed virtual branches.

Also includes a minor CONTRIBUTING.md improvement (hatch venv symlink tip for editors that mangle paths with spaces).

Issue #, if available:

Fixes #362, Fixes #363

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Virtual child contexts (FLAT-mode map/parallel branches) no longer
write FAIL checkpoints when the user function raises. The branch is
a logical scope only; it does not appear in the execution history
regardless of outcome, aligning with the JS reference SDK.

Also fixes an incoherent state when a user set
ChildConfig.is_virtual=True via run_in_child_context: lifecycle
checkpoints were suppressed, but the child context's _parent_id was
the child's own operation id (never announced in the checkpoint
stream), so inner operations stamped a parent_id pointing to a
dangling reference. Nesting produced a chain of such references.
The two decisions (lifecycle suppression, parent-id propagation) are
now coupled through ChildConfig.is_virtual.

Improves observability (no phantom FAILED CONTEXT entries for
virtual branches), cost (no billable operation per failed virtual
branch), and cross-SDK wire parity.

fixes #362, fixes #363

Kiro and VS Code mangle the hatch interpreter path when it contains
spaces, breaking "Select Interpreter". Document the `.venv` symlink
workaround and split the existing VS Code section into Interpreter
and Linting subsections.
Comment thread src/aws_durable_execution_sdk_python/config.py
yaythomas merged commit 72a2ba4 into main on Apr 30, 2026
66 of 67 checks passed
yaythomas deleted the refactor/virtual-context branch on April 30, 2026 17:35
github-project-automation (bot) moved this from In review to Done in aws-durable-execution on Apr 30, 2026
Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

virtual child context orphaned parent_ids in execution history
Virtual child contexts emit FAIL checkpoints in FLAT mode

3 participants