Skip to content

fix(batch): cancel in-flight tasks when InfrahubBatch.execute() exits#1002

Open
PhillSimonds wants to merge 1 commit into
developfrom
phill-fix-batch-orphan-tasks
Open

fix(batch): cancel in-flight tasks when InfrahubBatch.execute() exits#1002
PhillSimonds wants to merge 1 commit into
developfrom
phill-fix-batch-orphan-tasks

Conversation

@PhillSimonds
Copy link
Copy Markdown
Contributor

Why

InfrahubBatch.execute() orphans in-flight tasks when one task raises with return_exceptions=False. Sibling work continues silently on the event loop; the caller has been told the batch failed and cannot retry; failed siblings surface only as Task exception was never retrieved GC warnings. Under a transient server-slowness window, this can yield build exit status 0 against a partial graph.

Closes #1001

What changed

  • infrahub_sdk/batch.py: wrapped the asyncio.as_completed iteration in try/finally that cancels any still-running task and gather(..., return_exceptions=True) to drain. Covers all generator exit paths (raise, caller break, generator close), not only the original raise path.
  • Tests:
    • test_execute_does_not_orphan_inflight_tasks_when_raising — regression guard for the orphan behavior. Asserts that sibling task side effects do not run to completion after execute() raises.
    • test_return_exceptions_yields_exceptions_indistinguishably_from_successes — pins the current return_exceptions=True contract (failures yield as (node, ExceptionInstance) in the same tuple shape as successes). Expected to change when that API is reworked; tracked separately.

How to review

Start with infrahub_sdk/batch.py. The diff is small and self-contained. The two new tests in tests/unit/sdk/test_batch.py run without HTTPX mocks (~0.4s total).

How to test

uv run pytest tests/unit/sdk/test_batch.py -v

Expected: 8 passed.

Impact & rollout

  • Backward compatibility: behavior change only on the failure path. Healthy batches behave identically. Callers that previously relied on orphaned siblings continuing in the background after their batch raised will no longer get that — that reliance was undefined behavior (and silent data loss).
  • Performance: unchanged.
  • Config/env changes: none.
  • Deployment notes: safe to deploy.

Checklist

  • Tests added/updated
  • Changelog entry added (changelog/1001.fixed.md)
  • External docs updated (n/a — internal behavior fix)
  • Internal .md docs updated (n/a)

InfrahubBatch.execute() created one asyncio.Task per BatchTask but did
not clean them up when the generator stopped iterating. With
return_exceptions=False, the first failing task caused execute() to
re-raise while siblings kept running on the event loop with no caller
awaiting them. Successful sibling work was discarded silently;
failing siblings surfaced only as "Task exception was never retrieved"
GC warnings.

Wrap the as_completed loop in try/finally that cancels any still-
running task and gather()s with return_exceptions=True. Covers all
generator exit paths: explicit raise, caller break, generator close.

Closes #1001
@PhillSimonds PhillSimonds requested a review from a team as a code owner May 8, 2026 17:41
@cloudflare-workers-and-pages
Copy link
Copy Markdown

Deploying infrahub-sdk-python with  Cloudflare Pages  Cloudflare Pages

Latest commit: 2d17f96
Status: ✅  Deploy successful!
Preview URL: https://3a664adc.infrahub-sdk-python.pages.dev
Branch Preview URL: https://phill-fix-batch-orphan-tasks.infrahub-sdk-python.pages.dev

View logs

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
infrahub_sdk/batch.py 80.00% 1 Missing and 1 partial ⚠️
@@             Coverage Diff             @@
##           develop    #1002      +/-   ##
===========================================
- Coverage    81.43%   81.42%   -0.02%     
===========================================
  Files          134      134              
  Lines        11359    11352       -7     
  Branches      1703     1705       +2     
===========================================
- Hits          9250     9243       -7     
  Misses        1566     1566              
  Partials       543      543              
Flag Coverage Δ
integration-tests 41.85% <60.00%> (-0.06%) ⬇️
python-3.10 54.36% <80.00%> (-0.05%) ⬇️
python-3.11 54.38% <80.00%> (-0.03%) ⬇️
python-3.12 54.36% <80.00%> (-0.03%) ⬇️
python-3.13 54.36% <80.00%> (-0.03%) ⬇️
python-3.14 54.36% <80.00%> (-0.03%) ⬇️
python-filler-3.12 22.73% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
infrahub_sdk/batch.py 94.73% <80.00%> (+0.37%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@ajtmccarty ajtmccarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functional code change looks good. comments on a the code comment and the second test

Comment thread infrahub_sdk/batch.py
Comment on lines +90 to +92
# Ensure no task created here outlives execute(). Cancel any still
# running, then drain so their exceptions are retrieved instead of
# surfacing later as "Task exception was never retrieved".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that you get, at most, 2 lines for a comment
When raising an error, all tasks not done are canceled and/or drained

Comment on lines +171 to +175
Failures are yielded as ``(node, ExceptionInstance)`` using the same tuple
shape as successes ``(node, result)``. The yielded ``node`` is whatever the
caller passed via ``batch.add(..., node=...)`` regardless of outcome, so a
consumer that does not ``isinstance``-check ``result`` cannot tell a failed
task from a successful one and will silently treat both as "created".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just the expected behavior when return_exceptions=True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants