Skip to content

feat: support failure knobs for sinks#1065

Merged
thesayyn merged 22 commits into
mainfrom
bes-retry
May 11, 2026
Merged

feat: support failure knobs for sinks#1065
thesayyn merged 22 commits into
mainfrom
bes-retry

Conversation

@thesayyn
Copy link
Copy Markdown
Member

@thesayyn thesayyn commented May 6, 2026

Changes are visible to end-users: yes

  • Searched for relevant documentation and updated as needed: yes
  • Breaking change (forces users to change their own code or config): no
  • Suggested release notes appear below: yes/no

Test plan

  • New test cases added

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea1a2a799d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs
Comment thread crates/axl-runtime/src/engine/bazel/mod.rs
Comment thread crates/aspect-cli/src/builtins/aspect/feature/workflows.axl
Comment thread crates/aspect-cli/src/builtins/aspect/feature/workflows.axl Outdated
Address maintainer feedback on PR #1065:

1. Bound the gRPC sink's forwarder channel by retry_max_buffer_size.
   Previously the Tokio mpsc was unbounded, so events accumulated
   freely while drive_stream was sleeping between reconnects or
   replaying the retry buffer — defeating the very knob meant to cap
   memory. The forwarder now uses try_send and signals overflow via
   an atomic flag; drive_stream sees the closed receiver, checks the
   flag, and exits with BufferFull so error_strategy applies.

2. Wire the configured timeout. RetryConfig.timeout was previously
   stored but never read, so 'bes_timeout' was a no-op. Wrap the
   sink's work in tokio::time::timeout so finite deadlines actually
   bound lifecycle calls, stream retries, and the final upload —
   matching Bazel's --bes_timeout semantics.

3. Drop the TODO about bessie authentication on the workflows BES
   sink: within the deployment, transport security is plain TLS.
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

The gRPC sink uses tonic::Status::unavailable directly, but tonic
was only listed in Cargo.toml. Bazel's rust_library deps are the
authoritative source, so the bazel build of //:cli failed with
'use of unresolved module or unlinked crate `tonic`'.

Add @crates//:tonic alongside the other tokio-family deps.
@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented May 7, 2026

Aspect CI Status · 2026-05-11T20:26:04Z

Task Job Result
✅ passed build build-task 153 targets built
✅ passed build build-task 153 targets built
✅ passed test test-task all 25 tests cached
✅ passed test test-task 25/25 passed · 24 cached

@thesayyn
Copy link
Copy Markdown
Member Author

thesayyn commented May 7, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eecf20f93d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs Outdated
claude and others added 7 commits May 8, 2026 03:22
The bounded mpsc with try_send introduced in c7702e9 caused BES
handling hangs in CI. Revert to the prior unbounded forwarder so
the sink behaves the same as before that commit.

The timeout wrapper from work() is kept — it's a pure no-op when
'bes_timeout' defaults to "0s"/None, and it actually wires up the
knob when the user sets a finite deadline.

The memory bound concern raised on PR #1065 still stands as a
follow-up; bounding the forwarder safely needs more thought than
the simple try_send/overflow flag did.
After sending last_message and half-closing the request side,
drive_stream waited for the server to close the response stream
before returning Done. Some BES backends keep the response stream
open after the final ack, which made every successful upload hang
indefinitely at end-of-build — visible in CI as build/test/lint/
delivery jobs running until the 12-minute cancellation.

Exit Done as soon as last_message_sent is true and the retry
buffer drains. The same shortcut applies for the upstream-closed
path: if we already half-closed without a last_message and the
buffer drains, return UpstreamClosed without waiting for the
server.
Symptom: CI Build Task hits "153 Targets Built" then sits until
the 20-minute job timeout. wait()/sink_handles.join() is blocked
on the gRPC sink thread because drive_stream is parked in
response_stream.next() after the request side was half-closed —
the server never sent another ack and never closed the response
stream. The previous "exit Done once buffer drains" only fired
on an ack, so a server that goes silent post-half-close pinned
the entire build.

Add a 30s hard deadline that arms when sender_opt is dropped
(either after sending last_message or after the upstream
broadcaster closed). If we hit it, return Done/UpstreamClosed —
the events were already buffered into tonic before we dropped
the sender, so further waiting only wedges the build.
… timeouts

The sink thread had three remaining hang surfaces that all wedged
wait() at end-of-build:

- Client::new (initial TLS/transport handshake) - no deadline.
  Now wrapped in a 10s timeout.
- publish_build_tool_event_stream (bidi stream open) - no deadline,
  treats hang as transient -> reconnect attempt.
- retry_lifecycle attempts - only retried on errors, never on hangs.
  Each attempt now wrapped in a 10s timeout that surfaces as
  DeadlineExceeded so the existing retry loop covers it.
- Post-stream lifecycle (invocation_finished + build_finished) - now
  wrapped in a 30s hard budget. Failing them is non-fatal under the
  default Warn strategy; the events themselves were already streamed.

Added "BES: ..." eprintln traces around connect, lifecycle calls,
drive_stream entry/exit, half-close, and deadline expiry - makes it
obvious from a CI log where the sink stalls.

GHA workflow ci.yaml jobs now timeout-minutes: 10. Release pipelines
(build_release_artifacts, prerelease_*, publish_release, tag,
weekly_tag) untouched.
@thesayyn
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a5022b876

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs
Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs Outdated
Two issues codex flagged:

- drive_stream's reconnect path replays the buffered last_message
  but never half-closes the new stream, because the event-receive
  arm is disabled once last_message_sent is true. On entry, if
  last_message_sent is already true, close the request side
  ourselves and arm the half-close deadline immediately so a flaky
  backend cannot park drive_stream in response_stream.next()
  forever.

- bes_timeout previously wrapped all of work_inner, terminating
  long builds whose runtime exceeded the upload timeout. Bazel
  documents --bes_timeout as the deadline for BES upload *after*
  the build finishes, so apply the user's timeout to the
  post-stream lifecycle budget instead. Default stays 30s.
@thesayyn
Copy link
Copy Markdown
Member Author

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a973b94456

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs
Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs Outdated
…=no-deadline

Two codex findings:

- When the half-close deadline fires with unacked events still in the
  retry buffer, drive_stream was returning Done so wait() ignored
  error_strategy. fail_at_end / abort users would see a clean
  success even though final BEP events were lost. Now Done /
  UpstreamClosed only fire when the buffer is empty at the deadline;
  otherwise return Transient(DeadlineExceeded) so the outer retry
  budget kicks in and, on exhaustion, surfaces a terminal SinkError.

- Bazel's --bes_timeout (and our Starlark docs) say "0s" disables
  the deadline; the Starlark surface maps "0s" to None. The previous
  commit silently collapsed None to a 30s default, breaking the
  documented "no deadline" contract and risking a sink-induced
  fail_at_end / abort against slow BES backends. Restore: when
  timeout is None, await post_stream directly with no deadline.
@thesayyn
Copy link
Copy Markdown
Member Author

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d751cc3512

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/build.rs Outdated
Comment thread crates/axl-runtime/src/engine/bazel/sink/retry.rs
thesayyn added 2 commits May 11, 2026 12:35
Bazel documents --bes_timeout (and friends) as accepting `ms`, `s`,
`m`, `h`, `d`. The previous parser rejected `d`, so a value copied
from Bazel's reference like `"1d"` would fail validation before the
build even started. Accept `d` (= 86_400s) and extend the unit test.
When Bazel is killed by a signal, `ExitStatus::code()` is None.
The previous logic treated `code().unwrap_or(0) == 0` as eligible
for sink-induced exit code 36, masking the abnormal termination
(the comment above the block already said 36 is for the
bazel-succeeded case). Only rewrite to Some(36) when result.code()
is exactly Some(0); otherwise pass result.code() through, which
yields None for signal kills.
@thesayyn
Copy link
Copy Markdown
Member Author

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 03bbe8f14b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/build.rs
Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs
thesayyn added 7 commits May 11, 2026 12:46
…l exit

The existing fail_at_end test only covered bazel-exited-0. The
recent code() == Some(0) fix only matters when code() is None
(signal kill) or non-zero — neither was covered.

Add two basil-driven e2e cases:

- grpc_error_strategy_fail_at_end_preserves_signal_kill: new basil
  scenario signal_killed_sigkill raises SIGKILL after flushing the
  event stream. wait() must report success=False, code=None, not
  collapse to Some(36).
- grpc_error_strategy_fail_at_end_preserves_genuine_bazel_failure:
  basil scenario nonzero_exit exits 2. wait() must preserve 2, not
  overwrite with 36.

To support this, basil's Scenario gains an ExitBehavior field
(Code(n) or Signal(n)); existing scenarios default to Code(0).
basil grows a libc dependency for raise(SIGKILL).

Unit tests for finalize_exit_code complete the (bazel exit,
fail_at_end) matrix without spawning basil.
…e tests

basil's Scenario gains an ExitBehavior field so scenarios can model
signal kills (Signal(SIGKILL)) and nonzero exits (Code(2)) in
addition to clean exits. Existing scenarios default to Code(0).

build.rs gains:
- grpc_error_strategy_fail_at_end_preserves_signal_kill: bazel
  killed by SIGKILL must leave wait().code = None, not collapse to
  Some(36) under fail_at_end.
- grpc_error_strategy_fail_at_end_preserves_genuine_bazel_failure:
  bazel exit 2 must surface as code = 2, not 36.
- finalize_exit_code: extracted from wait() so unit tests can cover
  the full (bazel exit, fail_at_end) matrix without spawning basil.
Removes the libc dependency. Keeps the same behavior — `unsafe extern "C" fn raise(sig: i32) -> i32` is just the libc(3) signature.
SIGKILL = 9 inlined to avoid the libc constant.
The Starlark parser no longer accepts "abort". The Abort enum variant
stays so the internal execlog sink can still surface its terminal
failures as Abort (those reflect real bugs, not flaky backends), but
users can't reach it through bazel.build_events.grpc(error_strategy=...).
…t e2e

- Extract the wait() exit-code mapping into `finalize_exit_code` so it
  can be unit-tested directly. The fix from earlier (Some(0) vs
  unwrap_or(0)) is now covered by both unit tests and basil-driven
  e2e tests.

- Drop `grpc_error_strategy_abort_propagates_as_eval_error`: the
  Starlark surface no longer accepts "abort", so this test would fail
  validation.

- New e2e tests (use basil scenarios `signal_killed_sigkill` and
  `nonzero_exit`):
  * preserves_signal_kill: fail_at_end with bazel killed by signal
    must report code=None, not Some(36).
  * preserves_genuine_bazel_failure: fail_at_end with bazel exit 2
    must report code=2, not 36.

- Unit tests for finalize_exit_code cover the full
  (bazel_code, fail_at_end) matrix.
@thesayyn
Copy link
Copy Markdown
Member Author

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c7ba09b82

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/mod.rs Outdated
…y docs

Codex P3: the Starlark parser rejects "abort" but the public docstring
still listed it as a valid value, so configs copied from the docs would
fail validation before the build started. List only what the parser
actually accepts: `"fail_at_end"`, `"warn"` (default), `"ignore"`.
@thesayyn thesayyn merged commit 5d62231 into main May 11, 2026
19 of 20 checks passed
@thesayyn thesayyn deleted the bes-retry branch May 11, 2026 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants