Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea1a2a799d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Address maintainer feedback on PR #1065: 1. Bound the gRPC sink's forwarder channel by retry_max_buffer_size. Previously the Tokio mpsc was unbounded, so events accumulated freely while drive_stream was sleeping between reconnects or replaying the retry buffer — defeating the very knob meant to cap memory. The forwarder now uses try_send and signals overflow via an atomic flag; drive_stream sees the closed receiver, checks the flag, and exits with BufferFull so error_strategy applies. 2. Wire the configured timeout. RetryConfig.timeout was previously stored but never read, so 'bes_timeout' was a no-op. Wrap the sink's work in tokio::time::timeout so finite deadlines actually bound lifecycle calls, stream retries, and the final upload — matching Bazel's --bes_timeout semantics. 3. Drop the TODO about bessie authentication on the workflows BES sink: within the deployment, transport security is plain TLS.
|
|
The gRPC sink uses tonic::Status::unavailable directly, but tonic was only listed in Cargo.toml. Bazel's rust_library deps are the authoritative source, so the bazel build of //:cli failed with 'use of unresolved module or unlinked crate `tonic`'. Add @crates//:tonic alongside the other tokio-family deps.
Aspect CI Status ·
|
| Task | Job | Result | |
|---|---|---|---|
| ✅ passed | build |
build-task |
153 targets built |
| ✅ passed | build |
build-task |
153 targets built |
| ✅ passed | test |
test-task |
all 25 tests cached |
| ✅ passed | test |
test-task |
25/25 passed · 24 cached |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eecf20f93d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
The bounded mpsc with try_send introduced in c7702e9 caused BES handling hangs in CI. Revert to the prior unbounded forwarder so the sink behaves the same as before that commit. The timeout wrapper from work() is kept — it's a pure no-op when 'bes_timeout' defaults to "0s"/None, and it actually wires up the knob when the user sets a finite deadline. The memory bound concern raised on PR #1065 still stands as a follow-up; bounding the forwarder safely needs more thought than the simple try_send/overflow flag did.
After sending last_message and half-closing the request side, drive_stream waited for the server to close the response stream before returning Done. Some BES backends keep the response stream open after the final ack, which made every successful upload hang indefinitely at end-of-build — visible in CI as build/test/lint/ delivery jobs running until the 12-minute cancellation. Exit Done as soon as last_message_sent is true and the retry buffer drains. The same shortcut applies for the upstream-closed path: if we already half-closed without a last_message and the buffer drains, return UpstreamClosed without waiting for the server.
Symptom: CI Build Task hits "153 Targets Built" then sits until the 20-minute job timeout. wait()/sink_handles.join() is blocked on the gRPC sink thread because drive_stream is parked in response_stream.next() after the request side was half-closed — the server never sent another ack and never closed the response stream. The previous "exit Done once buffer drains" only fired on an ack, so a server that goes silent post-half-close pinned the entire build. Add a 30s hard deadline that arms when sender_opt is dropped (either after sending last_message or after the upstream broadcaster closed). If we hit it, return Done/UpstreamClosed — the events were already buffered into tonic before we dropped the sender, so further waiting only wedges the build.
… timeouts The sink thread had three remaining hang surfaces that all wedged wait() at end-of-build: - Client::new (initial TLS/transport handshake) - no deadline. Now wrapped in a 10s timeout. - publish_build_tool_event_stream (bidi stream open) - no deadline, treats hang as transient -> reconnect attempt. - retry_lifecycle attempts - only retried on errors, never on hangs. Each attempt now wrapped in a 10s timeout that surfaces as DeadlineExceeded so the existing retry loop covers it. - Post-stream lifecycle (invocation_finished + build_finished) - now wrapped in a 30s hard budget. Failing them is non-fatal under the default Warn strategy; the events themselves were already streamed. Added "BES: ..." eprintln traces around connect, lifecycle calls, drive_stream entry/exit, half-close, and deadline expiry - makes it obvious from a CI log where the sink stalls. GHA workflow ci.yaml jobs now timeout-minutes: 10. Release pipelines (build_release_artifacts, prerelease_*, publish_release, tag, weekly_tag) untouched.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a5022b876
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Two issues codex flagged: - drive_stream's reconnect path replays the buffered last_message but never half-closes the new stream, because the event-receive arm is disabled once last_message_sent is true. On entry, if last_message_sent is already true, close the request side ourselves and arm the half-close deadline immediately so a flaky backend cannot park drive_stream in response_stream.next() forever. - bes_timeout previously wrapped all of work_inner, terminating long builds whose runtime exceeded the upload timeout. Bazel documents --bes_timeout as the deadline for BES upload *after* the build finishes, so apply the user's timeout to the post-stream lifecycle budget instead. Default stays 30s.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a973b94456
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…=no-deadline Two codex findings: - When the half-close deadline fires with unacked events still in the retry buffer, drive_stream was returning Done so wait() ignored error_strategy. fail_at_end / abort users would see a clean success even though final BEP events were lost. Now Done / UpstreamClosed only fire when the buffer is empty at the deadline; otherwise return Transient(DeadlineExceeded) so the outer retry budget kicks in and, on exhaustion, surfaces a terminal SinkError. - Bazel's --bes_timeout (and our Starlark docs) say "0s" disables the deadline; the Starlark surface maps "0s" to None. The previous commit silently collapsed None to a 30s default, breaking the documented "no deadline" contract and risking a sink-induced fail_at_end / abort against slow BES backends. Restore: when timeout is None, await post_stream directly with no deadline.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d751cc3512
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Bazel documents --bes_timeout (and friends) as accepting `ms`, `s`, `m`, `h`, `d`. The previous parser rejected `d`, so a value copied from Bazel's reference like `"1d"` would fail validation before the build even started. Accept `d` (= 86_400s) and extend the unit test.
When Bazel is killed by a signal, `ExitStatus::code()` is None. The previous logic treated `code().unwrap_or(0) == 0` as eligible for sink-induced exit code 36, masking the abnormal termination (the comment above the block already said 36 is for the bazel-succeeded case). Only rewrite to Some(36) when result.code() is exactly Some(0); otherwise pass result.code() through, which yields None for signal kills.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 03bbe8f14b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…l exit The existing fail_at_end test only covered bazel-exited-0. The recent code() == Some(0) fix only matters when code() is None (signal kill) or non-zero — neither was covered. Add two basil-driven e2e cases: - grpc_error_strategy_fail_at_end_preserves_signal_kill: new basil scenario signal_killed_sigkill raises SIGKILL after flushing the event stream. wait() must report success=False, code=None, not collapse to Some(36). - grpc_error_strategy_fail_at_end_preserves_genuine_bazel_failure: basil scenario nonzero_exit exits 2. wait() must preserve 2, not overwrite with 36. To support this, basil's Scenario gains an ExitBehavior field (Code(n) or Signal(n)); existing scenarios default to Code(0). basil grows a libc dependency for raise(SIGKILL). Unit tests for finalize_exit_code complete the (bazel exit, fail_at_end) matrix without spawning basil.
…e tests basil's Scenario gains an ExitBehavior field so scenarios can model signal kills (Signal(SIGKILL)) and nonzero exits (Code(2)) in addition to clean exits. Existing scenarios default to Code(0). build.rs gains: - grpc_error_strategy_fail_at_end_preserves_signal_kill: bazel killed by SIGKILL must leave wait().code = None, not collapse to Some(36) under fail_at_end. - grpc_error_strategy_fail_at_end_preserves_genuine_bazel_failure: bazel exit 2 must surface as code = 2, not 36. - finalize_exit_code: extracted from wait() so unit tests can cover the full (bazel exit, fail_at_end) matrix without spawning basil.
Removes the libc dependency. Keeps the same behavior — `unsafe extern "C" fn raise(sig: i32) -> i32` is just the libc(3) signature. SIGKILL = 9 inlined to avoid the libc constant.
The Starlark parser no longer accepts "abort". The Abort enum variant stays so the internal execlog sink can still surface its terminal failures as Abort (those reflect real bugs, not flaky backends), but users can't reach it through bazel.build_events.grpc(error_strategy=...).
…t e2e
- Extract the wait() exit-code mapping into `finalize_exit_code` so it
can be unit-tested directly. The fix from earlier (Some(0) vs
unwrap_or(0)) is now covered by both unit tests and basil-driven
e2e tests.
- Drop `grpc_error_strategy_abort_propagates_as_eval_error`: the
Starlark surface no longer accepts "abort", so this test would fail
validation.
- New e2e tests (use basil scenarios `signal_killed_sigkill` and
`nonzero_exit`):
* preserves_signal_kill: fail_at_end with bazel killed by signal
must report code=None, not Some(36).
* preserves_genuine_bazel_failure: fail_at_end with bazel exit 2
must report code=2, not 36.
- Unit tests for finalize_exit_code cover the full
(bazel_code, fail_at_end) matrix.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c7ba09b82
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…y docs Codex P3: the Starlark parser rejects "abort" but the public docstring still listed it as a valid value, so configs copied from the docs would fail validation before the build started. List only what the parser actually accepts: `"fail_at_end"`, `"warn"` (default), `"ignore"`.
Changes are visible to end-users: yes
Test plan