Skip to content

fix: scope large binary storage and cleanup by execution id (#5280) [release/v1.2 backport]#5602

Open
kunwp1 wants to merge 2 commits into
apache:release/v1.2from
kunwp1:backport/5280-large-binary-execution-id-v1.2
Open

fix: scope large binary storage and cleanup by execution id (#5280) [release/v1.2 backport]#5602
kunwp1 wants to merge 2 commits into
apache:release/v1.2from
kunwp1:backport/5280-large-binary-execution-id-v1.2

Conversation

@kunwp1

@kunwp1 kunwp1 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Backport of #5280 ("scope large binary storage and cleanup by execution id") to release/v1.2.

As noted in #5569, #5280 cannot be backported as a single cherry-pick: its changes to large_binary_output_stream.py build on top of #5249 ("keep failed-upload cleanup inside the upload worker"), which is not present on release/v1.2. Cherry-picking #5280 alone leaves a dangling reference to the large_binary_manager module in _cleanup_failed_upload() (a method #5249 removes), producing a NameError at runtime.

This PR therefore backports the dependency chain, both as clean git cherry-pick -x of the original squash commits:

  1. fix: keep failed-upload cleanup inside the upload worker #5249 — keep failed-upload cleanup inside the upload worker (prerequisite)
  2. fix: scope large binary storage and cleanup by execution id #5280 — scope large binary storage and cleanup by execution id

#4707 (which #5249 follows up on) is already present on release/v1.2, so the chain stops there. After applying both commits, every file touched by #5280 is byte-identical to its state on main at the merged #5280 commit (48e800e4), and release/v1.2 had no independent changes to any of those files.

Any related issues, documentation, discussions?

Backports #5280 (which closes #4123 on main). Prerequisite: #5249. Unblocks the #5569 backport, which can then cherry-pick cleanly onto release/v1.2.

How was this PR tested?

This is a backport with no changes beyond the two cherry-picked commits, so it relies on the existing tests carried over from #5249 and #5280 (LargeBinaryManagerSpec, LargeBinaryManagerUnitSpec, test_large_binary_output_stream.py, test_large_binary_manager.py, etc.).

Backport fidelity was verified locally by confirming that, after the chain is applied onto release/v1.2:

Full compile and unit-test runs are left to CI.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Ma77Ball and others added 2 commits June 9, 2026 22:54
### What changes were proposed in this PR?
**What Caused the Issue:**
LargeBinaryOutputStream looked up the S3 client twice: once in the
upload worker (correct), and once again in close() during failed-upload
cleanup. When a test left a stream unclosed, Python's GC eventually
called __del__ → close() → the second lookup, but by then a different
test was active, so the cleanup hit the wrong test's mock_s3 and broke
its assert_called_once_with.

**Proposed Fix**
- Move `s3.delete_object(...)` from `_cleanup_failed_upload()` into the
upload worker, reusing the `s3` client already captured by the closure
that did the upload.
- Drop the `_cleanup_failed_upload()` method and the call to it from
`close()`; the worker now handles cleanup before recording the
exception.
- `close()` and `__del__` no longer call back into
`large_binary_manager`, so a finalizer firing under a later test's
monkey-patches cannot reach the wrong S3 client.
### Any related issues, documentation, or discussions?
Closes: apache#5245 Follow-up to apache#4707; surfaced on the 3.12 leg of
https://github.com/apache/texera/actions/runs/26481776334/job/77980417021.
### How was this PR tested?
- Ran `ruff format` and `ruff check` over `amber/src/main/python` and
`amber/src/test/python` (clean).
- Existing tests in `test_large_binary_output_stream.py` still cover the
relevant paths: `test_close_handles_upload_error`,
`test_delete_object_failure_is_swallowed`, and
`test_write_after_upload_error_raises_error`.
- Simplified `test_write_after_upload_error_raises_error` back to inline
form and removed the `_drained` helper, both no longer needed once
cleanup is structurally contained.
### Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.7 in compliance with ASF

---------

Signed-off-by: Matthew B. <mgball@uci.edu>
(cherry picked from commit 09aac02)
)

### What changes were proposed in this PR?

Large binaries were stored in the shared `texera-large-binaries` bucket
under flat keys `objects/{timestamp}/{uuid}` with no execution id, and
`clearExecutionResources(eid)` deleted all of them via
`LargeBinaryManager.deleteAllObjects()`. Any cleanup for one execution
therefore erased every other execution's (and user's) large binaries.

This PR namespaces every large binary by its execution id and scopes
deletion:

- Object keys are now `objects/{eid}/{uuid}` on both the JVM and Python
workers.
- The execution-scoped location is named by the controller and handed to
workers as data on `WorkerConfig` — no protobuf change. The controller
computes the base URI `s3://{bucket}/objects/{eid}/`, and `create()`
appends a unique suffix; the JVM seeds the base URI onto the
data-processing thread at startup, and the Python worker receives it as
a startup argument. The user-facing `largebinary()` / `new
LargeBinary()` APIs are unchanged.
- Cleanup uses the new `LargeBinaryManager.deleteByExecution(eid)`
(prefix delete of `objects/{eid}/`). Both JVM and Python engines share
the bucket and key shape, so this single JVM-side delete removes
binaries created by both.
- The `deleteAllObjects()` is removed.

Pre-existing objects under the old `objects/{timestamp}/...` scheme are
left untouched.

### Any related issues, documentation, discussions?

Closes apache#4123.

### How was this PR tested?

Import the following json file to create two workflows (You can
configure the source operator to use any kinds of files you have), run
them, and check if each execution creates 6 objects and one execution
doesn't remove the other execution's large binary objects.
[Large.Binary.Python
(1).json](https://github.com/user-attachments/files/28369502/Large.Binary.Python.1.json)

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), models Claude Opus 4.8, Claude
Opus 4.7, and Claude Sonnet 4.6

---------

Signed-off-by: Kunwoo (Chris) <143021053+kunwp1@users.noreply.github.com>
Co-authored-by: Xiaozhen Liu <xiaozl3@uci.edu>
(cherry picked from commit 48e800e)
@kunwp1 kunwp1 added the release/v1.2 back porting to release/v1.2 label Jun 10, 2026
@codecov-commenter

codecov-commenter commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.96%. Comparing base (8ab89e3) to head (17fe0f6).

Files with missing lines Patch % Lines
...pache/texera/service/util/LargeBinaryManager.scala 61.11% 4 Missing and 3 partials ⚠️
Additional details and impacted files
@@                Coverage Diff                 @@
##             release/v1.2    #5602      +/-   ##
==================================================
+ Coverage           51.41%   52.96%   +1.54%     
+ Complexity           2449     1410    -1039     
==================================================
  Files                1065      797     -268     
  Lines               41152    33085    -8067     
  Branches             4412     3317    -1095     
==================================================
- Hits                21159    17523    -3636     
+ Misses              18757    14721    -4036     
+ Partials             1236      841     -395     
Flag Coverage Δ *Carryforward flag
access-control-service 64.61% <ø> (ø)
agent-service 33.76% <ø> (ø) Carriedforward from 8ab89e3
amber 59.13% <61.11%> (+7.19%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø)
config-service 55.38% <ø> (ø)
file-service 38.21% <ø> (ø)
frontend 45.85% <ø> (ø) Carriedforward from 8ab89e3
pyamber 90.72% <100.00%> (+0.02%) ⬆️
python 90.75% <ø> (-0.09%) ⬇️ Carriedforward from 8ab89e3
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kunwp1 kunwp1 requested a review from xuang7 June 10, 2026 06:03
@xuang7 xuang7 removed the release/v1.2 back porting to release/v1.2 label Jun 10, 2026

@xuang7 xuang7 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants