fix: count shared buffers once in hash join build-side memory accounting by jordepic · Pull Request #22862 · apache/datafusion

jordepic · 2026-06-09T22:39:57Z

Which issue does this PR close?

Closes #22861 (Accurately reserve memory in the build side of hash joins).

Rationale for this change

When using DataFusion Comet, I noticed that my hash join operator was failing with the following error: Failed to acquire 142606336 bytes where 17142251456 bytes already reserved and the fair limit is 17179869184 bytes, 4 registered.

Looking into this more, DataFusion asks to reserve memory for each batch (by default 8192 rows) of the build side of a hash join, and tries to reserve (without actually allocating it) num_batches * batch_size. This is problematic when these batches are zero-copy slices of a larger batch (e.g. the output of GroupedHashAggregateStream), since the size of each slice is evaluated to be the size of the larger buffer. This is because the reference to the slice actually keeps the entire buffer from being freed.

DataFusion doesn't overallocate memory (the underlying data is the same), but it does over-request it (in the centralized accounting system), which can lead to these "ResourcesExhausted" errors.
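As a quick sanity check on the numbers in that error message, here is a small sketch. The `would_exceed` function is a hypothetical stand-in that assumes the pool rejects a request when the existing reservation plus the request goes over the limit; the real pool logic is not shown in this PR and may differ.

```rust
/// Hypothetical check mirroring the numbers in the error message.
/// Assumption: a request fails when `already_reserved + requested > limit`.
fn would_exceed(already_reserved: u64, requested: u64, limit: u64) -> bool {
    already_reserved + requested > limit
}

fn main() {
    let requested: u64 = 142_606_336; // value from the error message
    let already_reserved: u64 = 17_142_251_456; // value from the error message
    let fair_limit: u64 = 17_179_869_184; // value from the error message

    // The request is exactly 136 MiB and the limit is exactly 16 GiB.
    assert_eq!(requested, 136 * 1024 * 1024);
    assert_eq!(fair_limit, 16 * 1024 * 1024 * 1024);

    // The reservation is already close to the limit, so one more
    // request of this size pushes it over.
    assert!(would_exceed(already_reserved, requested, fair_limit));

    let overshoot = already_reserved + requested - fair_limit;
    println!("request would exceed the limit by {overshoot} bytes");
}
```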

What changes are included in this PR?

In this change, we keep track of all of the buffers that we've already counted via a set of pointers. This way, we don't redundantly request memory for the whole arrow buffer for each sub-slice of it. We choose this approach as opposed to just requesting a smaller amount of memory per batch, because as mentioned before, the pointer to each batch technically keeps the entire arrow-buffer from being freed.
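To illustrate the idea, here is a minimal std-only sketch. `Slice`, `make_slices`, `naive_total`, and `deduplicated_total` are hypothetical names for this example, and an `Arc<Vec<u8>>` stands in for an Arrow buffer; this is not the code in the PR.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Stand-in for a zero-copy slice: a shared allocation plus a window into it.
/// (Hypothetical type for illustration; not the arrow-rs `Buffer`.)
struct Slice {
    allocation: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl Slice {
    /// The bytes this slice actually views.
    fn view(&self) -> &[u8] {
        &self.allocation[self.offset..self.offset + self.len]
    }

    /// Size charged for this slice: the whole backing allocation, because
    /// holding the slice keeps all of it alive.
    fn retained_bytes(&self) -> usize {
        self.allocation.len()
    }

    /// Start address of the backing allocation, used as its identity.
    fn allocation_addr(&self) -> usize {
        self.allocation.as_ptr() as usize
    }
}

/// Splits `parent` into `n` equally sized zero-copy slices (`n` must be > 0).
fn make_slices(parent: &Arc<Vec<u8>>, n: usize) -> Vec<Slice> {
    let len = parent.len() / n;
    (0..n)
        .map(|i| Slice { allocation: Arc::clone(parent), offset: i * len, len })
        .collect()
}

/// Charges every slice the full allocation (the over-requesting behaviour).
fn naive_total(slices: &[Slice]) -> usize {
    slices.iter().map(Slice::retained_bytes).sum()
}

/// Charges each distinct allocation once, remembering addresses already seen.
fn deduplicated_total(slices: &[Slice], counted: &mut HashSet<usize>) -> usize {
    slices
        .iter()
        .filter(|s| counted.insert(s.allocation_addr()))
        .map(Slice::retained_bytes)
        .sum()
}

fn main() {
    let parent = Arc::new(vec![0u8; 8 * 8192]);
    let slices = make_slices(&parent, 8);
    assert_eq!(slices[3].view().len(), 8192);

    // Eight slices, each charged the whole parent allocation.
    assert_eq!(naive_total(&slices), 8 * parent.len());

    // With the shared set, the parent allocation is charged once.
    let mut counted = HashSet::new();
    assert_eq!(deduplicated_total(&slices, &mut counted), parent.len());
}
```

Because the set is passed in by the caller, a second call with slices of the same parent contributes nothing further to the total.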

Are these changes tested?

The new hash join test fails on main with ResourcesExhausted and passes with this change.

Are there any user-facing changes?

No breaking changes. Adds a new public helper count_record_batch_memory_size to datafusion-common.

2010YOUY01

Thank you! Should be good to go after CI passes.

I left a suggestion for you to consider.

2010YOUY01 · 2026-06-10T00:45:14Z

+/// [`get_record_batch_memory_size`] on each such slice counts the shared
+/// buffers once per slice, while sharing `counted_buffers` across the calls
+/// counts each buffer exactly once.
+pub fn count_record_batch_memory_size(


We could make it a deeper module like

/// Tracks memory already accounted for across multiple `RecordBatch`es.
///
/// Some batches may share the same underlying Arrow buffers, for example when
/// they are zero-copy slices of a larger batch. This counter remembers buffer
/// start addresses so shared buffers are counted only once.
#[derive(Default)]
struct RecordBatchMemoryCounter {
    accounted_buffers: HashSet<usize>,
    memory_usage: usize,
}

impl RecordBatchMemoryCounter {
    /// Accounts for buffers in `batch` that have not already been seen.
    fn count_batch(&mut self, batch: &RecordBatch) {
        self.memory_usage += count_record_batch_memory_size(
            batch,
            &mut self.accounted_buffers,
        );
    }

    /// Returns the total memory accounted for so far.
    fn memory_usage(&self) -> usize {
        self.memory_usage
    }
}

The benefit is that users such as HashJoinExec do not need to know about the implementation details (e.g. the HashSet used to track buffer addresses). It also makes the intent clearer for readers who are not already familiar with this context.

Since this issue is not specific to hash joins, a simpler abstraction could make it easier to reuse.
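For readers following along, a rough sketch of how a caller might use a counter of this shape: grow the reservation only by the bytes that each new batch adds. Everything here is a std-only stand-in (`Batch`, `MemoryCounter`, `Reservation`, and `collect_build_side` are hypothetical), and whether the PR wires the counter into the hash join exactly this way is an assumption.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a `RecordBatch`: one backing buffer per column.
struct Batch {
    columns: Vec<Arc<Vec<u8>>>,
}

/// Simplified counter with the same shape as the struct suggested above.
#[derive(Default)]
struct MemoryCounter {
    accounted_buffers: HashSet<usize>,
    memory_usage: usize,
}

impl MemoryCounter {
    /// Adds the size of every buffer in `batch` that has not been seen yet.
    fn count_batch(&mut self, batch: &Batch) {
        for buffer in &batch.columns {
            if self.accounted_buffers.insert(buffer.as_ptr() as usize) {
                self.memory_usage += buffer.len();
            }
        }
    }

    fn memory_usage(&self) -> usize {
        self.memory_usage
    }
}

/// Toy reservation with a hard limit (assumed; not DataFusion's `MemoryReservation`).
struct Reservation {
    size: usize,
    limit: usize,
}

impl Reservation {
    fn try_grow(&mut self, additional: usize) -> Result<(), String> {
        if self.size + additional > self.limit {
            return Err(format!("failed to acquire {additional} bytes"));
        }
        self.size += additional;
        Ok(())
    }
}

/// Collects batches, growing the reservation only by newly seen bytes.
/// Returns the per-batch amounts that were requested.
fn collect_build_side(
    batches: &[Batch],
    reservation: &mut Reservation,
) -> Result<Vec<usize>, String> {
    let mut counter = MemoryCounter::default();
    let mut deltas = Vec::new();
    for batch in batches {
        let before = counter.memory_usage();
        counter.count_batch(batch);
        let delta = counter.memory_usage() - before;
        reservation.try_grow(delta)?;
        deltas.push(delta);
    }
    Ok(deltas)
}

fn main() {
    let parent = Arc::new(vec![0u8; 1024]);
    let batches: Vec<Batch> = (0..4)
        .map(|_| Batch { columns: vec![Arc::clone(&parent)] })
        .collect();

    // Charging 4 * 1024 bytes would not fit in 2048; charging once does.
    let mut reservation = Reservation { size: 0, limit: 2048 };
    let deltas = collect_build_side(&batches, &mut reservation).unwrap();
    assert_eq!(deltas, vec![1024, 0, 0, 0]);
    assert_eq!(reservation.size, 1024);
}
```

Note that the first batch carries the whole charge and later batches that share its buffers add nothing.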

@2010YOUY01 I went ahead and implemented that, thank you for the speedy review! Feel free to merge it once CI passes :)

github-actions · 2026-06-10T00:49:24Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-common v54.0.0 (current)
       Built [  43.064s] (current)
     Parsing datafusion-common v54.0.0 (current)
      Parsed [   0.064s] (current)
    Building datafusion-common v54.0.0 (baseline)
       Built [  33.981s] (baseline)
     Parsing datafusion-common v54.0.0 (baseline)
      Parsed [   0.066s] (baseline)
    Checking datafusion-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.050s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_missing.ron

Failed in:
  function datafusion_common::validate_range_split_points, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/06d741d036fb4072764612bc9ba5d13c496fa1cc/datafusion/common/src/partitioning.rs:73

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_common::SplitPoint, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/06d741d036fb4072764612bc9ba5d13c496fa1cc/datafusion/common/src/partitioning.rs:43

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  80.763s] datafusion-common
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  36.267s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.140s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  35.711s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.142s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.896s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  74.967s] datafusion-physical-plan

Samyak2

There was some discussion here regarding this: #22526 (comment)

Looks good to me. We have been using a very similar API internally to fix such issues. Left a minor comment

Samyak2 · 2026-06-10T03:31:50Z

+pub struct RecordBatchMemoryCounter {
+    /// Start addresses of `Buffer`s that have already been counted (instead of
+    /// actual used data region's pointer represented by current `Array`)
+    counted_buffers: HashSet<usize>,


We could use NonZero<usize> instead and then use as_ptr().addr() below.
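One possible reading of this suggestion, sketched with std types only. `buffer_key` is a hypothetical helper, and a plain `Vec<u8>` pointer stands in for the buffer's start pointer. The sketch uses an `as usize` cast so it builds on older toolchains; on a recent toolchain the pointer's `addr()` method could be used instead.

```rust
use std::collections::HashSet;
use std::num::NonZeroUsize;
use std::ptr::NonNull;

/// Converts a non-null buffer start pointer into a non-zero address key.
/// (Hypothetical helper for illustration.)
fn buffer_key(ptr: NonNull<u8>) -> NonZeroUsize {
    // A `NonNull` pointer is never null, so its address is never zero.
    NonZeroUsize::new(ptr.as_ptr() as usize).expect("NonNull address is non-zero")
}

fn main() {
    let mut data = vec![1u8, 2, 3];
    let ptr = NonNull::new(data.as_mut_ptr()).expect("Vec pointer is never null");

    let mut counted: HashSet<NonZeroUsize> = HashSet::new();
    assert!(counted.insert(buffer_key(ptr))); // first sighting
    assert!(!counted.insert(buffer_key(ptr))); // same buffer, already counted

    // The type records that zero is impossible, which also lets
    // `Option<NonZeroUsize>` fit in the space of a plain `usize`.
    assert_eq!(
        std::mem::size_of::<Option<NonZeroUsize>>(),
        std::mem::size_of::<usize>()
    );
}
```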

ariel-miculas · 2026-06-10T07:40:13Z

I don't like this approach because it's a workaround for the hash aggregate issue, which is going to be reworked; additionally, it's not accurate memory accounting: when the first batch arrives, it reserves the memory for all the subsequent batches; this reservation could fail, meaning that if the operator supports spilling, it would spill on every batch (because each batch carries the entire memory reservation with it)
Since GroupedHashAggregateStream could have multiple downstream operators, is the plan to change the memory accounting for those as well?

2010YOUY01 · 2026-06-10T08:43:02Z

I don't like this approach because it's a workaround for the hash aggregate issue, which is going to be reworked; additionally, it's not accurate memory accounting: when the first batch arrives, it reserves the memory for all the subsequent batches; this reservation could fail, meaning that if the operator supports spilling, it would spill on every batch (because each batch carries the entire memory reservation with it)

Since GroupedHashAggregateStream could have multiple downstream operators, is the plan to change the memory accounting for those as well?

I agree that fixing the root issue in hash aggregate is the long-term solution.

The idea that operators should avoid returning small sliced batches backed by large buffers sounds like a reasonable property to follow. AggregateExec is the only violation I'm aware of today, though I'm not sure such a property can always be guaranteed given the diverse requirements of downstream projects.

So I think it's a good idea to be defensive in memory-intensive operators, and this PR makes the situation better (arguably in an entropy-reducing way). It also makes sense to apply the same approach to other memory-intensive operators.

I'm wondering whether you have a better alternative in mind, or if there are other concerns that would suggest we should not merge this?

ariel-miculas · 2026-06-10T09:09:53Z

I'm wondering whether you have a better alternative in mind, or if there are other concerns that would suggest we should not merge this?

For this particular issue I think get_sliced_size would be a better option, see: #22526 (comment)

Another concern I have is that operators should follow the same approach for reserving RecordBatches. As I mentioned before, this issue applies to other downstream consumers of HashAggregate.
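To make the trade-off between the two policies concrete, here is a small std-only sketch. It assumes that a sliced-size policy charges only the bytes a slice views (the referenced comment is not reproduced here, so the exact semantics of get_sliced_size are an assumption). `Slice`, `sliced_bytes`, and `retained_bytes` are hypothetical names.

```rust
use std::sync::Arc;

/// Stand-in for a zero-copy slice of a shared allocation (hypothetical type).
struct Slice {
    allocation: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl Slice {
    fn view(&self) -> &[u8] {
        &self.allocation[self.offset..self.offset + self.len]
    }

    /// Bytes the slice views (the assumed "sliced size" policy).
    fn sliced_bytes(&self) -> usize {
        self.view().len()
    }

    /// Bytes the slice keeps alive (the whole backing allocation).
    fn retained_bytes(&self) -> usize {
        self.allocation.len()
    }
}

fn main() {
    let parent = Arc::new(vec![0u8; 1000]);
    let slices: Vec<Slice> = (0..4)
        .map(|i| Slice { allocation: Arc::clone(&parent), offset: i * 250, len: 250 })
        .collect();

    // When the slices tile the whole parent, sliced sizes add up to it.
    let total: usize = slices.iter().map(Slice::sliced_bytes).sum();
    assert_eq!(total, parent.len());

    // If only one slice survives, the sliced size under-reports what is
    // still held: 250 bytes viewed, 1000 bytes kept alive.
    let survivor = slices.into_iter().last().unwrap();
    drop(parent);
    assert_eq!(survivor.sliced_bytes(), 250);
    assert_eq!(survivor.retained_bytes(), 1000);
}
```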

ariel-miculas · 2026-06-10T09:24:29Z

AggregateExec is the only violation I'm aware of today, though I'm not sure such a property can always be guaranteed given the diverse requirements of downstream projects.

I also think we should design the memory accounting changes more intentionally, understand how we want the memory accounting feature to work and deduplicate the issues, e.g. I've raised a similar issue in #22526

2010YOUY01 · 2026-06-10T09:52:32Z

AggregateExec is the only violation I'm aware of today, though I'm not sure such a property can always be guaranteed given the diverse requirements of downstream projects.

I also think we should design the memory accounting changes more intentionally, understand how we want the memory accounting feature to work and deduplicate the issues, e.g. I've raised a similar issue in #22526

Yes, I agree. I believe the root cause of many existing memory-limited query bugs is that the current memory tracking protocol is ambiguous, and the behavior of operators and the memory pool is inconsistent.

We should definitely think through a better design for both operators and the memory pool, and then specify a clear protocol, so we can coordinate efforts around this issue.

For this particular issue I think get_sliced_size would be a better option, see: #22526 (comment)

I'm not sure which solution is better at this point. This requires deeper, more holistic thinking about the overall spilling-query design and may challenge some of our existing assumptions. I'll spend more time thinking about it later.

Regarding this PR, I don't think it is a major architectural commitment. Even if we later decide to switch to a get_sliced_size approach, that change can easily be made entirely within this module. It also doesn't seem to make the system worse in the meantime. So I'm inclined to merge it first, while leaving it open for a few more days for additional discussion.

Samyak2 · 2026-06-10T09:52:33Z

I don't like this approach because it's a workaround for the hash aggregate issue, which is going to be reworked; additionally, it's not accurate memory accounting: when the first batch arrives, it reserves the memory for all the subsequent batches; this reservation could fail, meaning that if the operator supports spilling, it would spill on every batch (because each batch carries the entire memory reservation with it)

Continuing from #22526 (comment), this happens with the current memory tracking too. I see this PR as a strict improvement over the current get_record_batch_memory_size usage.

Since GroupedHashAggregateStream could have multiple downstream operators, is the plan to change the memory accounting for those as well?

I would imagine so. As an example, the same problem currently exists with Repartition over Aggregate. The same fix would be needed in any operator that stores a sequence of RecordBatches.

For Repartition specifically, we would need to extend this API to be able to remove a record batch too.

ariel-miculas · 2026-06-10T10:00:18Z

Ok, I agree it's improving the existing state and there's no better short-term solution.

The hash join build side reserves get_record_batch_memory_size(&batch) per collected batch. That function deduplicates shared buffers only within a single batch, so when the build input emits zero-copy slices of one larger batch (e.g. GroupedHashAggregateStream emitting its result in batch_size chunks), every slice is charged the full parent allocation: an aggregate output of S bytes in n slices reserves n * S for S bytes of physical memory. Since the build collection cannot spill, the inflated reservation aborts queries that fit in memory with large headroom (observed: 26GB reserved for 136MB resident). Add RecordBatchMemoryCounter, which tracks the buffers counted so far across a sequence of batches and counts each buffer exactly once, and use it in the build-side collection so each buffer is reserved exactly once.

jordepic · 2026-06-10T11:12:34Z

Thanks for the detailed discussion all, sorry I was asleep for all of it!

Understood on all points about looking for a longer-term solution, but as things stand this is a pretty nefarious issue in DataFusion, and I think a short-term patch that's easily modifiable (and not really part of the public API) is the best route to take.

ariel-miculas · 2026-06-10T11:20:45Z

+    memory_usage: usize,
+}
+
+impl RecordBatchMemoryCounter {


It would be useful to also have a clear method for operators which spill and want to reset the memory counter

Can you elaborate on the API you're expecting? I'm also happy to just cross that bridge as we need it, as nobody is calling it just yet

a clear method which resets memory_usage to 0 and clears the counted_buffers hash_set

I had an alternate API in mind, but I don't know the details of spilling, so not sure if this is viable.

An uncount_batch (or similarly named) method that stops tracking a batch. This would mean a HashMap of pointer -> number of occurrences instead of a HashSet. This API is needed for RepartitionExec after a batch is consumed from the channel. I was thinking we could use the same for spill.

I would also be inclined towards adding this when we need it, instead of making this PR bigger.
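For reference, the two API ideas discussed in this thread could look roughly like the following. This is a std-only sketch under stated assumptions: `Batch` and `RefCountingCounter` are hypothetical stand-ins, and the names `uncount_batch` and `clear` come from the comments above rather than from any merged code.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a `RecordBatch`: one backing buffer per column.
struct Batch {
    columns: Vec<Arc<Vec<u8>>>,
}

/// Counter that can also stop tracking batches (illustrative only).
#[derive(Default)]
struct RefCountingCounter {
    /// Buffer start address -> number of tracked references to it.
    buffer_refs: HashMap<usize, usize>,
    memory_usage: usize,
}

impl RefCountingCounter {
    /// Charges a buffer only the first time it is seen.
    fn count_batch(&mut self, batch: &Batch) {
        for buffer in &batch.columns {
            let refs = self.buffer_refs.entry(buffer.as_ptr() as usize).or_insert(0);
            if *refs == 0 {
                self.memory_usage += buffer.len();
            }
            *refs += 1;
        }
    }

    /// Releases a buffer's charge once its last tracked reference is removed.
    fn uncount_batch(&mut self, batch: &Batch) {
        for buffer in &batch.columns {
            let addr = buffer.as_ptr() as usize;
            let remaining = match self.buffer_refs.get_mut(&addr) {
                Some(refs) => {
                    *refs -= 1;
                    *refs
                }
                None => continue, // batch was never counted
            };
            if remaining == 0 {
                self.buffer_refs.remove(&addr);
                self.memory_usage -= buffer.len();
            }
        }
    }

    /// Forgets everything, e.g. after an operator spills its batches.
    fn clear(&mut self) {
        self.buffer_refs.clear();
        self.memory_usage = 0;
    }

    fn memory_usage(&self) -> usize {
        self.memory_usage
    }
}

fn main() {
    let parent = Arc::new(vec![0u8; 512]);
    let a = Batch { columns: vec![Arc::clone(&parent)] };
    let b = Batch { columns: vec![Arc::clone(&parent)] };

    let mut counter = RefCountingCounter::default();
    counter.count_batch(&a);
    counter.count_batch(&b);
    assert_eq!(counter.memory_usage(), 512);

    // Dropping one of two batches keeps the shared buffer charged.
    counter.uncount_batch(&a);
    assert_eq!(counter.memory_usage(), 512);
    counter.uncount_batch(&b);
    assert_eq!(counter.memory_usage(), 0);

    counter.count_batch(&a);
    counter.clear();
    assert_eq!(counter.memory_usage(), 0);
}
```

The reference count is what the HashSet version cannot express: with a plain set there is no way to know whether another tracked batch still shares a buffer.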

github-actions Bot added the common (Related to common crate) and physical-plan (Changes to the physical-plan crate) labels Jun 9, 2026

2010YOUY01 approved these changes Jun 10, 2026

github-actions Bot added the auto detected api change label Jun 10, 2026

jordepic force-pushed the fix-hash-join-build-side-shared-buffer-accounting branch 2 times, most recently from bcb4574 to 6443c06 Compare June 10, 2026 03:29

Samyak2 approved these changes Jun 10, 2026

Samyak2 mentioned this pull request Jun 10, 2026

Hash aggregation produces batches reporting huge memory size #22526

Open

ariel-miculas approved these changes Jun 10, 2026

jordepic force-pushed the fix-hash-join-build-side-shared-buffer-accounting branch from 6443c06 to 783414b Compare June 10, 2026 11:09

ariel-miculas reviewed Jun 10, 2026

4 participants
