Skip to content

Rich t kid/multi dictionary benchmarks#22859

Open
Rich-T-kid wants to merge 2 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/Multi-dictionary-benchmarks
Open

Rich t kid/multi dictionary benchmarks#22859
Rich-T-kid wants to merge 2 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/Multi-dictionary-benchmarks

Conversation

@Rich-T-kid

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Works towards closing #21878

Rationale for this change

There is currently no benchmarks that cover GroupValueRow() when dictionary arrays are passed in. #22004 adds benchmarks for the single single column case. This PR adds benchmarks for the multi column case

What changes are included in this PR?

Adds a new Criterion harness for GroupValues over multi-column Dictionary<UInt64, Utf8> GROUP BY workloads, covering 4 and 8 group-by columns across batch sizes of 8 KiB and 64 KiB rows. It tests four realistic cardinalities (20, 100, 500, 1 000 distinct values) with per-column variance introduced by sampling each column's distinct count from a ±5% window around the target, so columns within the same batch have slightly different cardinalities. Two benchmark variants are included — repeated intern+emit and a partial-emit pattern that spills half the accumulated groups after each intern — to stress both the steady-state and incremental-flush paths of the group values implementation.

Are these changes tested?

n/a

Are there any user-facing changes?

No

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 9, 2026

@Rich-T-kid Rich-T-kid left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to take another review tomorrow before pinging a reviewer to take a look!

Arc::new(DictionaryArray::<UInt64Type>::try_new(key_array, values).unwrap())
}

fn make_batch(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a value type as a parameter here. strings are currently covered, in the future it may make sense for other types to also be covered. I left that out for this PR but I can introduce it if needed.

/// each column's distinct count is sampled from [target*0.95, target*1.05].
const CARDINALITY_RANGE: usize = 10;

fn schema_for_cols(n_cols: usize) -> SchemaRef {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if/when we introduce other value data types, this function needs to change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize Hash Aggregation for Multiple Dictionary-Encoded Group Keys

1 participant