parquet: Allow setting parquet rowgroup row count limit #16317

Open
sqd wants to merge 1 commit into apache:main from sqd:oss_parquet_rowgroup_row_limit

@sqd sqd commented May 13, 2026

parquet-java (parquet-mr) supports a per-row-group row count limit (config key parquet.block.row.count.limit), but Iceberg does not yet expose a way to set it. This commit wires it up.

Parquet.WriteBuilder has two terminal paths. When callers supply a createWriterFunc (Iceberg's ParquetValueWriter, the path every production engine integration uses, such as Spark rewrite_data_files), the build returns Iceberg's own ParquetWriter, which manages the row-group lifecycle itself and ignores parquet-mr's auto-roll. When callers supply a WriteSupport instead, the build delegates to parquet-mr's ParquetWriter, which enforces row-group limits internally.
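
For readers unfamiliar with the builder, the two terminal paths can be pictured with the minimal, self-contained sketch below. It is a toy model, not Iceberg code: WritePathSketch, Path, and choosePath are hypothetical names, and the handling of the "both supplied" and "neither supplied" cases is an assumption of this sketch rather than confirmed builder behaviour.

```java
import java.util.function.Function;

/** Toy model of the two terminal paths; every name here is hypothetical. */
final class WritePathSketch {
  enum Path { ICEBERG_WRITER, PARQUET_MR_WRITER }

  /**
   * Picks a path from what the caller supplied.
   * createWriterFunc stands in for the function that builds a value writer;
   * writeSupport stands in for a parquet-mr WriteSupport instance.
   */
  static Path choosePath(Function<String, Object> createWriterFunc, Object writeSupport) {
    if (createWriterFunc != null && writeSupport != null) {
      // Assumption of this sketch: supplying both is treated as a caller error.
      throw new IllegalArgumentException("supply either createWriterFunc or writeSupport, not both");
    }
    if (createWriterFunc != null) {
      // This writer rolls row groups itself, so it must check the row limit explicitly.
      return Path.ICEBERG_WRITER;
    }
    if (writeSupport != null) {
      // The limit is handed to parquet-mr, which enforces it internally.
      return Path.PARQUET_MR_WRITER;
    }
    // Assumption of this sketch: supplying neither is treated as a caller error.
    throw new IllegalArgumentException("no writer function or write support supplied");
  }

  public static void main(String[] args) {
    System.out.println(choosePath(schema -> new Object(), null)); // prints ICEBERG_WRITER
    System.out.println(choosePath(null, new Object()));           // prints PARQUET_MR_WRITER
  }
}
```

The point of the model is only that a limit enforced inside parquet-mr never takes effect on the first path, which is why the property needs separate handling there.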

The new property has to be wired into both paths. On the Iceberg path it is carried via ParquetProperties and enforced by an explicit recordCount check in ParquetWriter.checkSize(). On the parquet-mr path it is passed through ParquetWriteBuilder.withRowGroupRowCountLimit() and enforced by parquet-mr itself.
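
The effect of an explicit recordCount check can be illustrated with the simplified model below. It is a sketch under stated assumptions, not the code in this PR: RowGroupRollSketch and its fields are hypothetical, record sizes are passed in by the caller instead of being measured, and the check runs on every record, whereas a real writer may check less often.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a writer that rolls row groups on a byte target or a row limit. */
final class RowGroupRollSketch {
  private final long targetRowGroupBytes;
  private final int rowCountLimit;
  private final List<Integer> flushedRowCounts = new ArrayList<>();
  private int recordCount = 0;
  private long bufferedBytes = 0;

  RowGroupRollSketch(long targetRowGroupBytes, int rowCountLimit) {
    this.targetRowGroupBytes = targetRowGroupBytes;
    this.rowCountLimit = rowCountLimit;
  }

  /** Buffers one record of the given (assumed) size, then checks whether to roll. */
  void add(long recordBytes) {
    recordCount++;
    bufferedBytes += recordBytes;
    checkSize();
  }

  /** Rolls the row group when either the row limit or the byte target is reached. */
  private void checkSize() {
    if (recordCount >= rowCountLimit || bufferedBytes >= targetRowGroupBytes) {
      flushRowGroup();
    }
  }

  /** Records the size of the finished row group and resets the buffers. */
  private void flushRowGroup() {
    if (recordCount > 0) {
      flushedRowCounts.add(recordCount);
      recordCount = 0;
      bufferedBytes = 0;
    }
  }

  /** Flushes any partial row group and returns the row count of each group written. */
  List<Integer> close() {
    flushRowGroup();
    return List.copyOf(flushedRowCounts);
  }

  /** Writes `records` equally sized records and returns the resulting row-group sizes. */
  static List<Integer> simulate(int records, long bytesPerRecord, long targetBytes, int rowLimit) {
    RowGroupRollSketch writer = new RowGroupRollSketch(targetBytes, rowLimit);
    for (int i = 0; i < records; i++) {
      writer.add(bytesPerRecord);
    }
    return writer.close();
  }

  public static void main(String[] args) {
    // Row limit of 3 with a byte target that is never reached.
    System.out.println(simulate(10, 100, 1_000_000, 3)); // prints [3, 3, 3, 1]
    // Byte target of 400 with a row limit that is never reached.
    System.out.println(simulate(10, 100, 400, 1000));    // prints [4, 4, 2]
  }
}
```

With small records the byte target alone can let a row group grow to a very large number of rows; the row count check caps that independently of size.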

@sqd sqd changed the title Allow setting parquet rowgroup row count limit parquet: Allow setting parquet rowgroup row count limit May 13, 2026