Skip loading Parquet page index when row-group statistics already prove it cannot prune#22857
Skip loading Parquet page index when row-group statistics already prove it cannot prune#22857RatulDawar wants to merge 4 commits into
Conversation
…prune. Reorder the opener so row-group statistics pruning runs before the page index load, and skip that I/O when every surviving row group is fully matched. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve opener test conflicts after upstream moved opener.rs to opener/mod.rs. Co-authored-by: Cursor <cursoragent@cursor.com>
|
A related question came up while implementing this, can we skip page index I/O per row group (e.g. load index only for RGs that aren't fully matched)? I checked arrow-rs (parquet 58.3.0 + latest main), but per RG page index apis doesn't seem to be avalible. We can take that implementation as a next setp to this(not sure though per page index skip would be that much beneficial or not). |
kosiew
left a comment
There was a problem hiding this comment.
@RatulDawar
Thanks for working on this. I think the opener invariant is the right direction, but there
is still one cached-reader path where the skip policy can be bypassed before row-group pruning gets a chance to run. I think we should tighten that up before merging.
| // unnecessary I/O. We decide later if it is needed to evaluate the | ||
| // pruning predicates. Thus default to not requesting it from the | ||
| // underlying reader. | ||
| let mut options = |
There was a problem hiding this comment.
I think this still needs one more fix in the default cached-reader path.
This change relies on the initial metadata load honoring PageIndexPolicy::Skip, but ArrowReaderMetadata::load_async(...) can still call CachedParquetFileReader::get_metadata(). That path ignores the passed ArrowReaderOptions page-index policy and calls DFParquetMetadata::fetch_metadata() with a metadata cache. From there, the metadata layer forces PageIndexPolicy::Optional whenever a metadata cache exists.
The end result is that the opener can still load ColumnIndex and OffsetIndex during metadata loading, before should_load_page_index() gets a chance to skip it for fully matched row groups.
Could you please make this opener invariant hold end to end by threading or respecting the requested skip policy through the cached reader and metadata cache path? Another workable approach would be to prevent eager page-index fetching until after row-group pruning. It would also be good to add coverage using the default ParquetSource cached-reader path.
|
|
||
| let (_, rows) = | ||
| count_batches_and_rows(open_file(&morselizer, file).await.unwrap()).await; | ||
| assert_eq!(rows, 100); |
There was a problem hiding this comment.
Nice to have: this regression test currently only checks the row count, which would still pass even if the page index were loaded and evaluated.
After the cached-reader path is fixed, could we assert the invariant more directly? For example, the test could use a counting reader or object store that records or fails on page-index range reads, or it could assert a metric or state showing that LoadPageIndex was skipped. That would make the test much better at catching future reorderings that accidentally bring the extra I/O back.
| return false; | ||
| } | ||
|
|
||
| let is_fully_matched = row_groups.is_fully_matched(); |
There was a problem hiding this comment.
Small cleanup suggestion: this helper could encode the invariant a bit more directly with is_some_and plus any, which avoids the early return and the double-negative !all(...).
page_pruning_predicate.is_some_and(|_| {
let fully_matched = row_groups.is_fully_matched();
row_groups
.row_group_indexes()
.any(|idx| !fully_matched[idx])
})
Which issue does this PR close?
Rationale for this change
The Parquet opener was loading the page index (ColumnIndex + OffsetIndex) before row-group statistics pruning. When all surviving row groups are fully matched by row-group statistics (for example,
IS NOT NULLon a non-null column), page index I/O cannot prune further and is wasted.What changes are included in this PR?
PrepareFilters → PruneWithStatistics → LoadPageIndex? → LoadBloomFiltersload_page_indexwhen there is no page-pruning predicate, no surviving row groups, or every surviving row group is fully matchedIS NOT NULLcaseAre these changes tested?
cargo test -p datafusion-datasource-parquet should_loadcargo test -p datafusion-datasource-parquet page_index_skipcargo test -p datafusion-datasource-parquet opener::test::test_page_pruningcargo test -p datafusion --test parquet_integrationcargo clippy -p datafusion-datasource-parquet --all-targets -- -D warningsAre there any user-facing changes?
No user-facing API changes. This reduces unnecessary Parquet page index I/O during scan planning when row-group statistics already prove no further pruning is possible.
Made with Cursor