Skip loading Parquet page index when row-group statistics already prove it cannot prune by RatulDawar · Pull Request #22857 · apache/datafusion

RatulDawar · 2026-06-09T18:00:34Z

Which issue does this PR close?

Closes #22795 (Skip loading the Parquet page index when row-group statistics already prove it cannot prune).

Rationale for this change

The Parquet opener was loading the page index (ColumnIndex + OffsetIndex) before row-group statistics pruning. When all surviving row groups are fully matched by row-group statistics (for example, IS NOT NULL on a non-null column), page index I/O cannot prune further and is wasted.

What changes are included in this PR?

Reorder the opener state machine: PrepareFilters → PruneWithStatistics → LoadPageIndex? → LoadBloomFilters
Skip load_page_index when there is no page-pruning predicate, no surviving row groups, or every surviving row group is fully matched
Add unit and integration tests for the gate and the fully-matched IS NOT NULL case
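The reordered flow and the skip conditions listed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `OpenerState`, `PruneOutcome`, and `next_state` are hypothetical stand-ins, and only the state names and the three skip conditions come from the description.

```rust
/// Hypothetical model of the opener states named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpenerState {
    PrepareFilters,
    PruneWithStatistics,
    LoadPageIndex,
    LoadBloomFilters,
}

/// Hypothetical summary of what row-group statistics pruning left behind.
struct PruneOutcome {
    has_page_pruning_predicate: bool,
    surviving_row_groups: usize,
    fully_matched_row_groups: usize,
}

/// Returns true only when page index I/O could still prune something.
fn should_load_page_index(outcome: &PruneOutcome) -> bool {
    if !outcome.has_page_pruning_predicate {
        return false; // nothing would consume the page index
    }
    if outcome.surviving_row_groups == 0 {
        return false; // nothing left to prune
    }
    // At least one surviving row group is not proven to match entirely.
    outcome.fully_matched_row_groups < outcome.surviving_row_groups
}

/// Advances the state machine; LoadPageIndex is optional and only entered
/// after statistics pruning has run.
fn next_state(state: OpenerState, outcome: &PruneOutcome) -> Option<OpenerState> {
    match state {
        OpenerState::PrepareFilters => Some(OpenerState::PruneWithStatistics),
        OpenerState::PruneWithStatistics if should_load_page_index(outcome) => {
            Some(OpenerState::LoadPageIndex)
        }
        OpenerState::PruneWithStatistics => Some(OpenerState::LoadBloomFilters),
        OpenerState::LoadPageIndex => Some(OpenerState::LoadBloomFilters),
        OpenerState::LoadBloomFilters => None,
    }
}
```

In this model, a query such as IS NOT NULL on a non-null column would produce an outcome where every surviving row group is fully matched, so the machine goes straight from PruneWithStatistics to LoadBloomFilters.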

Are these changes tested?

cargo test -p datafusion-datasource-parquet should_load
cargo test -p datafusion-datasource-parquet page_index_skip
cargo test -p datafusion-datasource-parquet opener::test::test_page_pruning
cargo test -p datafusion --test parquet_integration
cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings

Are there any user-facing changes?

No user-facing API changes. This reduces unnecessary Parquet page index I/O during scan planning when row-group statistics already prove no further pruning is possible.

Made with Cursor


RatulDawar · 2026-06-09T18:14:03Z

A related question came up while implementing this: can we skip page index I/O per row group (e.g. load the index only for RGs that aren't fully matched)?

I checked arrow-rs (parquet 58.3.0 and latest main), but per-RG page index APIs don't seem to be available. We could take that implementation as a next step after this PR, though I'm not sure how beneficial a per-RG page index skip would be.

kosiew

@RatulDawar
Thanks for working on this. I think the opener invariant is the right direction, but there is still one cached-reader path where the skip policy can be bypassed before row-group pruning gets a chance to run. I think we should tighten that up before merging.

kosiew · 2026-06-10T04:58:17Z

        // unnecessary I/O. We decide later if it is needed to evaluate the
        // pruning predicates. Thus default to not requesting it from the
        // underlying reader.
        let mut options =


I think this still needs one more fix in the default cached-reader path.

This change relies on the initial metadata load honoring PageIndexPolicy::Skip, but ArrowReaderMetadata::load_async(...) can still call CachedParquetFileReader::get_metadata(). That path ignores the passed ArrowReaderOptions page-index policy and calls DFParquetMetadata::fetch_metadata() with a metadata cache. From there, the metadata layer forces PageIndexPolicy::Optional whenever a metadata cache exists.

The end result is that the opener can still load ColumnIndex and OffsetIndex during metadata loading, before should_load_page_index() gets a chance to skip it for fully matched row groups.

Could you please make this opener invariant hold end to end by threading or respecting the requested skip policy through the cached reader and metadata cache path? Another workable approach would be to prevent eager page-index fetching until after row-group pruning. It would also be good to add coverage using the default ParquetSource cached-reader path.
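One way to picture "respecting the requested skip policy through the cache" is the toy model below. It is only a sketch under loud assumptions: `MetadataCache`, `Metadata`, and the local `PageIndexPolicy` enum are hypothetical stand-ins written for this illustration, not the DataFusion or arrow-rs types, and the real fix may look quite different.

```rust
use std::collections::HashMap;

/// Local stand-in for the page-index policy (hypothetical, two variants only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PageIndexPolicy {
    Skip,
    Optional,
}

/// Hypothetical cached metadata; only tracks whether the page index is present.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Metadata {
    has_page_index: bool,
}

/// Hypothetical metadata cache that honors the caller's policy instead of
/// forcing the page index to be loaded whenever a cache exists.
#[derive(Default)]
struct MetadataCache {
    entries: HashMap<String, Metadata>,
    /// Number of simulated page-index reads, for assertions in tests.
    page_index_reads: usize,
}

impl MetadataCache {
    fn fetch(&mut self, path: &str, policy: PageIndexPolicy) -> Metadata {
        let want_index = policy != PageIndexPolicy::Skip;
        if let Some(cached) = self.entries.get(path) {
            // A cached entry is usable if it already has the index,
            // or if the caller did not ask for the index at all.
            if cached.has_page_index || !want_index {
                return cached.clone();
            }
        }
        if want_index {
            self.page_index_reads += 1; // simulated ColumnIndex + OffsetIndex I/O
        }
        let loaded = Metadata { has_page_index: want_index };
        self.entries.insert(path.to_string(), loaded.clone());
        loaded
    }
}
```

The design choice illustrated here is that a Skip request never triggers page-index I/O, while a later Optional request upgrades the cached entry once, after row-group pruning has decided the index is needed.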

kosiew · 2026-06-10T04:59:26Z

+
+        let (_, rows) =
+            count_batches_and_rows(open_file(&morselizer, file).await.unwrap()).await;
+        assert_eq!(rows, 100);


Nice to have: this regression test currently only checks the row count, which would still pass even if the page index were loaded and evaluated.

After the cached-reader path is fixed, could we assert the invariant more directly? For example, the test could use a counting reader or object store that records or fails on page-index range reads, or it could assert a metric or state showing that LoadPageIndex was skipped. That would make the test much better at catching future reorderings that accidentally bring the extra I/O back.
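A counting reader of the kind suggested above might look something like this. The sketch is deliberately detached from the real reader traits: `CountingReader`, `get_bytes`, and `reads_overlapping` are hypothetical names, and the byte ranges are made up for illustration.

```rust
use std::ops::Range;

/// Hypothetical test double that records every byte range requested.
#[derive(Default)]
struct CountingReader {
    requested: Vec<Range<u64>>,
}

impl CountingReader {
    /// Simulates a range read; a real double would also return the bytes.
    fn get_bytes(&mut self, range: Range<u64>) {
        self.requested.push(range);
    }

    /// Counts recorded reads that overlap `region`, for example the file
    /// region holding the ColumnIndex and OffsetIndex structures.
    fn reads_overlapping(&self, region: &Range<u64>) -> usize {
        self.requested
            .iter()
            .filter(|r| r.start < region.end && region.start < r.end)
            .count()
    }
}
```

A regression test could then run the scan through such a reader and assert that `reads_overlapping(&page_index_region)` is zero for the fully matched case, in addition to checking the row count.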

kosiew · 2026-06-10T05:00:03Z

+        return false;
+    }
+
+    let is_fully_matched = row_groups.is_fully_matched();


Small cleanup suggestion: this helper could encode the invariant a bit more directly with is_some_and plus any, which avoids the early return and the double-negative !all(...).

    page_pruning_predicate.is_some_and(|_| {
        let fully_matched = row_groups.is_fully_matched();
        row_groups
            .row_group_indexes()
            .any(|idx| !fully_matched[idx])
    })
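The two formulations should agree by De Morgan's law, including for an empty set of surviving row groups (`any` over nothing is false, and `!all` over nothing is also false). The sketch below checks that with a hypothetical `RowGroups` stand-in; the `!all(...)` variant is a reconstruction of the shape the comment refers to, not a copy of the code under review.

```rust
/// Hypothetical stand-in for the opener's row-group access plan.
struct RowGroups {
    /// Indexes of row groups that survived statistics pruning.
    surviving: Vec<usize>,
    /// fully_matched[i] is true when statistics prove every row in group i matches.
    fully_matched: Vec<bool>,
}

impl RowGroups {
    fn row_group_indexes(&self) -> impl Iterator<Item = usize> + '_ {
        self.surviving.iter().copied()
    }

    fn is_fully_matched(&self) -> &[bool] {
        &self.fully_matched
    }
}

/// Suggested shape: `is_some_and` plus `any`, no early return.
fn gate_with_any<P>(predicate: Option<&P>, row_groups: &RowGroups) -> bool {
    predicate.is_some_and(|_| {
        let fully_matched = row_groups.is_fully_matched();
        row_groups
            .row_group_indexes()
            .any(|idx| !fully_matched[idx])
    })
}

/// Assumed earlier shape: early return followed by a negated `all`.
fn gate_with_not_all<P>(predicate: Option<&P>, row_groups: &RowGroups) -> bool {
    if predicate.is_none() {
        return false;
    }
    let fully_matched = row_groups.is_fully_matched();
    !row_groups
        .row_group_indexes()
        .all(|idx| fully_matched[idx])
}
```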

RatulDawar and others added 3 commits June 9, 2026 00:39

Skip loading Parquet page index when row-group stats prove it cannot …

1845d8f

…prune. Reorder the opener so row-group statistics pruning runs before the page index load, and skip that I/O when every surviving row group is fully matched. Co-authored-by: Cursor <cursoragent@cursor.com>

Combine early-return conditions in should_load_page_index.

6323423

Co-authored-by: Cursor <cursoragent@cursor.com>

Document when page index loading is skipped in the opener.

e256fc5

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot added the datasource Changes to the datasource crate label Jun 9, 2026

Merge upstream/main into fix/skip-page-index-when-fully-matched.

c4efcd7

Resolve opener test conflicts after upstream moved opener.rs to opener/mod.rs. Co-authored-by: Cursor <cursoragent@cursor.com>

kosiew requested changes Jun 10, 2026

RatulDawar wants to merge 4 commits into apache:main from RatulDawar:fix/skip-page-index-when-fully-matched
