Skip to content

parquet: cache per-file ArrowReaderMetadata for wide-schema scans#22830

Draft
adriangb wants to merge 1 commit into
apache:mainfrom
pydantic:adrian/wide-schema-arrow-metadata-cache
Draft

parquet: cache per-file ArrowReaderMetadata for wide-schema scans#22830
adriangb wants to merge 1 commit into
apache:mainfrom
pydantic:adrian/wide-schema-arrow-metadata-cache

Conversation

@adriangb

@adriangb adriangb commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Draft — blocked, not yet buildable. See "Status" below. Opened for
visibility of the design while the dependency lands.

Which issue does this PR close?

Part of the wide-schema parquet read performance work in #21968.

Rationale for this change

Building an ArrowReaderMetadata walks every leaf of the parquet schema
to produce the arrow Schema + dremel field levels. For wide-schema
files (hundreds/thousands of columns) this O(N_columns) walk runs on
every file open, once per query, even though the result is identical
across queries for the same file. It is one of the larger remaining
per-file costs identified in #21968.

What changes are included in this PR?

Cache the built ArrowReaderMetadata on CachedParquetMetaData (which
already lives in the file metadata cache):

  • arrow_reader_metadata() lazily builds the base metadata via
    parquet_to_arrow_schema_and_field_levels + from_field_levels and
    memoises it in a OnceLock; warm hits are a cheap Arc-bump clone.
  • coerced_arrow_reader_metadata() memoises a single post-coercion
    build keyed by the supplied schema's Arc identity.
  • CachedParquetFileReader overrides
    AsyncFileReader::get_arrow_reader_metadata to serve both from cache.

Status / dependencies

This depends on the arrow-rs primitives in apache/arrow-rs#9882
(from_field_levels, parquet_to_arrow_schema_and_field_levels,
ArrowReaderOptions accessors, AsyncFileReader::get_arrow_reader_metadata).

It is not buildable yet: DataFusion main pins arrow 58.3.0 while
those primitives are on arrow-rs main (59.0.0), so no Cargo [patch]
can satisfy ^58.3.0. CI will be red until the arrow-rs changes ship in
a release DataFusion bumps to (then the [patch]/version wiring is added
here). Kept as a draft until then.

The arrow-rs-independent pieces of the #21968 investigation
(memory_size caching, coercion early-return) are split out into a
separately-mergeable PR: #22829.

Are these changes tested?

Not yet (blocked on building, see above). Tests will be added once the
dependency is available.

Are there any user-facing changes?

No public API changes — CachedParquetMetaData gains internal caching
fields and methods.

Building an `ArrowReaderMetadata` walks every leaf of the parquet schema
to produce the arrow `Schema` + dremel field levels. For wide-schema
files this O(N_columns) walk runs on every file open, once per query,
even though the result is identical across queries for the same file.

Cache it on `CachedParquetMetaData` (which already lives in the file
metadata cache):

- `arrow_reader_metadata()` lazily builds the base
  `ArrowReaderMetadata` via `parquet_to_arrow_schema_and_field_levels` +
  `ArrowReaderMetadata::from_field_levels` and memoises it in a
  `OnceLock`; warm hits are a cheap `Arc`-bump clone.
- `coerced_arrow_reader_metadata()` memoises a single post-coercion
  build keyed by the supplied schema's `Arc` identity (the common case:
  every file coerces to the same table schema).
- `CachedParquetFileReader` overrides
  `AsyncFileReader::get_arrow_reader_metadata` to serve both from cache,
  falling back to a fetch + `try_new` on miss.

Depends on the arrow-rs primitives in apache/arrow-rs#9882
(`from_field_levels`, `parquet_to_arrow_schema_and_field_levels`,
`ArrowReaderOptions` accessors, `AsyncFileReader::get_arrow_reader_metadata`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-datasource-parquet v53.1.0 (current)
error: running cargo-doc on crate 'datafusion-datasource-parquet' failed with output:
-----
   Compiling proc-macro2 v1.0.106
   Compiling quote v1.0.45
   Compiling unicode-ident v1.0.24
   Compiling libc v0.2.186
    Checking cfg-if v1.0.4
   Compiling autocfg v1.5.1
   Compiling libm v0.2.16
   Compiling num-traits v0.2.19
    Checking memchr v2.8.1
   Compiling zerocopy v0.8.50
   Compiling syn v2.0.117
    Checking bytes v1.11.1
    Checking equivalent v1.0.2
    Checking allocator-api2 v0.2.21
    Checking itoa v1.0.18
   Compiling serde_core v1.0.228
    Checking foldhash v0.2.0
    Checking once_cell v1.21.4
   Compiling getrandom v0.3.4
    Checking hashbrown v0.17.1
   Compiling zmij v1.0.21
   Compiling serde_json v1.0.150
    Checking num-integer v0.1.46
    Checking indexmap v2.14.0
    Checking iana-time-zone v0.1.65
    Checking stable_deref_trait v1.2.1
    Checking siphasher v1.0.3
   Compiling version_check v0.9.5
    Checking phf_shared v0.12.1
   Compiling ahash v0.8.12
    Checking chrono v0.4.45
    Checking num-bigint v0.4.6
   Compiling chrono-tz v0.10.4
    Checking phf v0.12.1
   Compiling shlex v2.0.1
   Compiling find-msvc-tools v0.1.9
   Compiling synstructure v0.13.2
   Compiling jobserver v0.1.34
    Checking arrow-schema v58.3.0
   Compiling cc v1.2.63
    Checking num-complex v0.4.6
    Checking lexical-util v1.0.7
    Checking smallvec v1.15.1
   Compiling zerocopy-derive v0.8.50
   Compiling zerofrom-derive v0.1.7
   Compiling yoke-derive v0.8.2
    Checking zerofrom v0.1.8
   Compiling zerovec-derive v0.11.3
    Checking yoke v0.8.3
   Compiling displaydoc v0.2.6
    Checking litemap v0.8.2
    Checking writeable v0.6.3
    Checking zerovec v0.11.6
   Compiling pkg-config v0.3.33
    Checking tinystr v0.8.3
   Compiling zstd-sys v2.0.16+zstd.1.5.7
    Checking icu_locale_core v2.2.0
    Checking potential_utf v0.1.5
    Checking zerotrie v0.2.4
   Compiling icu_normalizer_data v2.2.0
    Checking utf8_iter v1.0.4
   Compiling icu_properties_data v2.2.0
    Checking pin-project-lite v0.2.17
    Checking icu_collections v2.2.0
    Checking icu_provider v2.2.0
   Compiling semver v1.0.28
    Checking futures-core v0.3.32
    Checking futures-sink v0.3.32
    Checking futures-channel v0.3.32
   Compiling rustc_version v0.4.1
   Compiling futures-macro v0.3.32
    Checking lexical-parse-integer v1.0.6
    Checking lexical-write-integer v1.0.6
   Compiling parking_lot_core v0.9.12
   Compiling zstd-safe v7.2.4
    Checking bitflags v2.13.0
    Checking slab v0.4.12
    Checking futures-io v0.3.32
    Checking futures-task v0.3.32
    Checking lexical-write-float v1.0.6
    Checking futures-util v0.3.32
    Checking lexical-parse-float v1.0.6
    Checking half v2.7.1
   Compiling flatbuffers v25.12.19
    Checking arrow-buffer v58.3.0
    Checking icu_properties v2.2.0
    Checking arrow-data v58.3.0
    Checking icu_normalizer v2.2.0
    Checking arrow-array v58.3.0
    Checking aho-corasick v1.1.4
    Checking base64 v0.22.1
    Checking unicode-segmentation v1.13.3
    Checking ryu v1.0.23
    Checking regex-syntax v0.8.10
    Checking scopeguard v1.2.0
    Checking unicode-width v0.2.2
    Checking comfy-table v7.2.2
    Checking lock_api v0.4.14
    Checking idna_adapter v1.2.2
    Checking lexical-core v1.0.6
   Compiling tokio-macros v2.7.0
    Checking arrow-select v58.3.0
    Checking regex-automata v0.4.14
    Checking atoi v2.0.0
    Checking percent-encoding v2.3.2
   Compiling thiserror v2.0.18
    Checking alloc-no-stdlib v2.0.4
    Checking twox-hash v2.1.2
    Checking arrow-ord v58.3.0
   Compiling getrandom v0.4.2
    Checking lz4_flex v0.13.1
    Checking alloc-stdlib v0.2.2
    Checking arrow-cast v58.3.0
    Checking form_urlencoded v1.2.2
    Checking tokio v1.52.3
    Checking regex v1.12.3
    Checking idna v1.1.0
    Checking futures-executor v0.3.32
   Compiling ring v0.17.14
   Compiling thiserror-impl v2.0.18
   Compiling tracing-attributes v0.1.31
    Checking tracing-core v0.1.36
    Checking csv-core v0.1.13
    Checking same-file v1.0.6
   Compiling paste v1.0.15
    Checking simdutf8 v0.1.5
   Compiling snap v1.1.1
    Checking either v1.16.0
    Checking tracing v0.1.44
    Checking walkdir v2.5.0
    Checking itertools v0.14.0
    Checking csv v1.4.0
    Checking url v2.5.8
    Checking futures v0.3.32
    Checking brotli-decompressor v5.0.1
    Checking parking_lot v0.12.5
   Compiling async-trait v0.1.89
    Checking ordered-float v2.10.1
    Checking http v1.4.2
    Checking getrandom v0.2.17
    Checking integer-encoding v3.0.4
    Checking zlib-rs v0.6.3
    Checking humantime v2.3.0
    Checking byteorder v1.5.0
    Checking untrusted v0.9.0
    Checking thrift v0.17.0
    Checking object_store v0.13.2
    Checking flate2 v1.1.9
    Checking brotli v8.0.3
    Checking zstd v0.13.3
    Checking arrow-ipc v58.3.0
    Checking arrow-csv v58.3.0
    Checking arrow-json v58.3.0
    Checking arrow-string v58.3.0
    Checking arrow-row v58.3.0
    Checking arrow-arith v58.3.0
   Compiling seq-macro v0.3.6
    Checking log v0.4.32
    Checking uuid v1.23.2
    Checking arrow v58.3.0
    Checking hex v0.4.3
   Compiling pin-project-internal v1.1.13
   Compiling crossbeam-utils v0.8.21
   Compiling rustix v1.1.4
    Checking ppv-lite86 v0.2.21
    Checking parquet v58.3.0
    Checking rand_core v0.9.5
    Checking linux-raw-sys v0.12.1
    Checking datafusion-doc v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/doc)
    Checking rand_chacha v0.9.0
    Checking pin-project v1.1.13
    Checking hashbrown v0.14.5
    Checking foldhash v0.1.5
    Checking fastrand v2.4.1
    Checking hashbrown v0.15.5
    Checking dashmap v6.2.1
    Checking rand v0.9.4
    Checking fixedbitset v0.5.7
   Compiling datafusion-macros v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/macros)
    Checking petgraph v0.8.3
    Checking tempfile v3.27.0
    Checking datafusion-common-runtime v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/common-runtime)
    Checking glob v0.3.3
    Checking datafusion-common v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/common)
    Checking datafusion-expr-common v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/expr-common)
    Checking datafusion-physical-expr-common v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/physical-expr-common)
    Checking datafusion-functions-aggregate-common v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/functions-aggregate-common)
    Checking datafusion-functions-window-common v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/functions-window-common)
    Checking datafusion-expr v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/expr)
    Checking datafusion-execution v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/execution)
    Checking datafusion-physical-expr v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/physical-expr)
    Checking datafusion-functions v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/functions)
    Checking datafusion-physical-plan v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/physical-plan)
    Checking datafusion-physical-expr-adapter v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/physical-expr-adapter)
    Checking datafusion-session v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/session)
    Checking datafusion-datasource v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/datasource)
    Checking datafusion-pruning v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/pruning)
 Documenting datafusion-datasource-parquet v53.1.0 (/home/runner/work/datafusion/datafusion/datafusion/datasource-parquet)
error[E0432]: unresolved import `parquet::arrow::parquet_to_arrow_schema_and_field_levels`
  --> /home/runner/work/datafusion/datafusion/datafusion/datasource-parquet/src/metadata.rs:46:5
   |
46 |     parquet_to_arrow_schema_and_field_levels,
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ no `parquet_to_arrow_schema_and_field_levels` in `arrow`
   |
help: a similar name exists in the module
   |
46 -     parquet_to_arrow_schema_and_field_levels,
46 +     parquet_to_arrow_field_levels,
   |

error[E0407]: method `get_arrow_reader_metadata` is not a member of trait `AsyncFileReader`
   --> /home/runner/work/datafusion/datafusion/datafusion/datasource-parquet/src/reader.rs:322:5
    |
322 | /     fn get_arrow_reader_metadata<'a>(
323 | |         &'a mut self,
324 | |         options: ArrowReaderOptions,
325 | |     ) -> BoxFuture<'a, parquet::errors::Result<ArrowReaderMetadata>> {
...   |
410 | |         .boxed()
411 | |     }
    | |_____^ not a member of trait `AsyncFileReader`

Some errors have detailed explanations: E0407, E0432.
For more information about an error, try `rustc --explain E0407`.
error: could not document `datafusion-datasource-parquet`

-----

error: failed to build rustdoc for crate datafusion-datasource-parquet v53.1.0
note: this is usually due to a compilation error in the crate,
      and is unlikely to be a bug in cargo-semver-checks
note: the following command can be used to reproduce the error:
      cargo new --lib example &&
          cd example &&
          echo '[workspace]' >> Cargo.toml &&
          cargo add --path /home/runner/work/datafusion/datafusion/datafusion/datasource-parquet --features parquet_encryption &&
          cargo check &&
          cargo doc

error: aborting due to failure to build rustdoc for crate datafusion-datasource-parquet v53.1.0

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant