
Enhancement proposal: empty-chunk-aware read path (array.prefetch_populated_keys) #3929

@espg

Description


tldr

For sparse arrays where most chunks resolve to fill_value, the current read path spends most of its time iterating empty chunks. Empirical numbers from a 1D HEALPix array with 49,152 chunks of which ~1,300 are populated (zarr 3.1.5):

  Path                       Wall time    Speedup
  ------------------------------------------------
  arr[:] (current)            173.77 s         1×
  fast_read_sparse recipe       2.73 s        64×

Background

I've been working on an aggregator called zagg which takes large, out-of-memory point datasets as input, and then aggregates them to a grid; the 'z' in zagg stands for 'zarr', since we write out to zarr. Given that the input datasets are large, we split up the aggregation and assign one worker per write chunk/shard target.

We're iterating on a refactor to support arbitrary output grids: regular rectilinear grids in whatever projection is specified (Mercator, polar stereographic, etc.), as well as discrete global grid systems like HEALPix, H3, or S2. For the discrete global grid systems, we write to a chunk index that matches the global grid index; these indices scale quadratically with grid cell resolution. For us, that means that even a large geographic extent (e.g., all of Antarctica) will only populate a small portion of the full index space.
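To make the quadratic scaling concrete: HEALPix has 12 × nside² cells, so (assuming one chunk per cell, as described above) the 49,152-chunk example in the tldr corresponds to nside = 64, and each doubling of resolution quadruples the index space:

```python
# Arithmetic check of HEALPix index-space growth (12 * nside**2 cells).
# The one-chunk-per-cell mapping is this issue's setup; the formula is
# the standard HEALPix cell count.
def healpix_cells(nside: int) -> int:
    return 12 * nside ** 2

assert healpix_cells(64) == 49_152          # the example array in this issue
assert healpix_cells(128) == 4 * healpix_cells(64)  # quadratic growth
```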

We pay no penalty for writing out these sparse HEALPix arrays to zarr; we're not writing any actual data to the empty chunks, and the metadata write is lightweight. However, sparse reads are slow: right now, arr[:] iterates every chunk in the grid in Python and assigns fill_value per empty chunk, even when nothing is in the store. For the 49,152-chunk example at the top of this issue, ~150 s of the 173 s total wall time is exactly that loop, with zero I/O.

I'd like to propose a fast read path that avoids this by:

  1. issue a single store.list_prefix(...) call before reading,
  2. bulk-fill the output buffer with fill_value once, and
  3. read populated chunks on top of the fill values, skipping empty-chunk reads entirely.
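The three steps can be sketched with a plain dict standing in for the store (a simulation of the recipe, not the real zarr-python read path; the helper names here are illustrative):

```python
# Sketch of the proposed fast read path against a dict-backed "store"
# whose keys are the populated chunk indices.
CHUNK_LEN = 4
N_CHUNKS = 8
FILL_VALUE = 0

# Sparse store: only chunks 1 and 5 hold data.
store = {1: [7] * CHUNK_LEN, 5: [9] * CHUNK_LEN}

def naive_read(store):
    """Current behavior: visit every chunk, filling empties one by one."""
    out = []
    for i in range(N_CHUNKS):
        chunk = store.get(i)
        out.extend(chunk if chunk is not None else [FILL_VALUE] * CHUNK_LEN)
    return out

def fast_read_sparse(store):
    populated = set(store)                           # step 1: one listing call
    out = [FILL_VALUE] * (N_CHUNKS * CHUNK_LEN)      # step 2: bulk fill once
    for i in populated:                              # step 3: only real chunks
        out[i * CHUNK_LEN:(i + 1) * CHUNK_LEN] = store[i]
    return out

assert fast_read_sparse(store) == naive_read(store)
```

The key point is that the loop in `fast_read_sparse` runs over the populated set (here 2 chunks) rather than all N_CHUNKS, which is where the savings come from when the array is mostly empty.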

This is about a ~30-line fix, and there's a prototype with the timings already up that you can see here in cell 6 with the fast read path implemented, vs here without the fast read path (also cell 6). With only ~3% of the chunks populated, we get about a 64× speed improvement reading from S3; running on LocalStore and MemoryStore, we get smaller (but still significant) improvements when reading sparsely populated arrays:

Benchmarks

Reading the full array (arr[:]) at ~3% sparsity, sweeping chunk count. off is stock arr[:]; on is the same call with the proposed flag.

  store           n_chunks  populated     off (s)      on (s)    speedup
  ----------------------------------------------------------------------
  MemoryStore         1024         32      0.0472      0.0124       3.8x
  LocalStore          1024         32      0.2833      0.0210      13.5x
  MemoryStore         4096        128      0.2446      0.0385       6.4x
  LocalStore          4096        128      1.1863      0.1125      10.5x
  MemoryStore        16384        512      0.8203      0.2604       3.2x
  LocalStore         16384        512      5.3576      0.5213      10.3x
  MemoryStore        49152       1536      3.0066      0.7335       4.1x
  LocalStore         49152       1536     14.1174      1.3016      10.8x

(off is the existing baseline read path; on adds the prefetch scan for empty chunks.)

Impact / Proposed implementation

This would be added as an opt-in flag that makes Array.__getitem__ (and the underlying selection methods) issue a single store.list_prefix(...) call before reading, then skip the per-chunk store round-trip for chunks that are not present, filling those regions of the output with fill_value directly. Default off; no behavior change unless enabled. To conform to the zarr-python API, the change is on the order of ~150 lines of code (a full PR would be closer to 500 with benchmarks and unit tests).
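One detail the implementation needs is mapping listed store keys back to chunk indices. A minimal sketch, assuming zarr v3's default chunk key encoding (keys like "c/<i>" for a 1D array); a real implementation should consult the array metadata's chunk_key_encoding rather than hard-coding this layout:

```python
def populated_chunk_indices(keys):
    """Turn listed store keys into sorted 1D chunk indices.

    Assumes the zarr v3 default encoding "c/<i>"; non-chunk keys
    (e.g. "zarr.json") are ignored.
    """
    indices = set()
    for key in keys:
        parts = key.split("/")
        if len(parts) == 2 and parts[0] == "c" and parts[1].isdigit():
            indices.add(int(parts[1]))
    return sorted(indices)

# Example: a listing that mixes metadata and chunk keys.
assert populated_chunk_indices(["zarr.json", "c/3", "c/17", "c/3"]) == [3, 17]
```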

Would a PR along these lines be welcome? I have a working branch with tests and a benchmark and would file it pending feedback on naming, scope, API, etc.
