TL;DR
For sparse arrays where most chunks resolve to `fill_value`, the current read path spends most of its time iterating empty chunks. Empirical numbers from a 1D HEALPix array with 49,152 chunks, of which ~1,300 are populated (zarr 3.1.5):
| Path | Wall time | Speedup |
| --- | --- | --- |
| `arr[:]` (current) | 173.77 s | 1× |
| `fast_read_sparse` recipe | 2.73 s | 64× |
Background
I've been working on an aggregator called zagg which takes large, out-of-memory point datasets as input, and then aggregates them to a grid; the 'z' in zagg stands for 'zarr', since we write out to zarr. Given that the input datasets are large, we split up the aggregation and assign one worker per write chunk/shard target.
We're iterating on a refactor to support arbitrary output grids: regular rectilinear grids in whatever projection is specified (Mercator, polar stereographic, etc.), and also discrete global grid systems like HEALPix, H3, or S2. For the discrete global grid systems, we write to a chunk index that matches the global grid index; these indices scale quadratically with grid cell resolution (HEALPix, for example, has 12·nside² cells). For us, that means that even a large geographic extent (e.g., all of Antarctica) will only populate a small portion of the full index space.
We pay no penalty for writing out these sparse HEALPix arrays to zarr; we're not writing any actual data for the empty chunks, and the metadata write is lightweight. The sparse reads, however, are slow: right now, `arr[:]` iterates every chunk in the grid in Python and assigns `fill_value` per empty chunk, even when nothing is in the store. For the 49,152-chunk example at the top of this issue, ~150 s of the 173 s total wall time is exactly that loop, with zero I/O.
I'd like to propose a fast read path that avoids this (see the sketch after this list) by:
- issuing a single `store.list_prefix(...)` call before reading
- bulk-filling the output buffer with `fill_value` once
- reading populated chunks on top of the fill values (skipping the empty-chunk reads entirely)
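For reference, here's a minimal standalone version of the recipe. This is a sketch, not existing zarr API: it assumes zarr 3.x, a 1D array under the default v3 chunk key encoding (`<path>/c/<i>`), that `Store.list_prefix` yields full keys, and that `Array.store` / `Array.path` are available; `fast_read_sparse` is just my name for the recipe.

```python
import asyncio

import numpy as np
import zarr


async def _present_keys(store, prefix: str) -> set[str]:
    # One listing round-trip replaces one existence probe per chunk.
    return {key async for key in store.list_prefix(prefix)}


def fast_read_sparse(arr: zarr.Array) -> np.ndarray:
    # 1. Bulk-fill the output once, instead of assigning fill_value
    #    chunk-by-chunk in Python.
    out = np.full(arr.shape, arr.fill_value, dtype=arr.dtype)

    # 2. Single list_prefix call to learn which chunks actually exist.
    prefix = f"{arr.path}/c/" if arr.path else "c/"
    keys = asyncio.run(_present_keys(arr.store, prefix))

    # 3. Read only the populated chunks on top of the fill values.
    (chunk_len,) = arr.chunks  # 1D case, as in the HEALPix example
    n = arr.shape[0]
    for key in keys:
        i = int(key.rsplit("/", 1)[-1])  # "<path>/c/<i>" -> i
        start = i * chunk_len
        stop = min(start + chunk_len, n)
        out[start:stop] = arr[start:stop]
    return out
```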
This is about a ~30 line fix, and there's a prototype with the timings already up: see here (cell 6) with the fast read path implemented, vs here (also cell 6) without it. With only ~3% of the chunks populated, we get about a 64× speed improvement reading from S3; on LocalStore and MemoryStore, we see smaller (but still significant) improvements when reading sparsely filled arrays:
Benchmarks
Reading the full array (`arr[:]`) with ~3% of chunks populated, sweeping chunk count. `off` is the stock `arr[:]` baseline; `on` is the same call with the proposed flag, i.e. a single prefetch scan for missing chunks.
| store | n_chunks | populated | off (s) | on (s) | speedup |
| --- | --- | --- | --- | --- | --- |
| MemoryStore | 1024 | 32 | 0.0472 | 0.0124 | 3.8× |
| LocalStore | 1024 | 32 | 0.2833 | 0.0210 | 13.5× |
| MemoryStore | 4096 | 128 | 0.2446 | 0.0385 | 6.4× |
| LocalStore | 4096 | 128 | 1.1863 | 0.1125 | 10.5× |
| MemoryStore | 16384 | 512 | 0.8203 | 0.2604 | 3.2× |
| LocalStore | 16384 | 512 | 5.3576 | 0.5213 | 10.3× |
| MemoryStore | 49152 | 1536 | 3.0066 | 0.7335 | 4.1× |
| LocalStore | 49152 | 1536 | 14.1174 | 1.3016 | 10.8× |
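If you want to reproduce the shape of these numbers locally, a rough harness along the lines below will do it (sizes from one row of the table; `fast_read_sparse` is the sketch above, and absolute timings will of course vary by machine):

```python
import time

import numpy as np
import zarr


def make_sparse(store, n_chunks: int, populated: int, chunk_len: int = 256) -> zarr.Array:
    # Write data into only `populated` of `n_chunks` chunks; the rest stay
    # implicit (no objects in the store, resolved to fill_value on read).
    arr = zarr.create_array(
        store=store,
        shape=(n_chunks * chunk_len,),
        chunks=(chunk_len,),
        dtype="float64",
        fill_value=0.0,
    )
    rng = np.random.default_rng(0)
    for i in rng.choice(n_chunks, size=populated, replace=False):
        arr[i * chunk_len : (i + 1) * chunk_len] = rng.random(chunk_len)
    return arr


arr = make_sparse(zarr.storage.MemoryStore(), n_chunks=4096, populated=128)

t0 = time.perf_counter()
off = arr[:]                # baseline read path
t1 = time.perf_counter()
on = fast_read_sparse(arr)  # prefetch-scan recipe from the sketch above
t2 = time.perf_counter()

assert np.array_equal(off, on)
print(f"off {t1 - t0:.3f}s  on {t2 - t1:.3f}s  {(t1 - t0) / (t2 - t1):.1f}x")
```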
Impact / Proposed implementation
This would be added as an opt-in flag that makes `Array.__getitem__` (and the underlying selection methods) issue a single `store.list_prefix(...)` call before reading, then skip the per-chunk store round-trip for chunks that are not present, filling those regions of the output with `fill_value` directly. Default off; no behavior change unless enabled. The prototype is a ~30 line change, but conforming to the zarr-python API puts it closer to ~150 lines of code (and a full PR would be closer to 500 with benchmarks and unit tests).
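To make the opt-in concrete, here's one hypothetical spelling. The flag name (`sparse_read`) and the idea of riding on the per-array `config` (next to e.g. `write_empty_chunks`) are placeholders, which is exactly the naming/API feedback I'm after:

```python
import zarr

store = zarr.storage.LocalStore("healpix.zarr")

# Placeholder spelling: a per-array config flag, default off.
arr = zarr.open_array(store=store, config={"sparse_read": True})

# With the flag off, arr[:] behaves exactly as today. With it on,
# __getitem__ would issue one store.list_prefix(...) scan, bulk-fill
# fill_value, and fetch only the chunks that are present.
data = arr[:]
```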
Would a PR along these lines be welcome? I have a working branch with tests and a benchmark and would file it pending feedback on naming, scope, API, etc.