TL;DR
For sparse arrays where most chunks resolve to `fill_value`, the current read path spends most of its time iterating empty chunks. Empirical numbers from a 1D HEALPix array with 49,152 chunks, of which ~1,300 are populated (zarr 3.1.5):
| Path | Wall time | Speedup |
| --- | --- | --- |
| `arr[:]` (current) | 173.77 s | 1× |
| `fast_read_sparse` recipe | 2.73 s | 64× |
Background
I've been working on an aggregator called zagg which takes large, out-of-memory point datasets as input, and then aggregates them to a grid; the 'z' in zagg stands for 'zarr', since we write out to zarr. Given that the input datasets are large, we split up the aggregation and assign one worker per write chunk/shard target.
We're iterating on a refactor to support arbitrary output grids: regular rectilinear grids in whatever projection is specified (Mercator, polar stereographic, etc.), and also discrete global grid systems like HEALPix, H3, or S2. For the discrete global grid systems, we write to a chunk index that matches the global grid index; these indices scale quadratically with grid cell resolution (HEALPix, for example, has 12·nside² cells). For us, that means that even a large geographic extent (e.g., all of Antarctica) will only populate a small portion of the full index space.
We pay no penalty for writing out these sparse HEALPix arrays to zarr; we're not writing any actual data for the empty chunks, and the metadata write is lightweight. The sparse reads, however, are slow: right now, `arr[:]` iterates every chunk in the grid in Python and assigns `fill_value` per empty chunk, even when nothing is in the store. For the 49,152-chunk example at the top of this issue, ~150 s of the 173 s total wall time is exactly that loop, with zero I/O.
I'd like to propose a fast read path that avoids this (see the sketch after this list) by:
- issuing a single `store.list_prefix(...)` call before reading
- bulk-filling the output buffer with `fill_value` once
- reading populated chunks on top of the fill values (skipping the empty-chunk reads entirely)
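For reference, here's a minimal standalone version of the recipe. This is a sketch, not existing zarr API: it assumes zarr 3.x, a 1D array under the default v3 chunk key encoding (`<path>/c/<i>`), that `Store.list_prefix` yields full keys, and that `Array.store` / `Array.path` are available; `fast_read_sparse` is just my name for the recipe.

```python
import asyncio

import numpy as np
import zarr


async def _present_keys(store, prefix: str) -> set[str]:
    # One listing round-trip replaces one existence probe per chunk.
    return {key async for key in store.list_prefix(prefix)}


def fast_read_sparse(arr: zarr.Array) -> np.ndarray:
    # 1. Bulk-fill the output once, instead of assigning fill_value
    #    chunk-by-chunk in Python.
    out = np.full(arr.shape, arr.fill_value, dtype=arr.dtype)

    # 2. Single list_prefix call to learn which chunks actually exist.
    prefix = f"{arr.path}/c/" if arr.path else "c/"
    keys = asyncio.run(_present_keys(arr.store, prefix))

    # 3. Read only the populated chunks on top of the fill values.
    (chunk_len,) = arr.chunks  # 1D case, as in the HEALPix example
    n = arr.shape[0]
    for key in keys:
        i = int(key.rsplit("/", 1)[-1])  # "<path>/c/<i>" -> i
        start = i * chunk_len
        stop = min(start + chunk_len, n)
        out[start:stop] = arr[start:stop]
    return out
```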
This is about a ~30 line fix, and there's a prototype with the timings already up: see here (cell 6) with the fast read path implemented, vs here (also cell 6) without it. With only ~3% of the chunks populated, we get about a 64× speed improvement reading from S3; on LocalStore and MemoryStore, we see smaller (but still significant) improvements when reading sparsely filled arrays:
Benchmarks
Reading the full array (`arr[:]`) with ~3% of chunks populated, sweeping chunk count. `off` is the stock `arr[:]` baseline; `on` is the same call with the proposed flag, i.e. a single prefetch scan for missing chunks.
| store | n_chunks | populated | off (s) | on (s) | speedup |
| --- | --- | --- | --- | --- | --- |
| MemoryStore | 1024 | 32 | 0.0472 | 0.0124 | 3.8× |
| LocalStore | 1024 | 32 | 0.2833 | 0.0210 | 13.5× |
| MemoryStore | 4096 | 128 | 0.2446 | 0.0385 | 6.4× |
| LocalStore | 4096 | 128 | 1.1863 | 0.1125 | 10.5× |
| MemoryStore | 16384 | 512 | 0.8203 | 0.2604 | 3.2× |
| LocalStore | 16384 | 512 | 5.3576 | 0.5213 | 10.3× |
| MemoryStore | 49152 | 1536 | 3.0066 | 0.7335 | 4.1× |
| LocalStore | 49152 | 1536 | 14.1174 | 1.3016 | 10.8× |
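If you want to reproduce the shape of these numbers locally, a rough harness along the lines below will do it (sizes from one row of the table; `fast_read_sparse` is the sketch above, and absolute timings will of course vary by machine):

```python
import time

import numpy as np
import zarr


def make_sparse(store, n_chunks: int, populated: int, chunk_len: int = 256) -> zarr.Array:
    # Write data into only `populated` of `n_chunks` chunks; the rest stay
    # implicit (no objects in the store, resolved to fill_value on read).
    arr = zarr.create_array(
        store=store,
        shape=(n_chunks * chunk_len,),
        chunks=(chunk_len,),
        dtype="float64",
        fill_value=0.0,
    )
    rng = np.random.default_rng(0)
    for i in rng.choice(n_chunks, size=populated, replace=False):
        arr[i * chunk_len : (i + 1) * chunk_len] = rng.random(chunk_len)
    return arr


arr = make_sparse(zarr.storage.MemoryStore(), n_chunks=4096, populated=128)

t0 = time.perf_counter()
off = arr[:]                # baseline read path
t1 = time.perf_counter()
on = fast_read_sparse(arr)  # prefetch-scan recipe from the sketch above
t2 = time.perf_counter()

assert np.array_equal(off, on)
print(f"off {t1 - t0:.3f}s  on {t2 - t1:.3f}s  {(t1 - t0) / (t2 - t1):.1f}x")
```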
Impact / Proposed implementation
This would be added as an opt-in flag that makes `Array.__getitem__` (and the underlying selection methods) issue a single `store.list_prefix(...)` call before reading, then skip the per-chunk store round-trip for chunks that are not present, filling those regions of the output with `fill_value` directly. Default off; no behavior change unless enabled. The prototype is a ~30 line change, but conforming to the zarr-python API puts it closer to ~150 lines of code (and a full PR would be closer to 500 with benchmarks and unit tests).
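To make the opt-in concrete, here's one hypothetical spelling. The flag name (`sparse_read`) and the idea of riding on the per-array `config` (next to e.g. `write_empty_chunks`) are placeholders, which is exactly the naming/API feedback I'm after:

```python
import zarr

store = zarr.storage.LocalStore("healpix.zarr")

# Placeholder spelling: a per-array config flag, default off.
arr = zarr.open_array(store=store, config={"sparse_read": True})

# With the flag off, arr[:] behaves exactly as today. With it on,
# __getitem__ would issue one store.list_prefix(...) scan, bulk-fill
# fill_value, and fetch only the chunks that are present.
data = arr[:]
```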
Would a PR along these lines be welcome? I have a working branch with tests and a benchmark and would file it pending feedback on naming, scope, API, etc.