Support glob patterns in open_datatree(group=...) for selective group loading#11302

Open
aladinor wants to merge 15 commits into pydata:main from aladinor:glob-group-filtering-standalone

@aladinor
Contributor

Summary

When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Use cases

  • Radar data: xr.open_datatree("radar.nc", group="*/sweep_0") — load only the lowest elevation sweep from each volume scan
  • CMIP archives: xr.open_datatree("cmip.zarr", group="*/historical/tas") — load only temperature across all models

Changes

  • Added shared utilities _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
  • Updated NetCDF4, H5NetCDF, and Zarr backends to use a discover → filter → open pipeline
  • Uses the same matching engine as DataTree.match() (PurePosixPath.match)
  • Root (/) and all ancestors of matched nodes are always included to form a valid tree
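The filter step in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's `_filter_group_paths`: the helper name `filter_group_paths_sketch` and the sample group paths are hypothetical, and it only assumes that matching goes through `PurePosixPath.match` and that the root and all ancestors of matched nodes are kept.

```python
from pathlib import PurePosixPath


def filter_group_paths_sketch(paths, pattern):
    """Keep groups matching ``pattern``, plus the root and all ancestors.

    Hypothetical helper for illustration; not the implementation in common.py.
    """
    keep = {"/"}  # the root is always part of the resulting tree
    for path in paths:
        if path == "/":
            continue
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(path)
            # Add every ancestor so matched nodes stay connected to the root.
            keep.update(str(parent) for parent in node.parents)
    # Preserve the discovery order of the input.
    return [path for path in paths if path in keep]


# Hypothetical group layout, loosely modelled on the radar use case.
discovered = [
    "/",
    "/VCP-34",
    "/VCP-34/sweep_0",
    "/VCP-34/sweep_1",
    "/VCP-35",
    "/VCP-35/sweep_0",
]
selected = filter_group_paths_sketch(discovered, "*/sweep_0")
print(selected)
```

With this layout, `sweep_1` is dropped while both `VCP-*` parents survive as ancestors; a pattern that matches nothing leaves only the root.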

Behavior summary

| group value | Behavior |
|---|---|
| None | Load all groups (unchanged) |
| "VCP-34" (no glob chars) | Root selection (unchanged) |
| "*/sweep_0" (glob chars) | Filter mode: only matched groups + ancestors |
| Pattern matches nothing | Root-only tree |
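
The dispatch summarised above could look roughly like this. It is a sketch under the assumption that detection is a plain check for the three metacharacters; `is_glob_pattern_sketch` and `group_mode_sketch` are hypothetical names, not the PR's `_is_glob_pattern` or `_resolve_group_and_filter`.

```python
GLOB_CHARS = "*?["  # the metacharacters named in the summary


def is_glob_pattern_sketch(group):
    """Return True if ``group`` contains any glob metacharacter (assumed rule)."""
    return any(char in group for char in GLOB_CHARS)


def group_mode_sketch(group):
    """Map a ``group`` argument to one of the behaviours in the table."""
    if group is None:
        return "load_all"  # no group given: open the full hierarchy
    if is_glob_pattern_sketch(group):
        return "filter"  # glob chars: keep matched groups and their ancestors
    return "reroot"  # plain name: select that group as the new root


print(group_mode_sketch(None), group_mode_sketch("VCP-34"), group_mode_sketch("*/sweep_0"))
```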

Test plan

  • 27 new tests covering all backends (netCDF4, h5netcdf, zarr v2/v3)
  • Unit tests for _is_glob_pattern, _filter_group_paths, _resolve_group_and_filter with *, ?, []
  • Integration tests: glob match, no-match, data preservation, open_groups API
  • Full test_backends_datatree.py suite passes (228 passed, 0 failures)
  • Pre-commit checks pass

@github-actions github-actions bot added the topic-backends, topic-zarr, and io labels Apr 16, 2026
Add _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter
to common.py for detecting and applying glob patterns to group paths.
Use _resolve_group_and_filter in open_groups_as_dict to support glob
patterns in the group parameter for selective group loading.
Update docstrings for the group kwarg in open_datatree and open_groups
to describe glob metacharacter behavior.
Add integration tests for netCDF4, h5netcdf, and zarr backends, plus
unit tests for _is_glob_pattern, _filter_group_paths, and
_resolve_group_and_filter covering *, ?, and [] metacharacters.
@aladinor aladinor force-pushed the glob-group-filtering-standalone branch from e892524 to 5fb46e1 Compare April 16, 2026 17:09
@kmuehlbauer
Contributor

@aladinor Thanks, that's a great feature. I'd instantly use it.

There might be some pitfalls if group names contain one or more of the glob metacharacters. Will this be handled, too?

my_nifty_group_with_a_star_*_01
my_nifty_group_with_a_star_*_11
my_nifty_group_with_a_star_*_12

@kmuehlbauer
Contributor

XRef: h5py/h5py#2059 for discussion of adding globbing in h5py

@aladinor
Contributor Author

aladinor commented Apr 22, 2026

@kmuehlbauer, thanks for taking the time to check this out.

my_nifty_group_with_a_star_*_01
my_nifty_group_with_a_star_*_11
my_nifty_group_with_a_star_*_12

This seems to be a strange way to name a group, but yes, it is handled. A literal metacharacter can be matched with the same character-class escape that fnmatch / PurePath.match supports.

For example, if we have something like this

  paths = ['/my_nifty_group_with_a_star_*_01',
           '/my_nifty_group_with_a_star_*_11',
           '/my_nifty_group_with_a_star_*_12']

We can use the pattern "*star_[*]_*" to select those groups. The character class [*] matches a literal *, so the pattern matches all three paths.
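
The escape can be checked with the standard library alone, without opening any file. The snippet below is only an illustration: the extra `/plain_01` path is added to show a non-match, and the use of `glob.escape` to build an exact-name pattern is a suggestion, not something the PR itself does.

```python
from fnmatch import fnmatchcase
from glob import escape
from pathlib import PurePosixPath

paths = [
    "/my_nifty_group_with_a_star_*_01",
    "/my_nifty_group_with_a_star_*_11",
    "/my_nifty_group_with_a_star_*_12",
    "/plain_01",  # no literal "*" in the name, so it should not match
]

# [*] is a character class containing only "*", i.e. a literal star.
pattern = "*star_[*]_*"
matched = [p for p in paths if PurePosixPath(p).match(pattern)]
print(matched)

# glob.escape wraps each of "*", "?" and "[" in a character class,
# producing a pattern that matches exactly one name.
exact = escape("my_nifty_group_with_a_star_*_01")
print(exact)
print(fnmatchcase("my_nifty_group_with_a_star_*_01", exact))
```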

Add coverage for group names containing literal ``*`` / ``?`` / ``[``.
These are reachable with ``[*]`` / ``[?]`` / ``[[]`` character-class
escaping (inherited from ``fnmatch`` / ``PurePath.match`` semantics).

New tests:
- ``test_open_datatree_glob_char_class_escape_literal_metachar`` on
  ``NetCDFIOBase`` and ``TestZarrDatatreeIO`` — end-to-end verification
  that groups with literal metacharacters in their names can be
  targeted across all supported backends.
- ``test_filter_group_paths_literal_metachar_via_char_class`` on
  ``TestGlobPatternUtilities`` — unit-level check of the filter.
Explain that matching follows ``fnmatch`` / :py:meth:`pathlib.PurePath.match`
semantics and that literal ``*`` / ``?`` / ``[`` in group names can be
targeted via character-class escapes (``[*]``, ``[?]``, ``[[]``), with a
short example. Applied to both :py:func:`open_datatree` and
:py:func:`open_groups` for consistency.
Add ``/plain_01`` to the zarr ``test_open_datatree_glob_char_class_escape_literal_metachar``
fixture so it matches the NetCDF version and confirms plain (no-metachar)
group names are excluded when the pattern targets literal-metachar names.
Windows forbids ``*`` and ``?`` in filesystem directory/file names, and
zarr stores each group as an on-disk directory. That makes writing the
fixture impossible before the test can exercise the filter. NetCDF4/H5
store groups inside the HDF5 container so they are unaffected.

Skip the zarr variant on Windows with a clear reason; the NetCDF
variants still cover the escape behavior on all platforms.
The previous commit skipped the zarr variant on Windows because the
filesystem rejects ``*`` and ``?`` in directory names. Using
``zarr.storage.MemoryStore`` side-steps the filesystem entirely, so the
test now runs on every platform and still exercises the escape logic.

This is also a more realistic target for the feature on Windows — users
who hit group names with glob metacharacters are likely reading from
cloud/icechunk stores (dict-keyed like ``MemoryStore``), not an on-disk
zarr directory tree.
``open_datatree``'s static signature doesn't list zarr store objects
(``MemoryStore`` etc.) among its accepted first-argument types, but the
zarr backend handles them correctly at runtime. Apply a narrow
``# type: ignore[arg-type]`` on the three test calls rather than
widening the public signature.