Add managed-memory advise, prefetch, and discard-prefetch free functions #1775
rparolin wants to merge 36 commits into NVIDIA:main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
question: Does making these member functions of the
I'm moving this back into draft. We discussed it in our team meeting; I was already hesitant because Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives, and free functions sound like a good alternative to explore.
…ns in the cuda.core.managed_memory namespace
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, the v2 bindings flag, discard_prefetch support, and the advice enum-to-alias reverse map
- Collapse hasattr+getattr to a single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to the top of discard_prefetch for fail-fast behavior
- Fix docs build: reset the Sphinx module scope after the managed_memory section in api.rst so subsequent sections resolve under cuda.core
- Add a discard_prefetch pool-allocation test and a comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Status triage from assignee @cpcloud while looking at #1332 for v1.0.0 (due 2026-05-07): Blocker #718 is resolved. This PR is currently blocked on design decisions, not code. Summarizing the open threads so we can unblock:
@rparolin — are you planning to resume this PR? No pressure, just want to make sure the work doesn't fall through the cracks. @leofang — could you weigh in on (1) and (2)? If those are settled, (3)-(6) are straightforward follow-ups. If the design takes longer than about a week, I'd suggest bumping #1332 to post-1.0. I can help with the rebase and mechanical fixes once the design is locked.
…m_advise_prefetch

# Conflicts:
#	cuda_bindings/pixi.lock
#	cuda_core/cuda/core/_memory/_buffer.pyx
#	cuda_core/docs/source/api.rst
#	cuda_core/docs/source/release/0.7.x-notes.rst
#	cuda_core/pixi.lock
Upstream renamed get_binding_version → binding_version and moved it from cuda.core._utils.cuda_utils to cuda.core._utils.version. Update the managed-memory ops module to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cuda.core.experimental namespace is being deprecated and should not gain new submodules. Per review feedback, the managed_memory module should only be reachable via cuda.core.managed_memory, not via the experimental compatibility shim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen dataclass with classmethod constructors for the four CUmemLocationType kinds (device, host, host_numa, host_numa_current). Validates id constraints in __post_init__. Re-exported from cuda.core.managed_memory. This will replace the location=/location_type= kwargs in the upcoming unified 1..N managed-memory ops API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
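A minimal sketch of the dataclass this commit describes; the field names, kind strings, and validation rules here are inferred from the commit message, not copied from the actual cuda.core implementation:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the Location dataclass: frozen, with classmethod
# constructors for the four CUmemLocationType kinds and id validation in
# __post_init__.
@dataclass(frozen=True)
class Location:
    kind: str      # "device", "host", "host_numa", or "host_numa_current"
    id: int = -1   # device ordinal or NUMA node id, where applicable

    def __post_init__(self):
        # device/host_numa carry an id; host/host_numa_current do not.
        if self.kind in ("device", "host_numa") and self.id < 0:
            raise ValueError(f"{self.kind} location requires a non-negative id")
        if self.kind in ("host", "host_numa_current") and self.id != -1:
            raise ValueError(f"{self.kind} location does not take an id")

    @classmethod
    def device(cls, device_id: int) -> "Location":
        return cls("device", device_id)

    @classmethod
    def host(cls) -> "Location":
        return cls("host")

    @classmethod
    def host_numa(cls, numa_id: int) -> "Location":
        return cls("host_numa", numa_id)

    @classmethod
    def host_numa_current(cls) -> "Location":
        return cls("host_numa_current")
```

Freezing the dataclass makes locations hashable and safely shareable across calls, which suits a value type that only names a memory placement.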
Centralizes back-compat coercion for managed-memory Location inputs:

- Location → passthrough
- Device → Location.device(device_id)
- int >= 0 → Location.device(int)
- int == -1 → Location.host()
- None → None when allow_none=True, else ValueError

Will be used by the unified 1..N managed-memory ops API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
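The coercion table in this commit can be sketched in plain Python. `Location` and `Device` below are minimal stand-ins so the snippet is self-contained, and `coerce_location` is an illustrative name, not the PR's actual helper:

```python
from dataclasses import dataclass

# Minimal stand-ins for the real cuda.core classes.
@dataclass(frozen=True)
class Location:
    kind: str
    id: int = -1

    @classmethod
    def device(cls, device_id):
        return cls("device", device_id)

    @classmethod
    def host(cls):
        return cls("host")

@dataclass(frozen=True)
class Device:
    device_id: int

def coerce_location(value, *, allow_none=False):
    """Apply the back-compat coercion rules listed in the commit message."""
    if value is None:
        if allow_none:
            return None
        raise ValueError("a location is required here")
    if isinstance(value, Location):
        return value                               # passthrough
    if isinstance(value, Device):
        return Location.device(value.device_id)    # Device -> device location
    if isinstance(value, int):
        if value >= 0:
            return Location.device(value)          # ordinal -> device location
        if value == -1:
            return Location.host()                 # -1 -> host location
    raise ValueError(f"cannot interpret {value!r} as a Location")
```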
The legacy-bindings monkeypatch tests still referenced get_binding_version, which was renamed to binding_version in cf2f20d. Update both occurrences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback on _buffer.pyx:

- Restore `inline` on `_init_mem_attrs` and `_query_memory_attrs`.
- Set `out.is_managed = (is_managed != 0)` once outside the if/elif, rather than per-branch (the driver leaves the attribute zero for non-managed pointers, so all three branches converged on the same value anyway).
- Add a TODO noting that HMM/ATS-enabled sysmem should also report `is_managed=True`; the CU_POINTER_ATTRIBUTE_IS_MANAGED query does not capture that yet.

The Cython modernization of _managed_memory_ops.pyx (cimport cydriver, IF/ELSE for the 12/13 ABI split) is folded into Tasks 5-8, where the public API is being rewritten anyway; doing it here would mean rewriting the same call sites twice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite prefetch() with the unified single-or-batched signature targeted by issue NVIDIA#1333:

    prefetch(targets, location, *, options=None, stream)

- targets accepts a single Buffer or a sequence of Buffers
- location accepts a Location dataclass, Device, int (-1 = host), or a sequence broadcasting to per-buffer locations
- length mismatch raises ValueError; empty targets raises ValueError
- options is reserved for future per-call flags and must be None
- stream moved to the end, kept keyword-only

Internals: switch from the Python-level driver.cuMemPrefetchAsync to the Cython-level cydriver.cuMemPrefetchAsync via cimport cydriver, with HANDLE_RETURN. Replace the runtime _V2_BINDINGS check with compile-time IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE, per the codebase precedent in _managed_memory_resource.pyx, _memory_pool.pyx, and _tensor_map.pyx. N>1 dispatches to cydriver.cuMemPrefetchBatchAsync (CUDA 13 only); on CUDA 12 builds, batched prefetch raises NotImplementedError. Single-range prefetch continues to work on both CUDA 12 and 13 builds.

The location_type= keyword is removed; callers express the location kind via the Location dataclass added in 20d036e. The advise() and discard_prefetch() functions still use the legacy _normalize_managed_location helper and Python-level driver calls; they will be migrated in their own tasks.

Also drops test_managed_memory_prefetch_uses_legacy_bindings_signature, which monkeypatched the Python-level driver.cuMemPrefetchAsync — no longer applicable since the prefetch path uses cydriver. The corresponding advise legacy-bindings test stays for now (advise still uses the Python driver).

Closes Andy-Jost's review comment that the existing API is "non-Pythonic" by making it Pythonic in a different direction (a typed Location dataclass) while preserving the free-function shape pending Leo's tie-break on the ManagedBuffer subclass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
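The targets/location broadcasting rules above can be sketched in plain Python; `broadcast_locations` is an illustrative name, not the PR's internal helper, and the "buffers" here are placeholder values:

```python
# Illustrative sketch of the unified 1..N pairing rules: a lone target is
# wrapped, a lone location is broadcast, and sequences must match in length.
def broadcast_locations(targets, location):
    """Pair each target with a location per the unified single-or-batched API."""
    if not isinstance(targets, (list, tuple)):
        targets = [targets]                       # single target -> length-1 batch
    if not targets:
        raise ValueError("targets must not be empty")
    if isinstance(location, (list, tuple)):
        if len(location) != len(targets):
            raise ValueError(
                f"got {len(targets)} targets but {len(location)} locations"
            )
        return list(zip(targets, location))       # per-buffer locations
    return [(t, location) for t in targets]       # broadcast one location
```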
Adds a new discard(targets, *, options=None, stream) free function that wraps cuMemDiscardBatchAsync. Accepts a single Buffer or a sequence; N>=1 dispatches to the batched driver entry point. Requires a CUDA 13 build of cuda.core (NotImplementedError on CUDA 12 builds).

Closes the second of three batched managed-memory operations from NVIDIA#1333:

- P1: cudaMemDiscardBatchAsync <- this commit
- P1: cudaMemPrefetchBatchAsync <- 818f5d2
- P1: cudaMemDiscardAndPrefetchBatchAsync <- next commit

Re-exported from cuda.core.managed_memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…driver

Rewrite discard_prefetch() with the unified single-or-batched signature:

    discard_prefetch(targets, location, *, options=None, stream)

- targets accepts a single Buffer or a sequence of Buffers
- location accepts a Location, Device, int, or per-buffer sequence
- length mismatch / empty targets raise ValueError
- options must be None (reserved)
- stream moved to the end, kept keyword-only

Internals: switch from the Python-level driver.cuMemDiscardAndPrefetchBatchAsync to the Cython-level cydriver.cuMemDiscardAndPrefetchBatchAsync. The runtime discard-prefetch availability check is replaced by compile-time IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE; on CUDA 12 builds the call raises NotImplementedError. The location_type= keyword is removed; use the Location dataclass instead.

Closes the third managed-memory batched op from NVIDIA#1333.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aratus

Rewrite advise() with the unified single-or-batched signature:

    advise(targets, advice, location=None, *, options=None)

- targets accepts a single Buffer or a sequence
- advice still accepts string aliases or driver.CUmem_advise enum values
- location accepts a Location dataclass, Device, int, None, or a per-buffer sequence (None permitted only for set_read_mostly, unset_read_mostly, unset_preferred_location)
- per-advice allowed-kind validation ported to operate on Location.kind (matches the CUDA driver constraints from the existing tables)
- options reserved for future per-call flags
- for N>1, loops cydriver.cuMemAdvise per buffer (no batched advise API exists in CUDA)

Internals: switch to cydriver.cuMemAdvise (Cython-level); use compile-time IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE for the 12/13 ABI split.

Drop the legacy apparatus that all four functions previously shared:

- _normalize_managed_location (returned a Python driver.CUmemLocation)
- _make_managed_location, _managed_location_enum
- _managed_location_uses_v2_bindings + the _V2_BINDINGS lazy cache
- _managed_location_to_legacy_device + the _LEGACY_LOC_DEVICE/HOST cache
- _require_managed_discard_prefetch_support
- unused module-level constants (_HOST_NUMA_CURRENT_ID, _SINGLE_RANGE_COUNT, _MANAGED_OPERATION_FLAGS, etc.)

Also drop test_managed_memory_advise_uses_legacy_bindings_signature and the _LEGACY_BINDINGS_VERSION constant; the runtime version switch is gone, replaced by a compile-time IF/ELSE that the test could not exercise. The CUDA 12 vs CUDA 13 paths are now covered by the build-matrix CI job.

Closes Task 8 (advise) and Task 9 (legacy-bindings test cleanup) from docs/superpowers/plans/2026-04-27-managed-memory-ops-batched.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
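The per-advice location validation described above can be sketched as a small table; only the constraints actually stated in this PR are encoded (which advices accept location=None, and that set_accessed_by rejects NUMA host kinds), and the function name is illustrative:

```python
# Illustrative validation table; encodes only the constraints mentioned in
# the commit messages, not the full CUDA driver rules.
ADVICE_LOCATION_OPTIONAL = {
    "set_read_mostly",
    "unset_read_mostly",
    "unset_preferred_location",
}
ADVICE_REJECTED_KINDS = {
    # set_accessed_by cannot target NUMA host locations
    "set_accessed_by": {"host_numa", "host_numa_current"},
}

def validate_advice_location(advice, location_kind):
    """location_kind is None or one of: device, host, host_numa, host_numa_current."""
    if location_kind is None:
        if advice not in ADVICE_LOCATION_OPTIONAL:
            raise ValueError(f"{advice} requires a location")
        return
    if location_kind in ADVICE_REJECTED_KINDS.get(advice, ()):
        raise ValueError(f"{advice} does not accept {location_kind} locations")
```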
… ops _require_managed_buffer was poking at Buffer._mem_attrs.is_managed directly via _init_mem_attrs(). PR NVIDIA#1924 added the public Buffer.is_managed property which falls back to MemoryResource.is_managed when the pointer attribute query does not advertise managed memory (the case for pool- allocated managed memory). Switch _require_managed_buffer to the public property. This also fixes a latent bug where pool-allocated managed buffers were being rejected by the managed_memory ops despite Buffer.is_managed correctly reporting True. Drops the no-longer-needed cimport of _init_mem_attrs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
api.rst: add Location and discard to the managed_memory autosummary. 1.0.0-notes.rst: replace the placeholder bullet with a description of the unified 1..N API, the Location dataclass, and the dispatch to batched driver entry points on cuda.bindings 12.8+. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n docstring Per /simplify review, remove WHAT-only comments that just restate the function signature in front of _coerce_buffer_targets and _broadcast_locations. Tighten the _coerce_location docstring to lead with the conversion intent rather than restate the type annotation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ruff auto-applied:
* Drop unused `_managed_memory_ops` test import (no longer needed
after the legacy-bindings monkeypatch test was deleted)
* Drop "Location" string-quoted forward refs in
_managed_location.py (file already uses `from __future__ import
annotations`)
* Reformat string concatenations and add blank-line-after-import
spacing
- cython-lint auto-applied:
* Drop unused libc.stdint cimport of `uintptr_t`
* Drop unused `Location` Python import (only used in docstrings)
* Drop unused `n` local in `discard()`
* Move `cpython.mem cimport` of PyMem_Free / PyMem_Malloc inside
the `IF CUDA_CORE_BUILD_MAJOR >= 13:` block where the symbols
are actually used; cython-lint cannot see across compile-time
branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review request (NVIDIA#1775 (comment)), fold the managed-memory free functions and the Location dataclass into cuda.core.utils rather than maintaining a dedicated cuda.core.managed_memory namespace.

- Re-export Location, advise, prefetch, discard, discard_prefetch from cuda.core.utils.
- Delete the cuda.core.managed_memory module.
- Update cuda.core.__init__ to drop the managed_memory submodule import.
- Update tests to import from cuda.core.utils.
- Update api.rst: drop the dedicated Managed memory section; add the managed-memory entries to the Utility functions section.
- Update 1.0.0-notes.rst accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds managed-memory `advise()`, `prefetch()`, `discard()`, and `discard_prefetch()` as free functions under the new `cuda.core.managed_memory` namespace. Each function accepts either a single `Buffer` or a sequence; N==1 dispatches to the per-range CUDA driver entry point and N>1 dispatches to the corresponding `cuMem*BatchAsync`. Closes #1332.

Addresses the managed-memory portion of #1333 (P1: `cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`). The P0 `cuMemcpyBatchAsync` from #1333 is intentionally out of scope and tracked separately.

Public API — `cuda.core.managed_memory`

`Location` is a frozen dataclass with `device(int)`, `host()`, `host_numa(int)`, and `host_numa_current()` classmethod constructors. The previous `location_type=` kwarg has been removed.

Implementation notes

- `cuda_core/cuda/core/_memory/_managed_memory_ops.pyx` uses `cimport cydriver` for direct C-level driver calls (no Python-level attribute lookup per call).
- The CUDA 12/13 split for `cuMemAdvise` and `cuMemPrefetchAsync` is handled at compile time with `IF CUDA_CORE_BUILD_MAJOR >= 13:` / `ELSE:` (matches the codebase precedent in `_managed_memory_resource.pyx`, `_memory_pool.pyx`, `_tensor_map.pyx`).
- The batched entry points (`cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise `NotImplementedError`; single-buffer calls work everywhere.
- `_require_managed_buffer` uses the public `Buffer.is_managed` property added in "Fix is_managed reporting for pool-allocated managed memory #1924", so pool-allocated managed memory is correctly recognized.
- `_buffer.pyx` collapses `out.is_managed = (is_managed != 0)` to a single unconditional assignment (Leo's feedback) and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by `CU_POINTER_ATTRIBUTE_IS_MANAGED`.

Tests

`cuda_core/tests/test_memory.py` adds `TestLocation`, `TestLocationCoerce`, `TestPrefetch`, `TestDiscard`, `TestDiscardPrefetch`, and `TestAdvise`. Coverage:

- `Location` coercion from `Location` / `Device` / `int` / `-1` (host) / `None`, including the rejection cases (`ValueError`)
- length mismatch and empty targets raise `ValueError`
- non-`None` `options` raises `TypeError` on every public function
- per-advice location-kind validation (`set_accessed_by` rejects `host_numa` and `host_numa_current`)
- string alias and `driver.CUmem_advise` enum value both accepted by `advise`

Full `pixi run -e cu13 pytest cuda_core/tests/` passes (2984 passed, 195 skipped on hardware gating, 3 xfailed).

Deferred follow-ups

- `ManagedBuffer` subclass with a property-style API (`buf.read_mostly = True` etc.) — Andy's suggestion. The current free-function shape is forward-compatible: subclass methods can call the same free functions.
- HMM/ATS `is_managed` semantics — flagged as a `TODO` in `_buffer.pyx`, tracked alongside the broader HMM/ATS work.
- `cuMemcpyBatchAsync` (P0 of "Support batched memory movement #1333") — different family, separate PR.
- `*Options` dataclasses for the four functions — the `options` parameter is reserved with `None`-only acceptance for now; concrete options classes will land when CUDA introduces per-call flags worth surfacing.
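The N==1 versus N>1 dispatch and the CUDA 12 fallback described in the summary can be sketched in plain Python; the returned strings stand in for the real cydriver calls, and `CUDA_MAJOR` plays the role of the compile-time `CUDA_CORE_BUILD_MAJOR` constant:

```python
# Hypothetical dispatch sketch: one buffer goes to the per-range entry point
# (works on CUDA 12 and 13 builds); a batch goes to the batched entry point,
# which only exists on CUDA 13 builds.
CUDA_MAJOR = 13  # compile-time in the real code; a plain constant here

def dispatch_prefetch(targets):
    """Return the name of the driver entry point the call would use."""
    if not isinstance(targets, (list, tuple)):
        targets = [targets]
    if not targets:
        raise ValueError("targets must not be empty")
    if len(targets) == 1:
        return "cuMemPrefetchAsync"           # per-range, CUDA 12 and 13
    if CUDA_MAJOR >= 13:
        return "cuMemPrefetchBatchAsync"      # batched, CUDA 13+ only
    raise NotImplementedError("batched prefetch requires a CUDA 13 build")
```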