
Add managed-memory advise, prefetch, and discard-prefetch free functions#1775

Open
rparolin wants to merge 36 commits into NVIDIA:main from rparolin:rparolin/managed_mem_advise_prefetch

Conversation

@rparolin
Collaborator

@rparolin rparolin commented Mar 17, 2026

Summary

Adds managed-memory advise(), prefetch(), discard(), and discard_prefetch() as free functions under the new cuda.core.managed_memory namespace. Each function accepts either a single Buffer or a sequence; N==1 dispatches to the per-range CUDA driver entry point and N>1 dispatches to the corresponding cuMem*BatchAsync.

Closes #1332. Addresses the managed-memory portion of #1333 (P1: cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync). The P0 cuMemcpyBatchAsync from #1333 is intentionally out of scope and tracked separately.

Public API — cuda.core.managed_memory

from cuda.core.managed_memory import Location, advise, prefetch, discard, discard_prefetch

# Single buffer
prefetch(buf, Location.host(), stream=s)
prefetch(buf, device, stream=s)               # Device → device
prefetch(buf, 0, stream=s)                    # int >= 0 → device
prefetch(buf, -1, stream=s)                   # -1 → host

# Batched
prefetch([b1, b2, b3], Location.device(0), stream=s)             # broadcast
prefetch([b1, b2], [Location.host(), Location.device(0)], stream=s)   # per-buffer

# Other ops follow the same shape
discard([b1, b2], stream=s)                   # CUDA 13+
discard_prefetch(b, Location.host(), stream=s)
advise(b, "set_read_mostly")
advise([b1, b2], "set_preferred_location", [Location.host(), device])

Location is a frozen dataclass with device(int), host(), host_numa(int), and host_numa_current() classmethod constructors. The previous location_type= kwarg has been removed.

Implementation notes

  • Cython implementation in cuda_core/cuda/core/_memory/_managed_memory_ops.pyx uses cimport cydriver for direct C-level driver calls (no Python-level attribute lookup per call).
  • The CUDA 12 / 13 ABI split for cuMemAdvise and cuMemPrefetchAsync is handled at compile time with IF CUDA_CORE_BUILD_MAJOR >= 13: / ELSE: (matches the codebase precedent in _managed_memory_resource.pyx, _memory_pool.pyx, _tensor_map.pyx).
  • Batched entry points (cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise NotImplementedError; single-buffer calls work everywhere.
  • _require_managed_buffer uses the public Buffer.is_managed property added in Fix is_managed reporting for pool-allocated managed memory #1924, so pool-allocated managed memory is correctly recognized.
  • _buffer.pyx collapses out.is_managed = (is_managed != 0) to a single unconditional assignment (Leo's feedback) and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by CU_POINTER_ATTRIBUTE_IS_MANAGED.

Tests

cuda_core/tests/test_memory.py adds TestLocation, TestLocationCoerce, TestPrefetch, TestDiscard, TestDiscardPrefetch, and TestAdvise. Coverage:

  • Single buffer with Location / Device / int / -1 (host) / None rejection
  • Batched with single broadcast location and with per-buffer location list
  • Length mismatch raises ValueError
  • Empty targets raises ValueError
  • options non-None raises TypeError on every public function
  • Non-managed buffer rejected on every public function
  • Per-advice allowed-location-kind validation (e.g. set_accessed_by rejects host_numa and host_numa_current)
  • String alias and driver.CUmem_advise enum value both accepted by advise

Full pixi run -e cu13 pytest cuda_core/tests/ passes (2984 passed, 195 skipped due to hardware gating, 3 xfailed).

Deferred follow-ups

  • ManagedBuffer subclass with property-style API (buf.read_mostly = True etc.) — Andy's suggestion. The current free-function shape is forward-compatible: subclass methods can call the same free functions.
  • HMM/ATS-aware is_managed semantics — flagged as a TODO in _buffer.pyx, tracked alongside the broader HMM/ATS work.
  • cuMemcpyBatchAsync (P0 of Support batched memory movement #1333) — different family, separate PR.
  • Concrete *Options dataclasses for the four functions — options parameter is reserved with None-only acceptance for now; concrete options classes will land when CUDA introduces per-call flags worth surfacing.

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Mar 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@rparolin rparolin requested a review from Andy-Jost March 17, 2026 00:41
@rparolin rparolin self-assigned this Mar 17, 2026
@rparolin rparolin added this to the cuda.core v0.7.0 milestone Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:45
@rparolin rparolin marked this pull request as draft March 17, 2026 00:45
@rparolin rparolin changed the title wip Add managed-memory advise, prefetch, and discard-prefetch on Buffer Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:57

@rparolin
Collaborator Author

/ok to test

@jrhemstad

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

@rparolin
Collaborator Author

rparolin commented Mar 17, 2026

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

I'm moving this back into draft. We discussed this in our team meeting; I was already hesitant because Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives, and free functions sound like a good one to explore.

@rparolin rparolin marked this pull request as draft March 17, 2026 19:35
@rparolin rparolin marked this pull request as ready for review March 17, 2026 23:46
rparolin and others added 7 commits March 17, 2026 17:30
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from
  _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a
  single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag,
  discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch
  for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in
  api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path

The _V2_BINDINGS cache in _buffer.pyx persists across tests, so
monkeypatching get_binding_version alone is insufficient when earlier
tests have already populated the cache with the v2 value. Promote
_V2_BINDINGS from cdef int to a Python-level variable so tests can
monkeypatch it directly via monkeypatch.setattr, and reset it to -1
in both legacy-signature tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware

These three tests call cuMemAdvise on real CUDA devices and verify
memory range attributes. On devices without concurrent_managed_access
(e.g. Windows/WDDM), set_read_mostly silently no-ops and
set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the
stricter _skip_if_managed_location_ops_unsupported guard, matching the
pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support

Reorder checks in discard_prefetch so _normalize_managed_target_range
runs before _require_managed_discard_prefetch_support. This ensures
non-managed buffers raise ValueError before the RuntimeError for missing
cuMemDiscardAndPrefetchBatchAsync support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module

Move advise, prefetch, and discard_prefetch functions and their helpers
out of _buffer.pyx into a new _managed_memory_ops Cython module to
improve separation of concerns. Expose _init_mem_attrs and
_query_memory_attrs as non-inline cdef functions in _buffer.pxd so the
new module can reuse them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cpcloud
Contributor

cpcloud commented Apr 23, 2026

Status triage from assignee @cpcloud while looking at #1332 for v1.0.0 (due 2026-05-07):

Blocker #718 is resolved (ManagedMemoryResource merged 2025-12-16), so the issue itself is unblocked.

This PR is currently blocked on design decisions, not code. Summarizing the open threads so we can unblock:

  1. Free functions vs. ManagedBuffer subclass — Andy-Jost proposed a subclass with properties (preferred_location, read_mostly, accessed_by set, .prefetch() method). @rparolin asked @leofang to tie-break on 2026-03-19; no public resolution yet.
  2. Holistic plan for batched APIs — @leofang's CHANGES_REQUESTED (2026-03-25) asks for a joint plan with Support batched memory movement #1333 (batched memory movement) before 1.0. This is the top blocker on this PR merging.
  3. Module placement — @leofang prefers folding the 3 functions into cuda.core.utils rather than a dedicated cuda.core.managed_memory namespace.
  4. Options dataclass — @leofang (2026-03-29) notes cuda.core uses option dataclasses (e.g. ManagedMemoryResourceOptions), but this PR uses kwargs.
  5. Mechanical cleanup — cimport cydriver + IF/ELSE for the 12/13 split, drop changes to cuda.core.experimental/__init__.py, drop cuda_bindings/pixi.lock and test_experimental_backward_compat.py changes, <Buffer?> cast, HMM/ATS is_managed TODO, reason for removing inline from _init_mem_attrs.
  6. Rebase regressions — the branch has merge conflicts; Fix managed memory misclassified as kDLCUDAHost in DLPack device mapping #1863 (kDLCUDAManaged fix) and Fix is_managed reporting for pool-allocated managed memory #1924 (Buffer.is_managed/MemoryResource.is_managed for pool-allocated managed memory) both landed after this PR's last push and the current branch would regress them on rebase.

@rparolin — are you planning to resume this PR? No pressure, just want to make sure we don't drop the work between the cracks.

@leofang — could you weigh in on (1) and (2)? If those are settled, (3)-(6) are straightforward follow-ups. If the design takes longer than ~a week, I'd suggest bumping #1332 to post-1.0.

I can help with rebase + mechanical fixes once the design is locked.

@leofang leofang assigned Andy-Jost and unassigned rparolin Apr 24, 2026
@leofang leofang marked this pull request as draft April 24, 2026 03:09
@rparolin rparolin self-assigned this Apr 27, 2026
…m_advise_prefetch

# Conflicts:
#	cuda_bindings/pixi.lock
#	cuda_core/cuda/core/_memory/_buffer.pyx
#	cuda_core/docs/source/api.rst
#	cuda_core/docs/source/release/0.7.x-notes.rst
#	cuda_core/pixi.lock
@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Apr 27, 2026
rparolin and others added 12 commits April 27, 2026 16:46
Upstream renamed get_binding_version → binding_version and moved it from
cuda.core._utils.cuda_utils to cuda.core._utils.version. Update the
managed-memory ops module to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cuda.core.experimental namespace is being deprecated and should not
gain new submodules. Per review feedback, the managed_memory module
should only be reachable via cuda.core.managed_memory, not via the
experimental compatibility shim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Frozen dataclass with classmethod constructors for the four CUmemLocationType
kinds (device, host, host_numa, host_numa_current). Validates id constraints
in __post_init__. Re-exported from cuda.core.managed_memory.

This will replace the location=/location_type= kwargs in the upcoming
unified 1..N managed-memory ops API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Centralizes back-compat coercion for managed-memory Location inputs:
- Location → passthrough
- Device → Location.device(device_id)
- int >= 0 → Location.device(int)
- int == -1 → Location.host()
- None → None when allow_none=True, else ValueError

Will be used by the unified 1..N managed-memory ops API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
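The coercion rules listed in this commit message can be sketched as follows. Everything here is illustrative: the function name, the minimal Location stand-in, and the assumption that a Device exposes a device_id attribute are all hypothetical.

```python
# Hedged sketch of the Location coercion rules; names are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:  # minimal stand-in for the real dataclass
    kind: str
    id: int = -1

    @classmethod
    def device(cls, i):
        return cls("device", i)

    @classmethod
    def host(cls):
        return cls("host")


def coerce_location(value, allow_none=False):
    if value is None:
        if allow_none:
            return None
        raise ValueError("a location is required here")
    if isinstance(value, Location):
        return value  # Location -> passthrough
    if hasattr(value, "device_id"):  # Device -> Location.device(device_id)
        return Location.device(value.device_id)
    if isinstance(value, int):
        if value == -1:  # -1 -> host
            return Location.host()
        if value >= 0:  # int >= 0 -> device
            return Location.device(value)
    raise TypeError(f"cannot coerce {value!r} to a Location")
```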
The legacy-bindings monkeypatch tests still referenced get_binding_version,
which was renamed to binding_version in cf2f20d. Update both occurrences.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback on _buffer.pyx:

- Restore `inline` on `_init_mem_attrs` and `_query_memory_attrs`.
- Set `out.is_managed = (is_managed != 0)` once outside the if/elif,
  rather than per-branch (driver leaves the attribute zero for
  non-managed pointers, so all three branches converged on the same
  value anyway).
- Add a TODO noting that HMM/ATS-enabled sysmem should also report
  `is_managed=True`; the CU_POINTER_ATTRIBUTE_IS_MANAGED query does
  not capture that yet.

The Cython modernization of _managed_memory_ops.pyx (cimport cydriver,
IF/ELSE for the 12/13 ABI split) is folded into Tasks 5-8 where the
public API is being rewritten anyway; doing it here would mean
rewriting the same call sites twice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite prefetch() with the unified single-or-batched signature targeted by
issue NVIDIA#1333:

- prefetch(targets, location, *, options=None, stream)
- targets accepts a single Buffer or a sequence of Buffers
- location accepts a Location dataclass, Device, int (-1 = host), or a
  sequence broadcasting to per-buffer locations
- length mismatch raises ValueError; empty targets raises ValueError
- options is reserved for future per-call flags and must be None
- stream moved to the end, kept keyword-only

Internals: switch from Python-level driver.cuMemPrefetchAsync to
Cython-level cydriver.cuMemPrefetchAsync via cimport cydriver, with
HANDLE_RETURN. Replace the runtime _V2_BINDINGS check with compile-time
IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE per the codebase precedent in
_managed_memory_resource.pyx, _memory_pool.pyx, _tensor_map.pyx.

N>1 dispatches to cydriver.cuMemPrefetchBatchAsync (CUDA 13 only); on
CUDA 12 builds, batched prefetch raises NotImplementedError. Single-range
prefetch continues to work on both CUDA 12 and 13 builds.

The location_type= keyword is removed; callers express location kind via
the Location dataclass added in 20d036e.

The advise() and discard_prefetch() functions still use the legacy
_normalize_managed_location helper and Python-level driver calls; they
will be migrated in their own tasks.

Also drops test_managed_memory_prefetch_uses_legacy_bindings_signature,
which monkeypatched the Python-level driver.cuMemPrefetchAsync — no
longer applicable since the prefetch path uses cydriver. The corresponding
advise legacy-bindings test stays for now (advise still uses Python driver).

Closes Andy-Jost's review comment that the existing API is "non-Pythonic"
by making it Pythonic in a different direction (typed Location dataclass)
while preserving the free-function shape pending Leo's tie-break on
ManagedBuffer subclass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new discard(targets, *, options=None, stream) free function that
wraps cuMemDiscardBatchAsync. Accepts a single Buffer or a sequence;
N>=1 dispatches to the batched driver entry point. Requires a CUDA 13
build of cuda.core (NotImplementedError on CUDA 12 builds).

Closes the second of three batched managed-memory operations from NVIDIA#1333:
  P1: cuMemDiscardBatchAsync               <- this commit
  P1: cuMemPrefetchBatchAsync              <- 818f5d2
  P1: cuMemDiscardAndPrefetchBatchAsync    <- next commit

Re-exported from cuda.core.managed_memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…driver

Rewrite discard_prefetch() with the unified single-or-batched signature:

  discard_prefetch(targets, location, *, options=None, stream)

- targets accepts a single Buffer or a sequence of Buffers
- location accepts a Location, Device, int, or per-buffer sequence
- length mismatch / empty targets raise ValueError
- options must be None (reserved)
- stream moved to end, kept keyword-only

Internals: switch from Python-level driver.cuMemDiscardAndPrefetchBatchAsync
to Cython-level cydriver.cuMemDiscardAndPrefetchBatchAsync. The runtime
discard-prefetch availability check is replaced by compile-time
IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE; on CUDA 12 builds the call raises
NotImplementedError.

The location_type= keyword is removed; use Location dataclass instead.

Closes the third managed-memory batched op from NVIDIA#1333.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aratus

Rewrite advise() with the unified single-or-batched signature:

  advise(targets, advice, location=None, *, options=None)

- targets accepts a single Buffer or a sequence
- advice still accepts string aliases or driver.CUmem_advise enum values
- location accepts Location dataclass, Device, int, None, or per-buffer
  sequence (None permitted only for set_read_mostly, unset_read_mostly,
  unset_preferred_location)
- Per-advice allowed-kind validation ported to operate on Location.kind
  (matches CUDA driver constraints from existing tables)
- options reserved for future per-call flags
- For N>1, loops cydriver.cuMemAdvise per buffer (no batched advise API
  exists in CUDA)

Internals: switch to cydriver.cuMemAdvise (Cython-level); use compile-time
IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE for the 12/13 ABI split.

Drop the legacy apparatus that all four functions previously shared:
- _normalize_managed_location (returned Python driver.CUmemLocation)
- _make_managed_location, _managed_location_enum
- _managed_location_uses_v2_bindings + _V2_BINDINGS lazy cache
- _managed_location_to_legacy_device + _LEGACY_LOC_DEVICE/HOST cache
- _require_managed_discard_prefetch_support
- Unused module-level constants (_HOST_NUMA_CURRENT_ID,
  _SINGLE_RANGE_COUNT, _MANAGED_OPERATION_FLAGS, etc.)

Also drop test_managed_memory_advise_uses_legacy_bindings_signature and
the _LEGACY_BINDINGS_VERSION constant; the runtime version switch is
gone, replaced by compile-time IF/ELSE that the test could not exercise.
The CUDA 12 vs CUDA 13 paths are now covered by the build-matrix CI job.

Closes Task 8 (advise) and Task 9 (legacy-bindings test cleanup) from
docs/superpowers/plans/2026-04-27-managed-memory-ops-batched.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… ops

_require_managed_buffer was poking at Buffer._mem_attrs.is_managed
directly via _init_mem_attrs(). PR NVIDIA#1924 added the public Buffer.is_managed
property which falls back to MemoryResource.is_managed when the pointer
attribute query does not advertise managed memory (the case for pool-
allocated managed memory).

Switch _require_managed_buffer to the public property. This also fixes
a latent bug where pool-allocated managed buffers were being rejected
by the managed_memory ops despite Buffer.is_managed correctly reporting
True.

Drops the no-longer-needed cimport of _init_mem_attrs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
api.rst: add Location and discard to the managed_memory autosummary.

1.0.0-notes.rst: replace the placeholder bullet with a description of the
unified 1..N API, the Location dataclass, and the dispatch to batched
driver entry points on cuda.bindings 12.8+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rparolin rparolin added the feature New feature or request label Apr 28, 2026
…n docstring

Per /simplify review, remove WHAT-only comments that just restate the
function signature in front of _coerce_buffer_targets and
_broadcast_locations. Tighten the _coerce_location docstring to lead
with the conversion intent rather than restate the type annotation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rparolin rparolin requested review from Andy-Jost and leofang April 28, 2026 01:25
@rparolin rparolin marked this pull request as ready for review April 28, 2026 01:25
@leofang leofang added the P1 Medium priority - Should do label Apr 28, 2026
rparolin and others added 2 commits April 27, 2026 18:41
- ruff auto-applied:
  * Drop unused `_managed_memory_ops` test import (no longer needed
    after the legacy-bindings monkeypatch test was deleted)
  * Drop "Location" string-quoted forward refs in
    _managed_location.py (file already uses `from __future__ import
    annotations`)
  * Reformat string concatenations and add blank-line-after-import
    spacing
- cython-lint auto-applied:
  * Drop unused libc.stdint cimport of `uintptr_t`
  * Drop unused `Location` Python import (only used in docstrings)
  * Drop unused `n` local in `discard()`
  * Move `cpython.mem cimport` of PyMem_Free / PyMem_Malloc inside
    the `IF CUDA_CORE_BUILD_MAJOR >= 13:` block where the symbols
    are actually used; cython-lint cannot see across compile-time
    branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review request (NVIDIA#1775 (comment)),
fold the managed-memory free functions and the Location dataclass into
cuda.core.utils rather than maintaining a dedicated cuda.core.managed_memory
namespace.

- Re-export Location, advise, prefetch, discard, discard_prefetch from
  cuda.core.utils.
- Delete cuda.core.managed_memory module.
- Update cuda.core.__init__ to drop the managed_memory submodule import.
- Update tests to import from cuda.core.utils.
- Update api.rst: drop the dedicated Managed memory section; add the
  managed-memory entries to the Utility functions section.
- Update 1.0.0-notes.rst accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Support managed memory advise, prefetch, and discard-prefetch
