[FEAT][Python] Tie Python wrapper lifetime to underlying C++ FFI object#593
[FEAT][Python] Tie Python wrapper lifetime to underlying C++ FFI object#593cyx-6 wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces 'PyObjectTying,' a mechanism to link the lifetime of Python wrappers to their underlying C++ FFI objects. By prepending a PyCustomizeAllocHeader to object allocations, the implementation enables stable object identity and address preservation across wrapper finalization cycles. The changes include modifications to the C++ SimpleObjAllocator, Cython bindings to utilize tp_finalize, and updates to the reflection system. Feedback from the review highlights a critical thread-safety issue in the callback used to free cached wrapper memory, which lacks GIL protection, and a potential memory leak when installing handles if a previous phantom wrapper is not correctly released.
435caeb to
1b751b2
Compare
Make `a.x is a.x`, `id(a.x)` stable across drop+refetch, and `f(x) is x`
for FFI returns by attaching a 16-byte custom-allocator header to every
Object allocation. Implements the PyObjectTying design.
Design:
- Two-layer custom-allocator hook in core libtvm_ffi.so:
`TVMFFIObjectAllocHeader { delete_space }`,
`TVMFFICustomAllocator { allocate, context }`, plus
`TVMFFIGetCustomAllocator` / `TVMFFISetCustomAllocator`. libtvm_ffi
installs a builtin default at registry init, so every Object always
carries the 8-byte base header and `TVMFFIGetCustomAllocator` never
returns NULL. The Python Cython module overrides the global default
at module load with `TVMFFIPyAllocate`, which prepends a 16-byte
`PyCustomAllocHeader` encoding the Python wrapper binding.
`make_object` / `make_inplace_array_object` / `PyClassDeleter` and
the Rust `ObjectArc::new[_with_extra_items]` paths all funnel through
the registry, so Python-defined types and Rust-allocated objects
share the layout and lifetime semantics.
- State machine concentrated in `tvm_ffi_python_helpers.h`.
`PyCustomAllocHeader` is `{ PyObject* cached_pyobj; base; }` (16 B).
The Active vs Inactive distinction lives on the wrapper's own
`chandle` field, not on a tag bit, so the header layout stays
pointer-equality clean:
Detached: cached_pyobj == NULL (no wrapper)
Active: cached_pyobj == self AND self.chandle == chandle (wrapper owns +1)
Inactive: cached_pyobj == self AND self.chandle == NULL (preserved bytes)
Four helpers expose the transitions:
`TVMFFIPyTryGetAttachedPyObject`, `TVMFFIPyAttachPyObject`,
`TVMFFIPyDetachPyObject`, `TVMFFIPyTPFinalize`.
- Universal cache-on for `make_ret`. Every FFI return funnels through
`make_ret_object`, which returns the canonical wrapper for a chandle
when one exists. Combined with Inactive -> Active revival on the
cached path, this delivers `a.x is a.x`, stable `id(a.x)` across
drop+refetch, and `f(x) is x` when a function returns the same
chandle it received.
- Inactive state via Cython `def __del__` on `CObject`, mapping to
`tp_finalize` (PEP 442). When other C++ holders keep the chandle
alive past the wrapper's Python refcount hitting 0,
`TVMFFIPyTPFinalize` nulls `self.chandle` BEFORE DECREFing the
chandle, then `Py_IncRef(self)` to resurrect the PyObject.
`cached_pyobj` is unchanged so a future re-fetch can rebind. Cython
auto-generates `tp_dealloc`, which now enters with the
`self.chandle == NULL` invariant established by `tp_finalize`.
Dropped the Py_LIMITED_API / SABI shim and the
`CYTHON_USE_TP_FINALIZE` define.
- Shutdown guard. `TVMFFIPyMarkPythonFinalizing` is wired to atexit
from Cython module init, flipping a `std::atomic<bool>` flag read
by `TVMFFIPyDeleteSpace` (via `TVMFFIPyIsPythonAlive`) before
`PyGILState_Ensure` to avoid GIL acquire after Python finalization
has begun. Inactive wrapper bytes on chandles still alive at that
point are intentionally leaked -- process is exiting.
- Frontend-allocation detection by `delete_space` pointer comparison
(`TVMFFIPyIsCanonical`) avoids a flag bit on `TVMFFIObject`.
Pre-Python-init chandles (statically-initialized global functions in
libtvm_ffi.so) carry only the base header; the Python side detects
this and skips the binding install.
`_move()` semantics: callback args alias the caller's wrapper under
universal cache-on (one wrapper, one chandle ref) and FFI function
returns of the same chandle alias the caller's wrapper. `_move()` is
kept as an API: the rvalue setter on either caller or callback side
eager-detaches the canonical-wrapper binding before the C++
AnyViewToOwnedAny transfer nulls the source chandle.
`test_function.py::test_rvalue_ref` is refactored for the new
aliasing-aware use_count expectations.
Tests: full Python suite passes (2332 passed, 19 skipped, 2 xfailed);
Rust suite passes. New `tests/python/test_pyobject_tying.py` covers
Active/Inactive transitions, cache-on aliasing, `_move()` under
cache-on, pickle stress, threading stress, GC integration with the
Inactive state, multi-chandle isolation, and the weakref limitation.
TODO carried in `function.pxi::_get_global_func`: name-keyed dict
cache for static-init Function id-stability. Most registry-resident
Functions are allocated at C++ static init (before the Python
allocator is registered) so their chandle has only the base header --
make_ret_object's cache can't reach them. Deferred to keep this change
small.
| if self.chandle != NULL: | ||
| CHECK_CALL(TVMFFIObjectDecRef(self.chandle)) | ||
| self.chandle = NULL | ||
| def __del__(self): |
There was a problem hiding this comment.
Let's double check and test it out, making sure this __del__ method will be properly invoked. I had some really negative experience with overloading __del__ because it could possibly prevent Python GC from recycling memory in certain cases. Not 100% sure if it applies to Cython's cdef class's __del__ method though.
There was a problem hiding this comment.
a tp_finalize-free version is updated. we are now using tp_alloc + tp_free as alternative.
| cls = make_fallback_cls_for_type_index(type_index) | ||
| obj = cls.__new__(cls) | ||
| (<CObject>obj).chandle = result.v_obj | ||
| TVMFFIPyAttachPyObject(result.v_obj, <PyObject*>obj) |
There was a problem hiding this comment.
comment, this is a new class, so it must be in detached state, attach it
| dlpack.deleter(dlpack) | ||
| pass | ||
| # default return the tensor | ||
| # default return the tensor. |
There was a problem hiding this comment.
NOTE: we don't try to attach py object for tensor types as it may appear in callback and there maybe translation depending on the context
| // ---------- | ||
| // I1. When a PyObject goes out of scope (no Python var refers to it), | ||
| // its +1 on chandle is always released. | ||
| // I2. When a chandle is destroyed, its cached PyObject (if any) is |
There was a problem hiding this comment.
When a chandle is destroyed, we can only be in Detached or Inactive state, if it is in Inactive state, the PyObject is reclaimed
| // ``self.chandle`` and DecRef. | ||
| // Detached -> Detached : ``self`` is not the canonical wrapper for | ||
| // this chandle (eager-detach via move already | ||
| // cleared the binding). Just DecRef. |
There was a problem hiding this comment.
need clarification here, since Detached menas chandle == NULL
| */ | ||
| TVM_FFI_INLINE void TVMFFIPyTPFinalize(void** ptr_to_chandle, PyObject* wrapper) { | ||
| void* chandle = *ptr_to_chandle; | ||
| if (chandle == nullptr) return; |
| * | ||
| * Detached -> Detached: ``cached_pyobj`` was already cleared by an | ||
| * eager move, or chandle was not allocated through the Python custom | ||
| * allocator. Just null ``wrapper.chandle`` and DecRef. |
There was a problem hiding this comment.
no need to DecRef here? just return and pass to dealloc
Refine the Python-wrapper-to-C++-object tying mechanism introduced in the prior commit. Move wrapper cleanup from tp_finalize (__del__) to tp_dealloc (__dealloc__), and revive an Inactive (dead-but-cached) allocation in place so a re-fetched wrapper keeps a stable id() at the same address. - object.pxi: drive cleanup from __dealloc__ via TVMFFIPyTpDealloc; rework make_ret_object into the cache-&-revive dispatcher (LIVE HIT drops the caller's redundant +1; REVIVAL reuses the inactive allocation via the thread-local revive slot); install the custom tp_alloc / tp_free slots for every registered FFI type at the _update_registry choke point. - helpers.h: two-output TVMFFIPyTryGetAttachedPyObject(chandle, &out) plus TVMFFIPyIsDetached; document the tagged_pyobj state machine (Detached / Active / Inactive), its invariants, and where each transition happens. - base.pxi / function.pxi: track the renamed externs. - tests: add TestNoUnboundedLeak (one pinned wrapper per distinct chandle must not accumulate) and trim heavy iteration counts on the tying tests.
|
Btw - does this approach also work for String and Bytes? I’m asking because I wasn’t sure if small string optimization is compatible with this design |
|
@junrushao Not working for |
| TVMFFIObjectAllocHeader base; | ||
| }; | ||
|
|
||
| constexpr size_t kPyHeaderOffset = sizeof(PyCustomAllocHeader); // 16 |
| * need ``PyObject_InitVar(.., nitems)`` here and a matching basicsize check. | ||
| */ | ||
| inline PyObject* TVMFFIPyTpAlloc(PyTypeObject* type, Py_ssize_t nitems) { | ||
| void* blk = TVMFFIPyTlsReviveSlot(); |
| * count). Per-thread -> free-threading safe; strict read-and-clear. | ||
| */ | ||
| inline void*& TVMFFIPyTlsReviveSlot() { | ||
| static thread_local void* slot = nullptr; |
| * ``TVMFFIPyTpAlloc`` (which is handed only ``type`` and an item | ||
| * count). Per-thread -> free-threading safe; strict read-and-clear. | ||
| */ | ||
| inline void*& TVMFFIPyTlsReviveSlot() { |
| // stable wrapper class for the life of the process, and ``make_ret_object`` | ||
| // derives both the cached allocation (from the chandle) and ``type`` (= cls for that | ||
| // same type_index) from the very same chandle on the revival path. | ||
| assert(type->tp_itemsize == 0 && |
There was a problem hiding this comment.
static PyObject* custom_tp_alloc(PyTypeObject *type, Py_ssize_t nitems) {
PyCustomizeAllocHeader *header = get_current_header();
// Hot Path: Reuse
if (header && header->cached_mem) {
PyObject *obj = header->cached_mem;
// Inline-friendly initialization
PyObject_VarInit((PyVarObject *)obj, type, nitems);
// Only memset the delta between header and body
// This is where your real time is spent
size_t head_sz = (nitems > 0 || type->tp_itemsize > 0) ? sizeof(PyVarObject) : sizeof(PyObject);
memset((char*)obj + head_sz, 0, type->tp_basicsize - head_sz);
if (PyType_IS_GC(type)) {
PyObject_GC_Track(obj);
}
return obj;
}
// Cold Path: Slow allocation
return PyType_GenericAlloc(type, nitems);
}| * on, GC or not. | ||
| */ | ||
| inline void TVMFFIPyTpFree(void* self) { | ||
| if (reinterpret_cast<TVMFFIPyCObjectHead*>(self)->chandle == TVM_FFI_PY_INACTIVE_SENTINEL) { |
There was a problem hiding this comment.
add a runtime offset check test case
There was a problem hiding this comment.
void** TVMFFICyObjectGetCHandlePtr(PyObject* ptr)
cdef void** TVMFFICyObjectGetCHandlePtr(PyObject* ptr):
return &((<CObject>ptr).chandle)
| TVM_FFI_INLINE void TVMFFIPyTpDealloc(void** ptr_to_chandle, PyObject* wrapper) { | ||
| void* chandle = *ptr_to_chandle; | ||
| // NULL: already detached/moved. INACTIVE_SENTINEL: an inactive cached | ||
| // allocation must never re-enter __dealloc__ while inactive (defensive). |
There was a problem hiding this comment.
move TVMFFIPyTpDealloc after TVMFFIPyTpAlloc, TVM_FFI_PY_INACTIVE_SENTINEL/TVMFFIPyTpDealloc/ TVMFFIPyTpFree/ sit together, document the overall logic in a section, TVM_FFI_PY_INACTIVE_SENTINEL=> TVM_FFI_PY_CHANDLE_INACTIVE_SENTINEL
| * header. Just null ``wrapper.chandle`` and DecRef. | ||
| */ | ||
| TVM_FFI_INLINE void TVMFFIPyTpDealloc(void** ptr_to_chandle, PyObject* wrapper) { | ||
| void* chandle = *ptr_to_chandle; |
There was a problem hiding this comment.
Document, after dealloc, there are two possibilities:
- chandle is set to CHANDLE_INACTIVE_SENTINEL, tpfree will not be triggered
- other cases, normal free
| * time); the only cost is no stable-id-across-drop. | ||
| */ | ||
| TVM_FFI_INLINE bool TVMFFIPyIsInactiveEligible(PyObject* wrapper) { | ||
| PyTypeObject* tp = Py_TYPE(wrapper); |
| * | ||
| * The (return, ``*out``) pair has four combinations; one cannot occur: | ||
| * (true, non-NULL) : Active -- the live canonical wrapper. | ||
| * (false, NULL) : Detached / non-Python -- no wrapper bound. |
There was a problem hiding this comment.
document
\return whether returned PyObject state is active
…ers.h Rework the PyObject-tying dealloc documentation to describe the current race-free logic concisely rather than re-deriving the resolved TOCTOU: - Trim the handshake overview to core flow; keep the Case 0/1/2 anchors. - Unify terminology on Inactive / cached allocation (drop husk / park). - Fold the DEALLOC_TRANSIT transience note into invariant I3; simplify the macro doc and the TVMFFIPyCObjectHead mirror comment. - Deduplicate the three callback docstrings against the overview and the transition map; restore the dropped non-canonical sub-case in TpDealloc. - Remove the dead defensive 'chandle == DEALLOC_TRANSIT' guard branch in TVMFFIPyTpDealloc (the marker never outlives its own dealloc chain) and document the relied-upon invariant. Test: tighten live_inner_count to 'type(ob) is InnerLeak' (exact-type match).
Apply Tianqi's PR apache#593 review feedback to the pyobject-tying layer: - Drop the kPyHeaderOffset constant; use sizeof(PyCustomAllocHeader) directly at every site (and in the static_asserts). - Rename TVMFFIPyTlsReviveSlot -> TVMFFIPyTLSReviveSlot and make its sole access primitive a swap-exchange (store next, return prior). The slot is typed PyObject*, and taking the block is TVMFFIPyTLSReviveSlot(nullptr) -- read-and-clear in one step, no separate clear to forget. - Document \param/\return on TVMFFIPyTryGetAttachedPyObject. - Reorder the slot functions into execution order (TpAlloc -> TpDealloc -> TpFree) with the DEALLOC_TRANSIT marker declared between them. - Replace the hand-rolled TVMFFIPyCObjectHead layout mirror with a Cython cdef public accessor TVMFFICyObjectGetCHandlePtr, so the chandle field offset comes from the real cdef class layout instead of a hardcoded one. Build clean; full Python suite passes (2336 passed, 18 skipped, 2 xfailed).
Resolve the lint job failures on PR apache#593: - ruff-check: add one-line docstrings to test methods (D102), hoist threading/weakref imports to top-level (PLC0415). - ty: annotate the rvalue-move callback params as `Any` instead of `object` so `x._move()` resolves (matches test_function.py). - ruff-format / clang-format / cmake-format: apply formatter reflow. - taplo-format: expand the cibuildwheel `build` array to multi-line in pyproject.toml (applied from CI log; taplo has no wheel for this platform). Reproduced locally via pre-commit (taplo-format skipped).
|
Any updates on this issue? |
…gn>8 double-free The pyobject-tying state machine stores a wrapper back-pointer in each object's alloc header and mutates it with no synchronization -- correct only under the GIL. On free-threaded CPython (3.14t) concurrent attribute access race-frees the wrapper (segfault in TestThreadingStress). Disable tying wholesale on free-threaded builds at its two one-time setup gates, under #ifdef Py_GIL_DISABLED: TVMFFIPyRegisterDefaultAllocator does not install the custom allocator and TVMFFIPyInstallTypeSlots no-ops. With the builtin allocator left in place every chandle is non-canonical, so all state-machine helpers take their existing no-op tails -- no Cython change, GIL build byte-identical. The cost on free-threaded is losing identity stickiness (a.x is a.x / stable id() / f(x) is x), not correctness. This uncovered a latent double-free in BuiltinDefaultAllocate / BuiltinDefaultDeleteSpace: allocate offset the body by round_up(header, alignment) but free subtracted a fixed sizeof(header), symmetric only for alignment <= 8. Reflection dataclasses request alignof(max_align_t)=16, so a 16-aligned object placed at base+16 was freed at base+8 -> free(): invalid pointer. It was masked on the GIL build because the symmetric custom allocator shadows the builtin one for all Python objects. Use a fixed body offset = alignof(max_align_t), symmetric for every alignment. Mark identity-stickiness tests skipif(free-threaded); value / no-crash / no-leak / distinctness tests stay active.
Summary
Make
a.x is a.x,id(a.x)stable across drop+refetch, andf(x) is xfor FFI returns, by attaching a 16-byte custom-allocator header to every Object allocation. Implements Tianqi Chen's PyObjectTying doc.Before:
After: both hold, and identity is preserved across a wrapper finalize-and-revive cycle whenever the C++ object outlives the wrapper.
Design
A two-layer allocator hook lives in core libtvm_ffi:
TVMFFIObjectAllocHeader { delete_space }— 8-byte base header preceding every Object body.TVMFFICustomAllocator { allocate, context }— process-wide registry; libtvm_ffi installs a builtin default at registry init soTVMFFIGetCustomAllocatornever returns NULL.TVMFFIGetCustomAllocator/TVMFFISetCustomAllocator— frontends override the default at module load.The Python Cython module overrides the global default with
TVMFFIPyAllocate, which prepends a 16-bytePyCustomAllocHeader { PyObject* cached_pyobj; base; }encoding the wrapper binding. The Rust crate (ObjectArc::new[_with_extra_items]) and the Python-defined types inextra/dataclass.ccroute through the same registry, so layout and lifetime semantics are uniform across frontends.The Active vs Inactive distinction lives on the wrapper's own
chandlefield rather than on a tag bit, so the header layout stays pointer-equality clean:header.cached_pyobjwrapper.chandleNULLwrapperchandlewrapperNULLThe state machine and four transition helpers (
TVMFFIPyTryGetAttachedPyObject/TVMFFIPyAttachPyObject/TVMFFIPyDetachPyObject/TVMFFIPyTPFinalize) are concentrated inpython/tvm_ffi/cython/tvm_ffi_python_helpers.h. Three invariants pin the contract:wrapper.chandle != NULLimplies the wrapper owns +1 on that chandle.Implementation highlights
make_ret— every FFI return funnels throughmake_ret_object, which returns the canonical wrapper for a chandle when one exists. Combined with Inactive -> Active revival, this deliversa.x is a.x, stableid(a.x)across drop+refetch, andf(x) is xfor matching returns.tp_finalize— Cythondef __del__maps totp_finalize(PEP 442). When other C++ holders keep the chandle alive past Python refcount 0,TVMFFIPyTPFinalizenullsself.chandlebefore DECREFing the chandle (preserving I3), thenPy_IncRef(self)to resurrect the PyObject.cached_pyobjis left pointing at the surviving wrapper for future revival. Cython auto-generatestp_dealloc, which now enters with theself.chandle == NULLinvariant established bytp_finalize.CYTHON_USE_TP_FINALIZEshim._move()semantics preserved — the rvalue setter eager-detaches the canonical-wrapper binding before the C++ AnyViewToOwnedAny transfer nulls the source chandle, so a downstream cache lookup never sees a stale back-pointer.TVMFFIPyIsCanonicalcomparesdelete_spaceagainstTVMFFIPyDeleteSpaceto recognize Python-allocated chandles, avoiding a flag bit onTVMFFIObject. Pre-Python-init chandles (statically-initialized global functions in libtvm_ffi.so) carry only the base header and skip the binding install.TVMFFIPyMarkPythonFinalizingis wired to atexit and flips astd::atomic<bool>flag;TVMFFIPyDeleteSpacechecks it (viaTVMFFIPyIsPythonAlive) beforePyGILState_Ensureto avoid touching a teardown interpreter.Test plan
tests/python/test_pyobject_tying.pycovers Active/Inactive transitions, cache-on aliasing,_move()under cache-on, pickle stress, threading stress, GC integration with the Inactive state, multi-chandle isolation, and the weakref limitationtest_function.py::test_rvalue_refrefactored for cache-on aliasing semanticsOut-of-scope follow-ups
_get_global_functo deliver id-stability for static-init Functions (whose chandles predate the Python allocator and so carry only the base header). Tracked as a TODO infunction.pxi::_get_global_func.