Fix tcache corruption when fibers migrate across threads under whole-program LTO#7
Draft
azat wants to merge 5 commits into
Draft
Fix tcache corruption when fibers migrate across threads under whole-program LTO#7azat wants to merge 5 commits into
azat wants to merge 5 commits into
Conversation
A worker-thread pool runs ucontext fibers that each do free/swapcontext/malloc
in one frame, so a fiber is routinely resumed on a different OS thread than it
suspended on. With the tsd hoisting bug and a whole-program-LTO build the
inlined fastpath frees/allocates against the previous thread's tcache and the
process crashes; with the fix it runs to completion.
Reproducing requires the allocator inlined next to the swapcontext, so:
- it is a standalone program (no test harness, so it can be static-linked
without symbol clashes) that calls jemalloc via JEMALLOC_MANGLE, so
malloc/free bind to the configured-prefix symbols (the libc wrappers do not
inline) -- independent of --with-jemalloc-prefix;
- Makefile.in links this one test against libjemalloc.a with --whole-archive
(guarded by enable_static), not the shared library;
- it only manifests under whole-program LTO -- a no-op guard otherwise.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `tcache_fiber_migration` reproducer (issue jemalloc#2890) only exercises the bug when the allocator is statically linked and inlined under whole-program LTO. None of the existing lanes build that way, so add a dedicated Linux lane: clang ThinLTO, `--with-jemalloc-prefix=je_`, and llvm-ar/nm/ranlib (to archive the LTO bitcode) + `-fuse-ld=lld`, then `make check`. Authored in scripts/gen_gh_actions.py; .github/workflows/linux-ci.yml regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`tsd_get` returned `&tsd_tls`, a loop-invariant address the compiler forms as `thread_pointer + const_offset`. Under whole-program LTO it can be hoisted out of the inlined `malloc`/`free` and kept in a callee-saved register across a user-space context switch (`swapcontext`, ucontext / `boost::context` fibers, coroutine runtimes). If the context resumes on a *different* OS thread, the cached tsd/tcache still belongs to the previous thread and the two race on one tcache -- silent heap corruption that reproduces only under LTO and is invisible to sanitizers. Route every GNU TSD backend's `tsd_get` through `JEMALLOC_TLS_ADDR(tsd_tls)`, backed by a per-variable accessor (`JEMALLOC_TLS_ADDR_DEFINE`) that takes the address *inside* a `noinline` function holding a `memory` barrier. The optimizer can neither inline it, prove it pure, nor hoist/CSE the access across the switch, so the thread-local is resolved afresh on the running thread. Correct for every TLS model -- the access sits behind an opaque call boundary -- at the cost of one real call per public allocator entry. Non-GNU compilers (MSVC) have no inline asm and keep the plain `&tsd_tls`. An earlier attempt re-read only the thread pointer with a `volatile` asm and added the compile-time offset; that proved insufficient -- clang ThinLTO still hoisted the access and corrupted the tcache (verified: crash with the asm path, clean with this accessor, same source/model/compiler). The address itself has to be materialized behind the call boundary. The accessor has a measurable cost: Non-LTO: | op | no-fix | fix | overhead | |-----------|---------|---------|-----------------| | malloc(1) | ~7.0 ns | 9.79 ns | +2.8 ns (~40%) | | free | ~6.6 ns | 9.23 ns | +2.6 ns (~40%) | LTO: | op | no-fix | fix | overhead | |-----------|----------|----------|-----------------| | malloc(1) | ~7.02 ns | ~8.99 ns | +1.97 ns (~28%) | | free | ~6.49 ns | ~8.42 ns | +1.9 ns (~29%) |
The noinline accessor puts an opaque call on the malloc/free fastpath, which
the caller cannot CSE, schedule across, or keep registers alive over.
test/stress/microbench (malloc(1)/free pairs, pinned core, clang):
unpatched accessor this commit
malloc, no LTO 7.0 ns/op 9.8 (+40%) 7.0 (+-0%)
free, no LTO 6.5 ns/op 9.2 (+40%) 6.5 (+-0%)
malloc, ThinLTO 7.0 ns/op 9.0 (+28%) 7.6 (~+8%)
free, ThinLTO 6.5 ns/op 8.4 (+28%) 7.0 (~+8%)
Under a static TLS model the address is `thread_pointer + offset` with a
thread-independent offset, so inline the access: capture the offset once at
runtime into a global via a noinline helper (lazy, sentinel-initialized), and
re-read the thread pointer on every call with a `volatile` asm the optimizer
may not hoist or CSE (per-arch reads from mimalloc's `mi_prim_tls_slot`).
Both fast-path inputs are hoist-proof: the tp read is volatile, and the offset
global is thread-independent, so hoisting it across a context switch is
harmless.
Note the offset must NOT be computed inline as
`&tsd_tls - __builtin_thread_pointer()`: the two terms are independently
hoistable, and clang ThinLTO keeps the (stale) `&tsd_tls` in a callee-saved
register across the switch while re-materializing the thread-pointer terms,
producing `fresh_tp + (stale_addr - fresh_tp)` -- the volatile read is
algebraically cancelled back to the stale address:
# preheader # loop body
movq %fs:0, %rax movq %fs:0, %rax <- volatile asm
leaq tsd@TPOFF(%rax), %r14 movq %r14, %rcx
subq %fs:0, %rcx
addl (%rax,%rcx), .. <- = stale r14
Capturing the offset behind a call boundary evaluates both terms at one point
on one thread, where the difference really is the tpoff constant.
The fast path is gated on `JEMALLOC_TLS_MODEL_INITIAL_EXEC` (a new configure
define, set whenever jemalloc applies its default initial-exec model): under
the dynamic models the offset is not thread-independent. It also requires GNU
asm, a known architecture, and !_WIN32 (MinGW's thread pointer lives in the
TEB, not fs/gs:0). Everything else keeps the noinline accessor.
Verified with the fiber-migration reproducer (clang ThinLTO, static link,
je_-prefixed entry points): clean, with the inlined fastpath re-reading the
thread pointer after every `swapcontext`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
958d43f to
9de81c6
Compare
The test links the whole libjemalloc.a (--whole-archive), which contains jemalloc_cpp.o under --enable-cxx, so the link needs libstdc++ (std::set_new_handler, __cxa_*, typeinfo for std::bad_alloc, __gxx_personality_v0). Pull it from $(LIBS) the same way the generic integration rule does; a no-cxx build is unaffected since $(LIBS) then has no -lstdc++. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The problem is that
tsd_get()returns&tsd_tls, which the compiler forms asthread_pointer + const_offsetand treats as loop-invariant. Under whole-program LTO this read is hoisted out of the inlinedmalloc/freeand kept in a callee-saved register across a user-space context switch (swapcontext, ucontext,boost::context). If the fiber resumes on a different OS thread, two threads race on one tcache: silent heap corruption, reproducible only under LTO, invisible to sanitizers.The fix includes two patches:
JEMALLOC_TLS_ADDR(tsd_tls): per-variable noinline accessor that takes theaddress behind a
memorybarrier — an opaque call the optimizer cannothoist/CSE across the switch. Correct for every TLS model (
__tls_get_addr'sresult is equally cacheable, unlike
pthread_getspecific, so dynamic TLS isaffected too). MSVC keeps the plain address.
tpoff constant is captured once at runtime into a global by a noinline
helper;
tsd_getthen doesvolatile_asm_thread_pointer + offset. Notecomputing
&tsd_tls - __builtin_thread_pointer()inline is not correct:clang hoists the stale
&tsd_tlswhile re-materializing the tp terms, andfresh_tp + (stale_addr - fresh_tp)cancels the volatile read back to thestale address (this variant was tested and still corrupts).
test/stress/microbench(malloc(1)/freepairs, pinned core, clang):Refs: jemalloc#2890
Refs: ClickHouse/ClickHouse#102460