Fix tcache corruption when fibers migrate across threads under whole-program LTO by azat · Pull Request #7 · ClickHouse/jemalloc

azat · 2026-06-09T22:07:13Z

The problem is that tsd_get() returns &tsd_tls, which the compiler forms as thread_pointer + const_offset and treats as loop-invariant. Under whole-program LTO this read is hoisted out of the inlined malloc/free and kept in a callee-saved register across a user-space context switch (swapcontext, ucontext, boost::context). If the fiber resumes on a different OS thread, two threads race on one tcache: silent heap corruption, reproducible only under LTO, invisible to sanitizers.

The fix includes two patches:

JEMALLOC_TLS_ADDR(tsd_tls): per-variable noinline accessor that takes the
address behind a memory barrier — an opaque call the optimizer cannot
hoist/CSE across the switch. Correct for every TLS model (__tls_get_addr's
result is equally cacheable, unlike pthread_getspecific, so dynamic TLS is
affected too). MSVC keeps the plain address.
Inline fast path on top (initial-exec model only, jemalloc's default): the
tpoff constant is captured once at runtime into a global by a noinline
helper; tsd_get then does volatile_asm_thread_pointer + offset. Note
computing &tsd_tls - __builtin_thread_pointer() inline is not correct:
clang hoists the stale &tsd_tls while re-materializing the tp terms, and
fresh_tp + (stale_addr - fresh_tp) cancels the volatile read back to the
stale address (this variant was tested and still corrupts).

test/stress/microbench (malloc(1)/free pairs, pinned core, clang):

	unpatched	accessor only	+ fast path
malloc, no LTO	7.0 ns	9.8 (+40%)	7.0 (±0%)
free, no LTO	6.5 ns	9.2 (+40%)	6.5 (±0%)
malloc, ThinLTO	7.0 ns	9.0 (+28%)	7.6 (~+8%)
free, ThinLTO	6.5 ns	8.4 (+28%)	7.0 (~+8%)

Refs: jemalloc#2890
Refs: ClickHouse/ClickHouse#102460

A worker-thread pool runs ucontext fibers that each do free/swapcontext/malloc in one frame, so a fiber is routinely resumed on a different OS thread than it suspended on. With the tsd hoisting bug and a whole-program-LTO build the inlined fastpath frees/allocates against the previous thread's tcache and the process crashes; with the fix it runs to completion. Reproducing requires the allocator inlined next to the swapcontext, so: - it is a standalone program (no test harness, so it can be static-linked without symbol clashes) that calls jemalloc via JEMALLOC_MANGLE, so malloc/free bind to the configured-prefix symbols (the libc wrappers do not inline) -- independent of --with-jemalloc-prefix; - Makefile.in links this one test against libjemalloc.a with --whole-archive (guarded by enable_static), not the shared library; - it only manifests under whole-program LTO -- a no-op guard otherwise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The `tcache_fiber_migration` reproducer (issue jemalloc#2890) only exercises the bug when the allocator is statically linked and inlined under whole-program LTO. None of the existing lanes build that way, so add a dedicated Linux lane: clang ThinLTO, `--with-jemalloc-prefix=je_`, and llvm-ar/nm/ranlib (to archive the LTO bitcode) + `-fuse-ld=lld`, then `make check`. Authored in scripts/gen_gh_actions.py; .github/workflows/linux-ci.yml regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`tsd_get` returned `&tsd_tls`, a loop-invariant address the compiler forms as `thread_pointer + const_offset`. Under whole-program LTO it can be hoisted out of the inlined `malloc`/`free` and kept in a callee-saved register across a user-space context switch (`swapcontext`, ucontext / `boost::context` fibers, coroutine runtimes). If the context resumes on a *different* OS thread, the cached tsd/tcache still belongs to the previous thread and the two race on one tcache -- silent heap corruption that reproduces only under LTO and is invisible to sanitizers. Route every GNU TSD backend's `tsd_get` through `JEMALLOC_TLS_ADDR(tsd_tls)`, backed by a per-variable accessor (`JEMALLOC_TLS_ADDR_DEFINE`) that takes the address *inside* a `noinline` function holding a `memory` barrier. The optimizer can neither inline it, prove it pure, nor hoist/CSE the access across the switch, so the thread-local is resolved afresh on the running thread. Correct for every TLS model -- the access sits behind an opaque call boundary -- at the cost of one real call per public allocator entry. Non-GNU compilers (MSVC) have no inline asm and keep the plain `&tsd_tls`. An earlier attempt re-read only the thread pointer with a `volatile` asm and added the compile-time offset; that proved insufficient -- clang ThinLTO still hoisted the access and corrupted the tcache (verified: crash with the asm path, clean with this accessor, same source/model/compiler). The address itself has to be materialized behind the call boundary. The accessor has a measurable cost: Non-LTO: | op | no-fix | fix | overhead | |-----------|---------|---------|-----------------| | malloc(1) | ~7.0 ns | 9.79 ns | +2.8 ns (~40%) | | free | ~6.6 ns | 9.23 ns | +2.6 ns (~40%) | LTO: | op | no-fix | fix | overhead | |-----------|----------|----------|-----------------| | malloc(1) | ~7.02 ns | ~8.99 ns | +1.97 ns (~28%) | | free | ~6.49 ns | ~8.42 ns | +1.9 ns (~29%) |

The noinline accessor puts an opaque call on the malloc/free fastpath, which the caller cannot CSE, schedule across, or keep registers alive over. test/stress/microbench (malloc(1)/free pairs, pinned core, clang): unpatched accessor this commit malloc, no LTO 7.0 ns/op 9.8 (+40%) 7.0 (+-0%) free, no LTO 6.5 ns/op 9.2 (+40%) 6.5 (+-0%) malloc, ThinLTO 7.0 ns/op 9.0 (+28%) 7.6 (~+8%) free, ThinLTO 6.5 ns/op 8.4 (+28%) 7.0 (~+8%) Under a static TLS model the address is `thread_pointer + offset` with a thread-independent offset, so inline the access: capture the offset once at runtime into a global via a noinline helper (lazy, sentinel-initialized), and re-read the thread pointer on every call with a `volatile` asm the optimizer may not hoist or CSE (per-arch reads from mimalloc's `mi_prim_tls_slot`). Both fast-path inputs are hoist-proof: the tp read is volatile, and the offset global is thread-independent, so hoisting it across a context switch is harmless. Note the offset must NOT be computed inline as `&tsd_tls - __builtin_thread_pointer()`: the two terms are independently hoistable, and clang ThinLTO keeps the (stale) `&tsd_tls` in a callee-saved register across the switch while re-materializing the thread-pointer terms, producing `fresh_tp + (stale_addr - fresh_tp)` -- the volatile read is algebraically cancelled back to the stale address: # preheader # loop body movq %fs:0, %rax movq %fs:0, %rax <- volatile asm leaq tsd@TPOFF(%rax), %r14 movq %r14, %rcx subq %fs:0, %rcx addl (%rax,%rcx), .. <- = stale r14 Capturing the offset behind a call boundary evaluates both terms at one point on one thread, where the difference really is the tpoff constant. The fast path is gated on `JEMALLOC_TLS_MODEL_INITIAL_EXEC` (a new configure define, set whenever jemalloc applies its default initial-exec model): under the dynamic models the offset is not thread-independent. It also requires GNU asm, a known architecture, and !_WIN32 (MinGW's thread pointer lives in the TEB, not fs/gs:0). Everything else keeps the noinline accessor. Verified with the fiber-migration reproducer (clang ThinLTO, static link, je_-prefixed entry points): clean, with the inlined fastpath re-reading the thread pointer after every `swapcontext`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The test links the whole libjemalloc.a (--whole-archive), which contains jemalloc_cpp.o under --enable-cxx, so the link needs libstdc++ (std::set_new_handler, __cxa_*, typeinfo for std::bad_alloc, __gxx_personality_v0). Pull it from $(LIBS) the same way the generic integration rule does; a no-cxx build is unaffected since $(LIBS) then has no -lstdc++. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

azat and others added 4 commits June 10, 2026 00:11

azat force-pushed the fix-lto-tcache-tls-fiber branch from 958d43f to 9de81c6 Compare June 9, 2026 22:11

azat mentioned this pull request Jun 9, 2026

Fix jemalloc tcache corruption with LTO and re-enable LTO (for jemalloc) ClickHouse/ClickHouse#106898

Draft

azat marked this pull request as draft June 9, 2026 22:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tcache corruption when fibers migrate across threads under whole-program LTO#7

Fix tcache corruption when fibers migrate across threads under whole-program LTO#7
azat wants to merge 5 commits into
devfrom
fix-lto-tcache-tls-fiber

azat commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azat commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

azat commented Jun 9, 2026 •

edited

Loading