Skip to content

Fix tcache corruption when fibers migrate across threads under whole-program LTO#7

Draft
azat wants to merge 5 commits into
devfrom
fix-lto-tcache-tls-fiber
Draft

Fix tcache corruption when fibers migrate across threads under whole-program LTO#7
azat wants to merge 5 commits into
devfrom
fix-lto-tcache-tls-fiber

Conversation

@azat

@azat azat commented Jun 9, 2026

Copy link
Copy Markdown
Member

The problem is that tsd_get() returns &tsd_tls, which the compiler forms as thread_pointer + const_offset and treats as loop-invariant. Under whole-program LTO this read is hoisted out of the inlined malloc/free and kept in a callee-saved register across a user-space context switch (swapcontext, ucontext, boost::context). If the fiber resumes on a different OS thread, two threads race on one tcache: silent heap corruption, reproducible only under LTO, invisible to sanitizers.

The fix includes two patches:

  • JEMALLOC_TLS_ADDR(tsd_tls): per-variable noinline accessor that takes the
    address behind a memory barrier — an opaque call the optimizer cannot
    hoist/CSE across the switch. Correct for every TLS model (__tls_get_addr's
    result is equally cacheable, unlike pthread_getspecific, so dynamic TLS is
    affected too). MSVC keeps the plain address.
  • Inline fast path on top (initial-exec model only, jemalloc's default): the
    tpoff constant is captured once at runtime into a global by a noinline
    helper; tsd_get then does volatile_asm_thread_pointer + offset. Note
    computing &tsd_tls - __builtin_thread_pointer() inline is not correct:
    clang hoists the stale &tsd_tls while re-materializing the tp terms, and
    fresh_tp + (stale_addr - fresh_tp) cancels the volatile read back to the
    stale address (this variant was tested and still corrupts).

test/stress/microbench (malloc(1)/free pairs, pinned core, clang):

unpatched accessor only + fast path
malloc, no LTO 7.0 ns 9.8 (+40%) 7.0 (±0%)
free, no LTO 6.5 ns 9.2 (+40%) 6.5 (±0%)
malloc, ThinLTO 7.0 ns 9.0 (+28%) 7.6 (~+8%)
free, ThinLTO 6.5 ns 8.4 (+28%) 7.0 (~+8%)

Refs: jemalloc#2890
Refs: ClickHouse/ClickHouse#102460

azat and others added 4 commits June 10, 2026 00:11
A worker-thread pool runs ucontext fibers that each do free/swapcontext/malloc
in one frame, so a fiber is routinely resumed on a different OS thread than it
suspended on.  With the tsd hoisting bug and a whole-program-LTO build the
inlined fastpath frees/allocates against the previous thread's tcache and the
process crashes; with the fix it runs to completion.

Reproducing requires the allocator inlined next to the swapcontext, so:
  - it is a standalone program (no test harness, so it can be static-linked
    without symbol clashes) that calls jemalloc via JEMALLOC_MANGLE, so
    malloc/free bind to the configured-prefix symbols (the libc wrappers do not
    inline) -- independent of --with-jemalloc-prefix;
  - Makefile.in links this one test against libjemalloc.a with --whole-archive
    (guarded by enable_static), not the shared library;
  - it only manifests under whole-program LTO -- a no-op guard otherwise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `tcache_fiber_migration` reproducer (issue jemalloc#2890) only exercises the bug
when the allocator is statically linked and inlined under whole-program LTO.
None of the existing lanes build that way, so add a dedicated Linux lane:
clang ThinLTO, `--with-jemalloc-prefix=je_`, and llvm-ar/nm/ranlib (to archive
the LTO bitcode) + `-fuse-ld=lld`, then `make check`.

Authored in scripts/gen_gh_actions.py; .github/workflows/linux-ci.yml
regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`tsd_get` returned `&tsd_tls`, a loop-invariant address the compiler forms as
`thread_pointer + const_offset`.  Under whole-program LTO it can be hoisted out
of the inlined `malloc`/`free` and kept in a callee-saved register across a
user-space context switch (`swapcontext`, ucontext / `boost::context` fibers,
coroutine runtimes).  If the context resumes on a *different* OS thread, the
cached tsd/tcache still belongs to the previous thread and the two race on one
tcache -- silent heap corruption that reproduces only under LTO and is invisible
to sanitizers.

Route every GNU TSD backend's `tsd_get` through `JEMALLOC_TLS_ADDR(tsd_tls)`,
backed by a per-variable accessor (`JEMALLOC_TLS_ADDR_DEFINE`) that takes the
address *inside* a `noinline` function holding a `memory` barrier.  The
optimizer can neither inline it, prove it pure, nor hoist/CSE the access across
the switch, so the thread-local is resolved afresh on the running thread.
Correct for every TLS model -- the access sits behind an opaque call boundary --
at the cost of one real call per public allocator entry.  Non-GNU compilers
(MSVC) have no inline asm and keep the plain `&tsd_tls`.

An earlier attempt re-read only the thread pointer with a `volatile` asm and
added the compile-time offset; that proved insufficient -- clang ThinLTO still
hoisted the access and corrupted the tcache (verified: crash with the asm path,
clean with this accessor, same source/model/compiler).  The address itself has
to be materialized behind the call boundary.

The accessor has a measurable cost:

Non-LTO:

| op        | no-fix  | fix     | overhead        |
|-----------|---------|---------|-----------------|
| malloc(1) | ~7.0 ns | 9.79 ns | +2.8 ns (~40%)  |
| free      | ~6.6 ns | 9.23 ns | +2.6 ns (~40%)  |

LTO:

| op        | no-fix   | fix      | overhead        |
|-----------|----------|----------|-----------------|
| malloc(1) | ~7.02 ns | ~8.99 ns | +1.97 ns (~28%) |
| free      | ~6.49 ns | ~8.42 ns | +1.9 ns (~29%)  |
The noinline accessor puts an opaque call on the malloc/free fastpath, which
the caller cannot CSE, schedule across, or keep registers alive over.
test/stress/microbench (malloc(1)/free pairs, pinned core, clang):

                     unpatched      accessor       this commit
  malloc, no LTO     7.0 ns/op      9.8 (+40%)     7.0 (+-0%)
  free,   no LTO     6.5 ns/op      9.2 (+40%)     6.5 (+-0%)
  malloc, ThinLTO    7.0 ns/op      9.0 (+28%)     7.6 (~+8%)
  free,   ThinLTO    6.5 ns/op      8.4 (+28%)     7.0 (~+8%)

Under a static TLS model the address is `thread_pointer + offset` with a
thread-independent offset, so inline the access: capture the offset once at
runtime into a global via a noinline helper (lazy, sentinel-initialized), and
re-read the thread pointer on every call with a `volatile` asm the optimizer
may not hoist or CSE (per-arch reads from mimalloc's `mi_prim_tls_slot`).
Both fast-path inputs are hoist-proof: the tp read is volatile, and the offset
global is thread-independent, so hoisting it across a context switch is
harmless.

Note the offset must NOT be computed inline as
`&tsd_tls - __builtin_thread_pointer()`: the two terms are independently
hoistable, and clang ThinLTO keeps the (stale) `&tsd_tls` in a callee-saved
register across the switch while re-materializing the thread-pointer terms,
producing `fresh_tp + (stale_addr - fresh_tp)` -- the volatile read is
algebraically cancelled back to the stale address:

    # preheader                          # loop body
    movq %fs:0, %rax                     movq %fs:0, %rax     <- volatile asm
    leaq tsd@TPOFF(%rax), %r14           movq %r14, %rcx
                                         subq %fs:0, %rcx
                                         addl (%rax,%rcx), .. <- = stale r14

Capturing the offset behind a call boundary evaluates both terms at one point
on one thread, where the difference really is the tpoff constant.

The fast path is gated on `JEMALLOC_TLS_MODEL_INITIAL_EXEC` (a new configure
define, set whenever jemalloc applies its default initial-exec model): under
the dynamic models the offset is not thread-independent.  It also requires GNU
asm, a known architecture, and !_WIN32 (MinGW's thread pointer lives in the
TEB, not fs/gs:0).  Everything else keeps the noinline accessor.

Verified with the fiber-migration reproducer (clang ThinLTO, static link,
je_-prefixed entry points): clean, with the inlined fastpath re-reading the
thread pointer after every `swapcontext`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test links the whole libjemalloc.a (--whole-archive), which contains
jemalloc_cpp.o under --enable-cxx, so the link needs libstdc++
(std::set_new_handler, __cxa_*, typeinfo for std::bad_alloc,
__gxx_personality_v0).  Pull it from $(LIBS) the same way the generic
integration rule does; a no-cxx build is unaffected since $(LIBS) then
has no -lstdc++.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@azat azat marked this pull request as draft June 9, 2026 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant