Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA (#20174)
shoumikhin wants to merge 1 commit
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20174
✅ You can merge normally! (1 Unrelated Failure)
As of commit 382c3e6 with merge base 0b13b6a: BROKEN TRUNK. The failing job was also failing on the merge base.
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108102185.
Pull request overview
This PR addresses a Linux x86 host crash caused by nested parallelism: ExecuTorch’s pthreadpool workers call into a multithreaded BLAS (e.g., MKL via OpenMP), which can crash when the BLAS tries to create a nested OpenMP team inside the worker thread. The proposed fix forces BLAS to run single-threaded for the duration of each gemm call when invoked from within an ExecuTorch threadpool worker, using weak OpenMP symbols so the code becomes a no-op when OpenMP isn’t present.
Changes:
- Add an RAII guard (`ScopedSingleThreadBlas`) that temporarily sets `omp_set_num_threads(1)` during BLAS `gemm` calls in nested threadpool contexts.
- Apply the guard to multiple `gemm` overloads (float/double/complex) on non-Apple BLAS builds.
- Use weak OpenMP symbol declarations to avoid linking a second OpenMP runtime.
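The guard described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ScopedSingleThread`, `GetThreadsFn`, `SetThreadsFn`, `host_omp_get`, `host_omp_set`, and `gemm_single_threaded` are hypothetical names, and the function pointers are injected only so the save/restore logic can be exercised without an OpenMP runtime. The only real library symbols used are `omp_get_max_threads` and `omp_set_num_threads`, declared weak with the GCC/Clang `__attribute__((weak))` extension.

```cpp
#if defined(__linux__)
// Weak declarations of two real OpenMP entry points. On ELF targets an
// unresolved weak function has a null address, so referencing these does
// not pull in (or require) an OpenMP runtime.
extern "C" {
int omp_get_max_threads(void) __attribute__((weak));
void omp_set_num_threads(int) __attribute__((weak));
}
#endif

using GetThreadsFn = int (*)();
using SetThreadsFn = void (*)(int);

// Hypothetical RAII guard: forces the thread count to 1 for its lifetime
// and restores the previous value on destruction. If either function
// pointer is null (no OpenMP runtime linked), it does nothing.
class ScopedSingleThread {
 public:
  ScopedSingleThread(GetThreadsFn get, SetThreadsFn set) : set_(set) {
    if (get == nullptr || set == nullptr) {
      return;  // no runtime present: stay a no-op
    }
    saved_ = get();
    if (saved_ > 1) {
      set_(1);
      active_ = true;
    }
  }
  ~ScopedSingleThread() {
    if (active_) {
      set_(saved_);  // restore the caller's setting
    }
  }
  ScopedSingleThread(const ScopedSingleThread&) = delete;
  ScopedSingleThread& operator=(const ScopedSingleThread&) = delete;

 private:
  SetThreadsFn set_;
  int saved_ = 0;
  bool active_ = false;
};

// Resolve the weak symbols; both return nullptr off Linux or when no
// OpenMP runtime is linked into the process.
inline GetThreadsFn host_omp_get() {
#if defined(__linux__)
  return &omp_get_max_threads;
#else
  return nullptr;
#endif
}

inline SetThreadsFn host_omp_set() {
#if defined(__linux__)
  return &omp_set_num_threads;
#else
  return nullptr;
#endif
}

// Hypothetical call site: run one BLAS call under the guard.
template <typename Gemm>
void gemm_single_threaded(Gemm&& gemm) {
  ScopedSingleThread guard(host_omp_get(), host_omp_set());
  gemm();
}
```

Restoring the saved value in the destructor keeps the change scoped to a single call, so code that runs on the same thread after the `gemm` returns sees the thread count it had before.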
    #if defined(ET_BUILD_WITH_BLAS) && !defined(ET_BUILD_FOR_APPLE)
    #include <executorch/extension/threadpool/threadpool_guard.h>
Fixed: the threadpool_guard.h include and the entire ScopedSingleThreadBlas mechanism are now gated on ET_USE_THREADPOOL (combined with __linux__). A BLAS build that doesn't link the threadpool extension no longer pulls in this header or any threadpool symbol.
    // Only constrain the BLAS when nested inside an ExecuTorch threadpool
    // worker (where NoThreadPoolGuard is enabled); leave top-level gemm calls
    // free to use the threaded BLAS.
    if (!::executorch::extension::threadpool::NoThreadPoolGuard::is_enabled()) {
      return;
    }
Fixed, per your suggestion: NoThreadPoolGuard::is_enabled() is now referenced only under #if defined(__linux__) && defined(ET_USE_THREADPOOL). Without the threadpool there are no worker threads to nest from, so the guard is a correct no-op and there is no link-time dependency on extension/threadpool. (Gating to Linux/ELF also fixed a real macOS link break — the weak omp_* symbols don't null-resolve on Mach-O.)
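The gating described in this reply can be illustrated with a small sketch. Everything here is an assumption made for illustration: `WorkerScope` is a hypothetical thread-local stand-in for `NoThreadPoolGuard` (whose implementation is not shown in this thread), `SKETCH_NESTED_GUARD` stands in for the real condition `defined(__linux__) && defined(ET_USE_THREADPOOL)`, and `blas_threads_for_call` is a hypothetical helper.

```cpp
// Hypothetical switch standing in for the PR's condition
// `defined(__linux__) && defined(ET_USE_THREADPOOL)`.
#ifndef SKETCH_NESTED_GUARD
#define SKETCH_NESTED_GUARD 1
#endif

#if SKETCH_NESTED_GUARD
// Hypothetical stand-in for NoThreadPoolGuard: marks the current thread as
// running inside a threadpool worker for the lifetime of the object.
class WorkerScope {
 public:
  WorkerScope() : previous_(flag()) { flag() = true; }
  ~WorkerScope() { flag() = previous_; }
  WorkerScope(const WorkerScope&) = delete;
  WorkerScope& operator=(const WorkerScope&) = delete;

  static bool is_enabled() { return flag(); }

 private:
  // One flag per thread, so marking a worker does not affect other threads.
  static bool& flag() {
    thread_local bool in_worker = false;
    return in_worker;
  }
  bool previous_;
};
#endif

// Decide how many BLAS threads a gemm call may use. Only a call nested
// inside a worker is constrained; top-level calls keep the configured count.
inline int blas_threads_for_call(int configured_threads) {
#if SKETCH_NESTED_GUARD
  if (WorkerScope::is_enabled()) {
    return 1;
  }
#endif
  return configured_threads;
}
```

With the switch compiled out, the helper never references the worker flag, which mirrors the point made in the reply: a build without the threadpool has no worker threads to nest from and should carry no dependency on it.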
669a0fb to f89fb2d
f89fb2d to fc3ce66
fc3ce66 to 382c3e6
@claude Review
Summary:
The optimized SDPA kernel (cpu_flash_attention) parallelizes attention across
ExecuTorch's own pthreadpool, and each worker calls cpublas::gemm for the Q*K
and attn*V matmuls. On Linux x86 host builds the BLAS backend is multithreaded
MKL, which spins up its own OpenMP thread team for sgemm. Creating that nested
OpenMP team from inside a pthreadpool worker thread crashes with a SEGV in
__kmp_create_worker / KMP_UBER_GTID.
Full stack (from ASAN):
__kmp_create_worker <- __kmp_allocate_team <- __kmpc_fork_call
<- mkl_blas_sgemm <- executorch::cpublas::gemm (CPUBlas.cpp)
<- _q_at_k_gemm (op_sdpa_impl.h) <- cpu_flash_attention lambda
<- ThreadPool::run worker (pthreadpool)
It only reproduces on x86 host (and surfaces under ASAN); mobile/device builds
link XNNPACK / a single-threaded BLAS and never nest.
Fix: force the host BLAS single-threaded for the duration of each gemm call via
weak OpenMP symbols (omp_get_max_threads / omp_set_num_threads). The symbols are
weak so this compiles to a no-op when OpenMP is not linked (e.g. mobile/device).
ExecuTorch already provides operator-level parallelism through its threadpool,
so a nested BLAS thread team is never wanted on this path.
Differential Revision: D108102185