Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA (#20174) by shoumikhin · Pull Request #20174 · pytorch/executorch

shoumikhin · 2026-06-10T05:15:02Z

Summary:

The optimized SDPA kernel (cpu_flash_attention) parallelizes attention across
ExecuTorch's own pthreadpool, and each worker calls cpublas::gemm for the Q*K
and attn*V matmuls. On Linux x86 host builds the BLAS backend is multithreaded
MKL, which spins up its own OpenMP thread team for sgemm. Creating that nested
OpenMP team from inside a pthreadpool worker thread crashes with a SEGV in
__kmp_create_worker / KMP_UBER_GTID.

Full stack (from ASAN):
__kmp_create_worker <- __kmp_allocate_team <- __kmpc_fork_call
<- mkl_blas_sgemm <- executorch::cpublas::gemm (CPUBlas.cpp)
<- _q_at_k_gemm (op_sdpa_impl.h) <- cpu_flash_attention lambda
<- ThreadPool::run worker (pthreadpool)

It only reproduces on x86 host (and surfaces under ASAN); mobile/device builds
link XNNPACK / a single-threaded BLAS and never nest.

Fix: force the host BLAS single-threaded for the duration of each gemm call via
weak OpenMP symbols (omp_get_max_threads / omp_set_num_threads). The symbols are
weak so this compiles to a no-op when OpenMP is not linked (e.g. mobile/device).
ExecuTorch already provides operator-level parallelism through its threadpool,
so a nested BLAS thread team is never wanted on this path.
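
The mechanism in the Fix paragraph can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class name `ScopedSingleThreadBlasSketch` and the helpers `blas_max_threads` / `blas_set_threads` are hypothetical, and the sketch assumes GCC or Clang on Linux/ELF, where an undefined weak function resolves to a null address at link time. On other platforms it compiles to a no-op.

```cpp
#include <cassert>

// Assumption: GCC/Clang on Linux/ELF. An undefined weak symbol resolves to
// null there instead of causing a link error, so no OpenMP runtime is pulled in.
#if defined(__linux__) && (defined(__GNUC__) || defined(__clang__))
extern "C" {
int omp_get_max_threads(void) __attribute__((weak));
void omp_set_num_threads(int) __attribute__((weak));
}
#define SKETCH_HAS_WEAK_OMP 1
#else
#define SKETCH_HAS_WEAK_OMP 0
#endif

// Hypothetical helper: thread count OpenMP would use for the next parallel
// region on this thread, or 1 when no OpenMP runtime is linked.
inline int blas_max_threads() {
#if SKETCH_HAS_WEAK_OMP
  if (omp_get_max_threads != nullptr) {
    return omp_get_max_threads();
  }
#endif
  return 1;
}

// Hypothetical helper: forwards to OpenMP when it is linked, otherwise no-op.
inline void blas_set_threads(int n) {
  assert(n >= 1);
#if SKETCH_HAS_WEAK_OMP
  if (omp_set_num_threads != nullptr) {
    omp_set_num_threads(n);
  }
#endif
  (void)n;
}

// RAII guard: clamps the OpenMP thread count to 1 for the current scope and
// restores the previous value on destruction.
class ScopedSingleThreadBlasSketch {
 public:
  ScopedSingleThreadBlasSketch() : saved_(blas_max_threads()) {
    if (saved_ > 1) {
      blas_set_threads(1);
    }
  }
  ~ScopedSingleThreadBlasSketch() {
    if (saved_ > 1) {
      blas_set_threads(saved_);
    }
  }
  ScopedSingleThreadBlasSketch(const ScopedSingleThreadBlasSketch&) = delete;
  ScopedSingleThreadBlasSketch& operator=(const ScopedSingleThreadBlasSketch&) =
      delete;

 private:
  int saved_;
};
```

A gemm wrapper would construct such a guard on its first line, so the previous thread count is restored on every exit path.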

Differential Revision: D108102185

pytorch-bot · 2026-06-10T05:15:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20174

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 382c3e6 with merge base 0b13b6a:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-openvino-linux / linux-job (gh) (trunk failure)
curl: (22) The requested URL returned error:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-10T05:15:11Z

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108102185.

github-actions · 2026-06-10T05:16:01Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

This PR addresses a Linux x86 host crash caused by nested parallelism: ExecuTorch’s pthreadpool workers call into a multithreaded BLAS (e.g., MKL via OpenMP), which can crash when the BLAS tries to create a nested OpenMP team inside the worker thread. The proposed fix forces BLAS to run single-threaded for the duration of each gemm call when invoked from within an ExecuTorch threadpool worker, using weak OpenMP symbols so the code becomes a no-op when OpenMP isn’t present.

Changes:

Add an RAII guard (ScopedSingleThreadBlas) that temporarily sets omp_set_num_threads(1) during BLAS gemm calls in nested threadpool contexts.
Apply the guard to multiple gemm overloads (float/double/complex) on non-Apple BLAS builds.
Use weak OpenMP symbol declarations to avoid linking a second OpenMP runtime.

shoumikhin · 2026-06-10T05:39:09Z

+#if defined(ET_BUILD_WITH_BLAS) && !defined(ET_BUILD_FOR_APPLE)
+#include <executorch/extension/threadpool/threadpool_guard.h>
+


Fixed: the threadpool_guard.h include and the entire ScopedSingleThreadBlas mechanism are now gated on ET_USE_THREADPOOL (combined with __linux__). A BLAS build that doesn't link the threadpool extension no longer pulls in this header or any threadpool symbol.
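
The compile-time gating described in this reply might look roughly like the sketch below. `__linux__` and `ET_USE_THREADPOOL` come from the comment above; the macro `SKETCH_BLAS_GUARD_ACTIVE`, the type `GuardSketch`, and the function `dot_with_guard` are hypothetical names used only for illustration. The point is that call sites construct the guard unconditionally, while the guard itself is an empty type unless both conditions hold.

```cpp
#include <cassert>
#include <type_traits>

// Gate the whole mechanism on platform and threadpool availability.
#if defined(__linux__) && defined(ET_USE_THREADPOOL)
#define SKETCH_BLAS_GUARD_ACTIVE 1
#else
#define SKETCH_BLAS_GUARD_ACTIVE 0
#endif

constexpr bool kGuardActive = SKETCH_BLAS_GUARD_ACTIVE != 0;

#if SKETCH_BLAS_GUARD_ACTIVE
// Active variant: a real guard would include the threadpool header here,
// query it, and clamp the BLAS thread count. This placeholder only stores
// the state such a guard would need.
struct GuardSketch {
  int saved_threads = 0;
};
#else
// Inactive variant: no header, no symbols, no state.
struct GuardSketch {};
#endif

// Hypothetical call site: the guard is constructed the same way in both
// configurations, so the kernel body needs no #if of its own.
inline float dot_with_guard(const float* a, const float* b, int n) {
  assert(n >= 0);
  GuardSketch guard;
  (void)guard;
  float acc = 0.0f;
  for (int i = 0; i < n; ++i) {
    acc += a[i] * b[i];
  }
  return acc;
}
```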

shoumikhin · 2026-06-10T05:39:10Z

+    // Only constrain the BLAS when nested inside an ExecuTorch threadpool
+    // worker (where NoThreadPoolGuard is enabled); leave top-level gemm calls
+    // free to use the threaded BLAS.
+    if (!::executorch::extension::threadpool::NoThreadPoolGuard::is_enabled()) {
+      return;
+    }


Fixed, per your suggestion: NoThreadPoolGuard::is_enabled() is now referenced only under #if defined(__linux__) && defined(ET_USE_THREADPOOL). Without the threadpool there are no worker threads to nest from, so the guard is a correct no-op and there is no link-time dependency on extension/threadpool. (Gating to Linux/ELF also fixed a real macOS link break — the weak omp_* symbols don't null-resolve on Mach-O.)
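
The "only constrain when nested" check can be illustrated with a thread-local flag. This is a sketch under the assumption that the worker marker behaves like a per-thread boolean; `WorkerScopeSketch` and `blas_threads_for_call` are hypothetical stand-ins and do not reproduce the actual `NoThreadPoolGuard` implementation.

```cpp
#include <cassert>

// Hypothetical stand-in for the threadpool guard: a worker sets a
// thread-local flag for the duration of a task and restores it afterwards.
class WorkerScopeSketch {
 public:
  WorkerScopeSketch() : previous_(flag()) { flag() = true; }
  ~WorkerScopeSketch() { flag() = previous_; }
  WorkerScopeSketch(const WorkerScopeSketch&) = delete;
  WorkerScopeSketch& operator=(const WorkerScopeSketch&) = delete;

  static bool is_enabled() { return flag(); }

 private:
  static bool& flag() {
    thread_local bool in_worker = false;
    return in_worker;
  }
  bool previous_;
};

// Thread count a gemm call should use: the caller's request at top level,
// 1 when the call is nested inside a worker task.
inline int blas_threads_for_call(int requested) {
  assert(requested >= 1);
  if (!WorkerScopeSketch::is_enabled()) {
    return requested;  // top-level call: leave the BLAS threaded
  }
  return 1;  // nested call: the outer pool already provides parallelism
}
```

Because the flag is thread-local, a top-level gemm call on the main thread is unaffected while worker threads are running tasks.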

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

mergennachin · 2026-06-10T15:12:14Z

@claude Review

shoumikhin requested a review from manuelcandales as a code owner June 10, 2026 05:15

Copilot AI review requested due to automatic review settings June 10, 2026 05:15

meta-cla Bot added the CLA Signed label Jun 10, 2026

Copilot started reviewing on behalf of shoumikhin June 10, 2026 05:15

meta-codesync Bot added the meta-exported label Jun 10, 2026

Copilot AI reviewed Jun 10, 2026

meta-codesync Bot changed the title ~~Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA~~ Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA (#20174) Jun 10, 2026

shoumikhin force-pushed the export-D108102185 branch from 669a0fb to f89fb2d Compare June 10, 2026 05:39

Copilot AI review requested due to automatic review settings June 10, 2026 05:45

shoumikhin force-pushed the export-D108102185 branch from f89fb2d to fc3ce66 Compare June 10, 2026 05:45

Copilot started reviewing on behalf of shoumikhin June 10, 2026 05:45

Copilot AI reviewed Jun 10, 2026

Comment thread kernels/optimized/blas/CPUBlas.cpp Outdated

Comment thread kernels/optimized/blas/CPUBlas.cpp Outdated

shoumikhin force-pushed the export-D108102185 branch from fc3ce66 to 382c3e6 Compare June 10, 2026 06:30

shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:export-D108102185
