Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA (#20174) #20174

Open
shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:export-D108102185

Conversation

@shoumikhin shoumikhin commented Jun 10, 2026

Contributor

Summary:

The optimized SDPA kernel (cpu_flash_attention) parallelizes attention across
ExecuTorch's own pthreadpool, and each worker calls cpublas::gemm for the Q*K
and attn*V matmuls. On Linux x86 host builds the BLAS backend is multithreaded
MKL, which spins up its own OpenMP thread team for sgemm. Creating that nested
OpenMP team from inside a pthreadpool worker thread crashes with a SEGV in
__kmp_create_worker / KMP_UBER_GTID.

Full stack (from ASAN):
  __kmp_create_worker <- __kmp_allocate_team <- __kmpc_fork_call
   <- mkl_blas_sgemm <- executorch::cpublas::gemm (CPUBlas.cpp)
   <- _q_at_k_gemm (op_sdpa_impl.h) <- cpu_flash_attention lambda
   <- ThreadPool::run worker (pthreadpool)

It only reproduces on x86 host (and surfaces under ASAN); mobile/device builds
link XNNPACK / a single-threaded BLAS and never nest.

Fix: force the host BLAS single-threaded for the duration of each gemm call via
weak OpenMP symbols (omp_get_max_threads / omp_set_num_threads). The symbols are
weak so this compiles to a no-op when OpenMP is not linked (e.g. mobile/device).
ExecuTorch already provides operator-level parallelism through its threadpool,
so a nested BLAS thread team is never wanted on this path.
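
For readers unfamiliar with the weak-symbol technique described above, here is a minimal sketch of how such a scoped guard might look. It is not the code from this PR: the `sketch` namespace, `SingleThreadOmpScope`, `omp_linked`, and `omp_max_threads` are hypothetical names chosen for illustration, and the `__linux__` gate is an assumption about how a portable version would be written. Only `omp_get_max_threads` and `omp_set_num_threads` are real OpenMP runtime functions.

```cpp
// Weak declarations: on Linux/ELF an unresolved weak function has a null
// address, so the guard can detect at run time whether OpenMP is linked.
// Other platforms compile the guard to a no-op (assumption for this sketch).
#if defined(__linux__)
extern "C" {
int omp_get_max_threads(void) __attribute__((weak));
void omp_set_num_threads(int) __attribute__((weak));
}
#endif

namespace sketch {  // hypothetical namespace, not part of ExecuTorch

// True only when both OpenMP entry points resolved at link/load time.
inline bool omp_linked() {
#if defined(__linux__)
  return omp_get_max_threads != nullptr && omp_set_num_threads != nullptr;
#else
  return false;
#endif
}

// Current OpenMP thread limit, or 1 when no OpenMP runtime is present.
inline int omp_max_threads() {
#if defined(__linux__)
  return omp_linked() ? omp_get_max_threads() : 1;
#else
  return 1;
#endif
}

// RAII guard: limits OpenMP to one thread for its lifetime and restores
// the previous limit on destruction. Does nothing without OpenMP.
class SingleThreadOmpScope {
 public:
  SingleThreadOmpScope() {
#if defined(__linux__)
    if (!omp_linked()) {
      return;
    }
    saved_ = omp_get_max_threads();
    if (saved_ > 1) {
      omp_set_num_threads(1);
      active_ = true;
    }
#endif
  }

  ~SingleThreadOmpScope() {
#if defined(__linux__)
    if (active_) {
      omp_set_num_threads(saved_);
    }
#endif
  }

  SingleThreadOmpScope(const SingleThreadOmpScope&) = delete;
  SingleThreadOmpScope& operator=(const SingleThreadOmpScope&) = delete;

  // Whether this guard actually lowered the thread limit.
  bool active() const { return active_; }

 private:
  int saved_ = 1;
  bool active_ = false;
};

}  // namespace sketch
```

In a gemm wrapper, one would declare a `sketch::SingleThreadOmpScope` on the stack immediately before the BLAS call, so the limit applies only for the duration of that call.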

Differential Revision: D108102185

Copilot AI review requested due to automatic review settings June 10, 2026 05:15

pytorch-bot Bot commented Jun 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20174

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 382c3e6 with merge base 0b13b6a (image):

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 10, 2026

meta-codesync Bot commented Jun 10, 2026

Contributor

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108102185.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses a Linux x86 host crash caused by nested parallelism: ExecuTorch’s pthreadpool workers call into a multithreaded BLAS (e.g., MKL via OpenMP), which can crash when the BLAS tries to create a nested OpenMP team inside the worker thread. The proposed fix forces BLAS to run single-threaded for the duration of each gemm call when invoked from within an ExecuTorch threadpool worker, using weak OpenMP symbols so the code becomes a no-op when OpenMP isn’t present.

Changes:

  • Add an RAII guard (ScopedSingleThreadBlas) that temporarily sets omp_set_num_threads(1) during BLAS gemm calls in nested threadpool contexts.
  • Apply the guard to multiple gemm overloads (float/double/complex) on non-Apple BLAS builds.
  • Use weak OpenMP symbol declarations to avoid linking a second OpenMP runtime.

Comment on lines +26 to +28
#if defined(ET_BUILD_WITH_BLAS) && !defined(ET_BUILD_FOR_APPLE)
#include <executorch/extension/threadpool/threadpool_guard.h>

Contributor Author

Fixed: the threadpool_guard.h include and the entire ScopedSingleThreadBlas mechanism are now gated on ET_USE_THREADPOOL (combined with __linux__). A BLAS build that doesn't link the threadpool extension no longer pulls in this header or any threadpool symbol.

Comment thread kernels/optimized/blas/CPUBlas.cpp Outdated
Comment on lines +60 to +65
// Only constrain the BLAS when nested inside an ExecuTorch threadpool
// worker (where NoThreadPoolGuard is enabled); leave top-level gemm calls
// free to use the threaded BLAS.
if (!::executorch::extension::threadpool::NoThreadPoolGuard::is_enabled()) {
  return;
}

Contributor Author

Fixed, per your suggestion: NoThreadPoolGuard::is_enabled() is now referenced only under #if defined(__linux__) && defined(ET_USE_THREADPOOL). Without the threadpool there are no worker threads to nest from, so the guard is a correct no-op and there is no link-time dependency on extension/threadpool. (Gating to Linux/ELF also fixed a real macOS link break — the weak omp_* symbols don't null-resolve on Mach-O.)
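
To illustrate the gating described in this reply, here is a rough sketch of a per-thread "inside a worker" flag combined with the compile-time condition. It does not reproduce the real `NoThreadPoolGuard`: `WorkerScope`, `kGateCompiledIn`, and `should_limit_blas` are hypothetical stand-ins, and the use of a `thread_local` flag is an assumption made for the example. The macros `__linux__` and `ET_USE_THREADPOOL` are the ones named in the reply.

```cpp
namespace sketch {  // hypothetical namespace, not part of ExecuTorch

// Hypothetical stand-in for a guard that marks the current thread as
// running inside a threadpool worker. The flag is per-thread, so setting
// it in one worker does not affect top-level callers on other threads.
class WorkerScope {
 public:
  WorkerScope() : prev_(flag()) { flag() = true; }
  ~WorkerScope() { flag() = prev_; }

  WorkerScope(const WorkerScope&) = delete;
  WorkerScope& operator=(const WorkerScope&) = delete;

  static bool is_enabled() { return flag(); }

 private:
  static bool& flag() {
    thread_local bool in_worker = false;
    return in_worker;
  }
  bool prev_;
};

// The nested-context check is compiled in only when both conditions from
// the reply hold; otherwise it is a constant false and nothing from the
// threadpool needs to be linked.
#if defined(__linux__) && defined(ET_USE_THREADPOOL)
constexpr bool kGateCompiledIn = true;
#else
constexpr bool kGateCompiledIn = false;
#endif

// Decides whether a gemm call should constrain the BLAS to one thread.
inline bool should_limit_blas() {
  if (!kGateCompiledIn) {
    return false;  // no threadpool (or not Linux): guard is a no-op
  }
  return WorkerScope::is_enabled();
}

}  // namespace sketch
```

With this shape, a top-level gemm call leaves the threaded BLAS untouched, while a call made under a `WorkerScope` is limited only in builds where the gate is compiled in.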

@meta-codesync meta-codesync Bot changed the title Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA Run host BLAS single-threaded to avoid nested OpenMP crash in SDPA (#20174) Jun 10, 2026
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 10, 2026
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 10, 2026
Copilot AI review requested due to automatic review settings June 10, 2026 05:45

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread kernels/optimized/blas/CPUBlas.cpp Outdated
Comment thread kernels/optimized/blas/CPUBlas.cpp Outdated
@mergennachin

Contributor

@claude Review

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

3 participants