[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) by JulianCloudNTH · Pull Request #20201 · pytorch/executorch

JulianCloudNTH · 2026-06-10T20:19:24Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167
[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
-> [ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: WebGPUQueryPool, a faithful re-port of Vulkan's vkapi::QueryPool (backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}) — same ShaderDuration data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass timestampWrites descriptor since there is no mid-encoder writeTimestamp; readback via resolveQuerySet + buffer map rather than host-side vkGetQueryPoolResults; the TimestampQuery capability requested as an explicit device feature, fail-open if the adapter lacks it). WebGPUDevice gains timestamp-feature detection, and WebGPUGraph gains a per-dispatch kernel_name label plus execute() bracketing of each compute pass when the pool is active. Opt-in via the WEBGPU_TIMESTAMP_QUERY env var; off by default, so the production execute() path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
@exported-using-ghexport

Differential Revision: D108188287

[ghstack-poisoned]

pytorch-bot · 2026-06-10T20:19:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20201

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 13 Pending, 1 Unrelated Failure

As of commit dbcc312 with merge base 5526971 ():

NEW FAILURES - The following jobs have failed:

pull / android / build-android (gh)
pull / test-qnn-python-imports-linux / linux-job (gh)
RuntimeError: Command docker exec -t 32ea7c870c3760eb6f30c175d86dd6abeabcd1df59a203db66a139dac3dde4e3 /exec failed with exit code 92

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-multimodal-linux (gemma3-4b) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-10T20:20:24Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-10T20:47:09Z

@claude review

claude · 2026-06-10T20:47:54Z

Claude finished @JulianCloudNTH's task in 2m 2s —— View job

Code Review

Read changed files
Analyze WebGPUQueryPool implementation
Analyze WebGPUDevice / WebGPUGraph wiring
Post review feedback

Overall this is a clean, well-scoped re-port of the Vulkan QueryPool. The env-gating (WEBGPU_TIMESTAMP_QUERY), feature detection, and lazy pool creation keep the non-bench path zero-cost, and the lifetime comment on WGPUPassTimestampWrites tw (must outlive BeginComputePass) is exactly the kind of subtlety worth calling out. Nothing blocking. A few notes below.

Correctness / robustness

1. print_results() mean uses integer division before scaling — WebGPUQueryPool.cpp:188

const double mean_us = kv.second.first / kv.second.second / 1000.0;

first is uint64_t (sum of ns) and second is uint32_t (count), so first / second is integer ns division, then /1000.0. The truncation is sub-ns and harmless, but it reads as if it were float math. Consider kv.second.first / 1000.0 / kv.second.second for clarity/consistency with total_us.

2. Single global query pool is not concurrency-safe — WebGPUDevice.h:29
The pool lives on the shared WebGPUContext and is reset()/recorded/extracted within execute(). If two WebGPUGraphs on the same context ever execute concurrently they'd clobber durations_/the readback buffer. Almost certainly fine given the current single-threaded execution model, but worth a one-line comment noting the assumption since the pool now outlives any single graph.

3. extract_results() prints zeros on map failure but still proceeds to print_results() — WebGPUGraph.cpp:577-578
If the readback map fails, durations_ keeps execution_duration_ns = 0 and print_results() will report 0.000 rows as if they were real measurements. Since this is bench-only that's acceptable, but a clearer signal (e.g. skipping the print, or marking the row) would avoid silently-misleading output.

Design / minor

4. Heavy include pulled into a widely-used header — WebGPUDevice.h:13
WebGPUDevice.h now #includes WebGPUQueryPool.h (which drags in <webgpu/webgpu.h>, <array>, <vector>, <string>). Since the member is std::unique_ptr<WebGPUQueryPool>, a forward declaration would suffice here — but WebGPUContext is an aggregate with no out-of-line destructor, so the complete type is needed wherever it's destroyed. If you want to slim the header, move the destructor (or just leave as-is; the coupling is minor).

5. Redundant context lookup — WebGPUGraph.cpp:516
build() already caches device_/instance_ from get_default_webgpu_context(); execute() calls it again to reach ctx->querypool. Fine, just noting the pool could alternatively hang off the graph. Current placement (shared across graphs, grows-only capacity) is a reasonable trade-off.

6. ns_per_tick_ is effectively a constant — WebGPUQueryPool.h:81
Per the WebGPU spec, timestamp values are already in nanoseconds, so ns_per_tick_ stays 1.0 and the * ns_per_tick_ multiplications are no-ops. Kept presumably to mirror the Vulkan port's timestampPeriod — reasonable, just flagging it's dead-but-intentional.

Questions

The PR says unit tests for the query pool live in the companion SDPA diff above this one. Worth confirming the test for the capacity-overrun throw in reset() (WebGPUQueryPool.cpp:74) lands there, since that's the main guard with branching logic introduced here.

[ghstack-poisoned]

Update

772fde7

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 10, 2026 20:19

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 10, 2026

Update

dbcc312

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201
JulianCloudNTH wants to merge 2 commits into
gh/JulianCloudNTH/22/basefrom
gh/JulianCloudNTH/22/head

JulianCloudNTH commented Jun 10, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

JulianCloudNTH commented Jun 10, 2026

Uh oh!

claude Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JulianCloudNTH commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20201

❌ 2 New Failures, 13 Pending, 1 Unrelated Failure

Uh oh!

github-actions Bot commented Jun 10, 2026

This PR needs a release notes: label

Uh oh!

JulianCloudNTH commented Jun 10, 2026

Uh oh!

claude Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Correctness / robustness

Design / minor

Questions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JulianCloudNTH commented Jun 10, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 10, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 10, 2026 •

edited

Loading