Skip to content

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201

Open
JulianCloudNTH wants to merge 2 commits into
gh/JulianCloudNTH/22/basefrom
gh/JulianCloudNTH/22/head
Open

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation)#20201
JulianCloudNTH wants to merge 2 commits into
gh/JulianCloudNTH/22/basefrom
gh/JulianCloudNTH/22/head

Conversation

@JulianCloudNTH

@JulianCloudNTH JulianCloudNTH commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Backend-agnostic GPU-timestamp infrastructure, split out so the general implementation is foundational (below SDPA) while the SDPA-specific dispatch labeling stays above the SDPA op. Composed of: WebGPUQueryPool, a faithful re-port of Vulkan's vkapi::QueryPool (backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}) — same ShaderDuration data model and ticks->ns conversion; three deviations are forced by the WebGPU API (per-dispatch bracketing via a compute-pass timestampWrites descriptor since there is no mid-encoder writeTimestamp; readback via resolveQuerySet + buffer map rather than host-side vkGetQueryPoolResults; the TimestampQuery capability requested as an explicit device feature, fail-open if the adapter lacks it). WebGPUDevice gains timestamp-feature detection, and WebGPUGraph gains a per-dispatch kernel_name label plus execute() bracketing of each compute pass when the pool is active. Opt-in via the WEBGPU_TIMESTAMP_QUERY env var; off by default, so the production execute() path is byte-identical. The SDPA per-kernel labeling lives in the companion "for SDPA" diff above the SDPA op.

Co-authored with Claude.
@exported-using-ghexport

Differential Revision: D108188287

Differential Revision: D108188287

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20201

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 13 Pending, 1 Unrelated Failure

As of commit dbcc312 with merge base 5526971 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JulianCloudNTH

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @JulianCloudNTH's task in 2m 2s —— View job


Code Review

  • Read changed files
  • Analyze WebGPUQueryPool implementation
  • Analyze WebGPUDevice / WebGPUGraph wiring
  • Post review feedback

Overall this is a clean, well-scoped re-port of the Vulkan QueryPool. The env-gating (WEBGPU_TIMESTAMP_QUERY), feature detection, and lazy pool creation keep the non-bench path zero-cost, and the lifetime comment on WGPUPassTimestampWrites tw (must outlive BeginComputePass) is exactly the kind of subtlety worth calling out. Nothing blocking. A few notes below.


Correctness / robustness

1. print_results() mean uses integer division before scalingWebGPUQueryPool.cpp:188

const double mean_us = kv.second.first / kv.second.second / 1000.0;

first is uint64_t (sum of ns) and second is uint32_t (count), so first / second is integer ns division, then /1000.0. The truncation is sub-ns and harmless, but it reads as if it were float math. Consider kv.second.first / 1000.0 / kv.second.second for clarity/consistency with total_us.

2. Single global query pool is not concurrency-safeWebGPUDevice.h:29
The pool lives on the shared WebGPUContext and is reset()/recorded/extracted within execute(). If two WebGPUGraphs on the same context ever execute concurrently they'd clobber durations_/the readback buffer. Almost certainly fine given the current single-threaded execution model, but worth a one-line comment noting the assumption since the pool now outlives any single graph.

3. extract_results() prints zeros on map failure but still proceeds to print_results()WebGPUGraph.cpp:577-578
If the readback map fails, durations_ keeps execution_duration_ns = 0 and print_results() will report 0.000 rows as if they were real measurements. Since this is bench-only that's acceptable, but a clearer signal (e.g. skipping the print, or marking the row) would avoid silently-misleading output.

Design / minor

4. Heavy include pulled into a widely-used headerWebGPUDevice.h:13
WebGPUDevice.h now #includes WebGPUQueryPool.h (which drags in <webgpu/webgpu.h>, <array>, <vector>, <string>). Since the member is std::unique_ptr<WebGPUQueryPool>, a forward declaration would suffice here — but WebGPUContext is an aggregate with no out-of-line destructor, so the complete type is needed wherever it's destroyed. If you want to slim the header, move the destructor (or just leave as-is; the coupling is minor).

5. Redundant context lookupWebGPUGraph.cpp:516
build() already caches device_/instance_ from get_default_webgpu_context(); execute() calls it again to reach ctx->querypool. Fine, just noting the pool could alternatively hang off the graph. Current placement (shared across graphs, grows-only capacity) is a reasonable trade-off.

6. ns_per_tick_ is effectively a constantWebGPUQueryPool.h:81
Per the WebGPU spec, timestamp values are already in nanoseconds, so ns_per_tick_ stays 1.0 and the * ns_per_tick_ multiplications are no-ops. Kept presumably to mirror the Vulkan port's timestampPeriod — reasonable, just flagging it's dead-but-intentional.

Questions

  • The PR says unit tests for the query pool live in the companion SDPA diff above this one. Worth confirming the test for the capacity-overrun throw in reset() (WebGPUQueryPool.cpp:74) lands there, since that's the main guard with branching logic introduced here.

[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant