
cuda.bindings benchmarks part 5 #1964

Open
danielfrg wants to merge 5 commits into main from cuda-bindings-bench-part-5

Conversation

@danielfrg
Contributor

@danielfrg danielfrg commented Apr 22, 2026

Description

Follow-up to #1580.

This one is to discuss and improve the C++ harness for more stable data collection.

I added a --min-time flag and made the compare output report the min/std.

This is the latest run I got, where most RSDs are <2%.

stream.stream_create_destroy went back from about 8% to 6% RSD when I collected for longer.

-----------------------------------------------------------------------------------------------------
Benchmark                                   C++ (mean)   C++ RSD   Python (mean)   Py RSD    Overhead
-----------------------------------------------------------------------------------------------------
ctx_device.ctx_get_current                        6 ns      1.6%          111 ns     1.3%     +105 ns
ctx_device.ctx_get_device                         9 ns      1.5%          116 ns     1.7%     +107 ns
ctx_device.ctx_set_current                        8 ns      1.0%          101 ns     2.8%      +93 ns
ctx_device.device_get                             6 ns      4.4%          127 ns     2.4%     +121 ns
ctx_device.device_get_attribute                   8 ns      1.6%          190 ns     1.5%     +182 ns
event.event_create_destroy                       79 ns      1.3%          307 ns     2.0%     +228 ns
event.event_query                                77 ns      1.2%          210 ns     7.1%     +133 ns
event.event_record                               88 ns      1.2%          221 ns     2.2%     +133 ns
event.event_synchronize                          98 ns      1.2%          225 ns     3.2%     +128 ns
launch.launch_16_args                          1.59 us      1.2%         3.12 us     1.6%    +1528 ns
launch.launch_16_args_pre_packed               1.57 us      0.5%         1.97 us     0.3%     +395 ns
launch.launch_2048b                            1.68 us      1.9%         2.50 us     1.6%     +820 ns
launch.launch_256_args                         2.34 us      0.9%        16.50 us     1.8%   +14158 ns
launch.launch_512_args                         3.33 us      0.8%        31.31 us     1.7%   +27973 ns
launch.launch_512_args_pre_packed              3.37 us      1.2%         3.77 us     0.8%     +392 ns
launch.launch_512_bools                        3.39 us      0.7%        57.82 us     3.2%   +54435 ns
launch.launch_512_bytes                        3.40 us      0.7%        59.53 us     2.8%   +56134 ns
launch.launch_512_doubles                      3.32 us      0.7%        86.76 us     3.6%   +83438 ns
launch.launch_512_ints                         3.18 us      0.8%        60.59 us     3.6%   +57411 ns
launch.launch_512_longlongs                    3.31 us      0.7%        65.91 us     2.4%   +62605 ns
launch.launch_empty_kernel                     1.62 us      1.9%         1.87 us     1.4%     +250 ns
launch.launch_small_kernel                     1.57 us      0.3%         2.22 us     1.2%     +651 ns
memory.mem_alloc_async_free_async               386 ns      1.1%          747 ns     1.7%     +361 ns
memory.mem_alloc_free                          1.55 us      2.2%         1.98 us     1.5%     +429 ns
memory.memcpy_dtod                             2.07 us      0.5%         2.29 us     1.4%     +219 ns
memory.memcpy_dtoh                             4.99 us      0.3%         5.42 us     0.8%     +437 ns
memory.memcpy_htod                             3.94 us      0.2%         4.00 us     1.0%      +61 ns
module.func_get_attribute                        14 ns      1.5%          212 ns     1.8%     +198 ns
module.module_get_function                       33 ns      1.7%          179 ns     1.9%     +146 ns
module.module_load_unload                      7.73 us      0.7%         8.26 us     1.4%     +528 ns
nvrtc.nvrtc_compile_program                 7194.82 us      1.2%      7294.98 us     1.1%  +100158 ns
nvrtc.nvrtc_create_program                       68 ns      1.2%          670 ns     1.8%     +603 ns
nvrtc.nvrtc_create_program_100_headers        11.08 us      3.7%        13.14 us     3.0%    +2052 ns
pointer_attributes.pointer_get_attribute         27 ns      1.6%          487 ns     2.1%     +460 ns
stream.stream_create_destroy                   3.48 us      6.1%         3.53 us     1.6%      +57 ns
stream.stream_query                              84 ns      1.5%          217 ns     2.2%     +133 ns
stream.stream_synchronize                       115 ns      1.1%          243 ns     2.8%     +128 ns
-----------------------------------------------------------------------------------------------------

This makes me feel a bit better about those numbers, though the Python ones do seem more stable, except event.event_query.

If we think it's worth doing process isolation for the C++ harness, let me know and I will take a look.
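For reference, the Overhead column in the table above is just the difference of the per-op means, reported in nanoseconds. A minimal Python sketch of that computation (`overhead_ns` here is an illustrative helper, not part of the harness):

```python
def overhead_ns(cpp_mean_sec: float, py_mean_sec: float) -> float:
    # Overhead = Python mean - C++ mean, converted from seconds to nanoseconds
    return (py_mean_sec - cpp_mean_sec) * 1e9

# First row of the table: 6 ns C++ vs 111 ns Python -> +105 ns
print(f"+{overhead_ns(6e-9, 111e-9):.0f} ns")
```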

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielfrg danielfrg requested review from mdboom and rwgk April 22, 2026 18:47
@danielfrg danielfrg self-assigned this Apr 22, 2026
@danielfrg danielfrg added cuda.bindings Everything related to the cuda.bindings module performance labels Apr 22, 2026
@danielfrg danielfrg added this to the cuda.core v1.0.0 milestone Apr 22, 2026

// Calibrate loop count to hit a minimum wall time per value.
template <typename Fn>
std::uint64_t calibrate_loops(const Options& options, Fn&& fn) {
Contributor


calibrate_loops() looks like a custom reimplementation of the same minimum-time calibration idea that tools like pyperf and Google Benchmark already provide. I understand that there are advantages to keeping the existing custom C++ harness / pyperf-compatible JSON output, but if we go that route, I think the custom calibration should match the usual/expected --min-time semantics more closely.

Right now, I do not think --min-time actually guarantees the requested minimum sample duration for a large chunk of these microbenchmarks. The calibration stops once it hits max_loops = 1000000, and BenchmarkSuite::run() then uses that capped loop count as-is. For very fast benchmarks, that cap is still well short of 0.1 s per value. For example:

  • ~6 ns/op tops out around 6 ms
  • ~33 ns/op tops out around 33 ms
  • even ~79 ns/op only reaches around 79 ms

So the new "stable collection" mode still under-times many of the exact fast C++ benchmarks it is meant to stabilize. The same issue also seems to apply to the explicit-loop overload, which means the --min-time semantics are not consistently honored there either.

I think that is important to fix before we rely on these numbers as "min-time collected" results, because readers will reasonably assume that --min-time 0.1 means each measured value is actually being calibrated to about 100 ms.
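To make the cap concern concrete, here is a minimal Python sketch of the doubling-style min-time calibration being described (`calibrate_loops` here is an illustrative stand-in, not the actual C++ implementation). A run that stops at `max_loops` returns "capped" with `elapsed` still below the target:

```python
import time

def calibrate_loops(fn, min_time_sec=0.1, max_loops=1_000_000):
    """Double the loop count until one timed batch reaches min_time_sec,
    or until the max_loops cap is hit -- in which case the batch is
    shorter than requested (the under-timing described above)."""
    loops = 1
    while True:
        t0 = time.perf_counter()
        for _ in range(loops):
            fn()
        elapsed = time.perf_counter() - t0
        if elapsed >= min_time_sec:
            return loops, elapsed, False   # calibrated to the target
        if loops >= max_loops:
            return loops, elapsed, True    # capped: elapsed < min_time_sec
        loops = min(loops * 2, max_loops)

# A ~6 ns/op benchmark capped at 1M loops runs only ~0.006 s per value,
# well short of a 0.1 s target:
print(1_000_000 * 6e-9)
```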

Contributor Author


Yeah, that's a good point. I raised the default --max-loops from 1M to 100M so typical microbenchmarks (down to ~6 ns/op) can reach a 100 ms target. I kept it for safety, but I would also be fine removing the default.

I am mostly matching the pyperf JSON so we can compare results, but I am happy to change to something else if we think that's better.

std::uint64_t loops = options_.loops;
Options custom = options_;
if (options_.min_time_sec > 0.0) {
loops = calibrate_loops(options_, fn);
Contributor


Generated by Cursor GPT-5.4 Extra High Fast after a few rounds of prompting, with a few small edits


I think the --min-time direction is great, but for async / stream-backed benchmarks the calibration pass may change the state being measured before the first timed sample. For synchronous operations that is probably harmless, but for enqueue-style operations it looks like calibration can leave a substantial backlog on the persistent stream, so the harness may end up measuring "enqueue on an already busy stream" rather than the quieter starting condition these benchmarks had before.

More concretely, calibrate_loops() repeatedly calls fn() until the target wall time is reached, and BenchmarkSuite::run() then immediately goes into run_benchmark() with no post-calibration drain/reset. That means calibration is not just estimating a loop count; it is also mutating benchmark state right before measurement begins.

Why I think this matters:

  • For async benchmarks, fn() often enqueues work onto a persistent stream rather than fully completing it.
  • A 0.1 s target can mean a lot of enqueued work before the first warmup/value. For example, a ~1.6 us kernel launch benchmark needs on the order of ~60k launches to get there.
  • So the first measured sample may start from a deeply backlogged stream rather than an effectively empty stream.

That seems especially relevant for benchmarks like:

  • kernel launches in bench_launch.cpp
  • cuMemAllocAsync / cuMemFreeAsync in bench_memory.cpp
  • cuEventRecord in bench_event.cpp

The effect may be small for some cases, but since this PR is specifically about improving measurement stability, I think it is worth making sure the calibration step is not also changing what is being measured.

Helpful follow-up directions could be:

  • add a post-calibration drain/reset step before the first measured warmup/value for async benchmarks
  • or do a quick A/B comparison with and without a cuStreamSynchronize() after calibration
  • or calibrate on a separate warmup path/stream if that is practical

If those variants all produce essentially the same numbers, that would be reassuring. If they move the launch/async-memory/event benchmarks materially, then this concern is real.
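A minimal Python sketch of the backlog effect and the proposed drain hook (`FakeStream` and `calibrate_then_measure` are hypothetical stand-ins for the stream and harness, not real APIs). Enqueue-style operations return before the work completes, so calibration alone leaves a backlog unless something drains it:

```python
import time

class FakeStream:
    """Stand-in for a CUDA stream: enqueue is cheap, work completes later."""
    def __init__(self):
        self.backlog = 0
    def enqueue(self):
        self.backlog += 1    # like cuLaunchKernel: returns before completion
    def synchronize(self):
        self.backlog = 0     # like cuStreamSynchronize: drain everything

def calibrate_then_measure(stream, min_time_sec=0.001, post_calibrate=None):
    # Calibration probe: enqueue until the wall-time target is reached.
    loops = 0
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < min_time_sec:
        stream.enqueue()
        loops += 1
    # Without a drain here, the first measured sample would start from
    # whatever backlog calibration left behind.
    if post_calibrate:
        post_calibrate()
    return loops, stream.backlog

stream = FakeStream()
loops, backlog = calibrate_then_measure(stream, post_calibrate=stream.synchronize)
# backlog == 0: measurement starts from a quiet stream
```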

Contributor Author


Added a set_post_calibrate() hook on BenchmarkSuite that runs after calibration and before the first measured warmup. Wired cuStreamSynchronize(stream) as the hook in the four binaries that enqueue onto the persistent stream: bench_launch.cpp, bench_memory.cpp, bench_event.cpp, and bench_stream.cpp. So calibration no longer leaves a backlog for the measurement phase to start from.

Comment thread benchmarks/cuda_bindings/README.md Outdated
@danielfrg
Contributor Author

Added those changes from the review. I ran it again on my dev server and didn't see a big change in the results, but I agree that they should make things more stable.

@rwgk
Contributor

rwgk commented Apr 27, 2026

I just deleted the last comment. My agent posted this even though I didn't really ask it to. I'm still reviewing.

@rwgk
Contributor

rwgk commented Apr 27, 2026

This suggested change tightens the bookkeeping in calibrate_loops(): it makes capped_out and last_elapsed_out describe the same calibration round that produced the returned loop count. Before this, those outputs could reflect a different round, which made the calibration warning/debug information misleading.

commit 19444493405249c49bbf0381db96b694c1dba549
Author: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>
Date:   Mon Apr 27 12:22:17 2026 -0700

    best_capped, best_elapsed tracking

diff --git a/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp b/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp
index 131f69de54..3c76b93caf 100644
--- a/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp
+++ b/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp
@@ -136,15 +136,16 @@ std::uint64_t calibrate_loops(
     // Allow callers (e.g. the explicit-loop overload) to request a minimum
     // starting loop count via options.loops.
     const std::uint64_t start_loops = std::max<std::uint64_t>(1, options.loops);
-    std::uint64_t best = start_loops;
     const std::uint64_t max_loops = std::max<std::uint64_t>(start_loops, options.max_loops);
     const std::uint64_t rounds = std::max<std::uint64_t>(1, options.calibrate_rounds);
 
-    bool capped = false;
-    double last_elapsed = 0.0;
+    std::uint64_t best_loops = 0;
+    bool best_capped = false;
+    double best_elapsed = 0.0;
 
     for (std::uint64_t round = 0; round < rounds; ++round) {
         std::uint64_t loops = start_loops;
+        bool round_capped = false;
         double elapsed = 0.0;
 
         while (true) {
@@ -159,7 +160,7 @@ std::uint64_t calibrate_loops(
                 break;
             }
             if (loops >= max_loops) {
-                capped = true;
+                round_capped = true;
                 break;
             }
             if (loops > max_loops / 2) {
@@ -169,15 +170,16 @@ std::uint64_t calibrate_loops(
             }
         }
 
-        if (loops > best) {
-            best = loops;
+        if (loops >= best_loops) {
+            best_loops = loops;
+            best_capped = round_capped;
+            best_elapsed = elapsed;
         }
-        last_elapsed = elapsed;
     }
 
-    if (capped_out) *capped_out = capped;
-    if (last_elapsed_out) *last_elapsed_out = last_elapsed;
-    return best;
+    if (capped_out) *capped_out = best_capped;
+    if (last_elapsed_out) *last_elapsed_out = best_elapsed;
+    return best_loops;
 }
 
 // Run a benchmark function. The function signature is: void fn() — one call = one operation.

@rwgk
Contributor

rwgk commented Apr 27, 2026

This suggested change moves the existing post-calibration hook into calibrate_loops() itself, so async benchmarks can drain any queued stream work after each calibration probe. That keeps later probes, later calibration rounds, and the first measured run from inheriting stream state left behind by earlier calibration activity.

commit e531d9a3ca1c2817ff80edce2f4029a23b9665ce
Author: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>
Date:   Mon Apr 27 12:29:36 2026 -0700

    Integrate post-calibration hook into calibrate_loops to run between calibration probes.

diff --git a/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp b/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp
index 3c76b93caf..d47a63797c 100644
--- a/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp
+++ b/benchmarks/cuda_bindings/benchmarks/cpp/bench_support.hpp
@@ -124,6 +124,7 @@ template <typename Fn>
 std::uint64_t calibrate_loops(
     const Options& options,
     Fn&& fn,
+    const std::function<void()>& post_calibrate = {},
     bool* capped_out = nullptr,
     double* last_elapsed_out = nullptr
 ) {
@@ -156,6 +157,9 @@ std::uint64_t calibrate_loops(
             const auto t1 = std::chrono::steady_clock::now();
             elapsed = std::chrono::duration<double>(t1 - t0).count();
 
+            if (post_calibrate) {
+                post_calibrate();
+            }
             if (elapsed >= options.min_time_sec) {
                 break;
             }
@@ -324,18 +328,18 @@ class BenchmarkSuite {
 public:
     explicit BenchmarkSuite(Options options) : options_(std::move(options)) {}
 
-    // Post-calibration hook. If set, invoked after calibration and before the
-    // first measured warmup/value, for every benchmark in this suite. Intended
-    // for async benchmarks that need to drain state left behind by calibration
-    // (e.g. cuStreamSynchronize on a persistent stream). Can be overridden
-    // per-call via the `post_calibrate` parameter on `run()`.
+    // Post-calibration hook. If set, invoked between calibration probes so
+    // async benchmarks can drain state left behind by each probe before the
+    // next one runs. The final probe leaves the benchmark in a drained state
+    // before the first measured warmup/value. Can be overridden per-call via
+    // the `post_calibrate` parameter on `run()`.
     void set_post_calibrate(std::function<void()> hook) {
         post_calibrate_ = std::move(hook);
     }
 
     // Run a benchmark and record it. The name is used as the benchmark ID.
     // If --min-time is set, loop count is auto-calibrated. `post_calibrate`,
-    // if provided, runs after calibration and before measurement.
+    // if provided, runs between calibration probes to reset async state.
     template <typename Fn>
     void run(
         const std::string& name,
@@ -345,9 +349,8 @@ public:
         std::uint64_t loops = options_.loops;
         Options custom = options_;
         if (options_.min_time_sec > 0.0) {
-            loops = calibrate_and_warn(name, options_, fn);
+            loops = calibrate_and_warn(name, options_, fn, select_post_calibrate(post_calibrate));
             custom.loops = loops;
-            invoke_post_calibrate(post_calibrate);
         }
         auto results = run_benchmark(custom, std::forward<Fn>(fn));
         print_summary(name, results);
@@ -370,9 +373,8 @@ public:
         if (options_.min_time_sec > 0.0) {
             Options calib_opts = options_;
             calib_opts.loops = loops_override;  // floor
-            loops = calibrate_and_warn(name, calib_opts, fn);
+            loops = calibrate_and_warn(name, calib_opts, fn, select_post_calibrate(post_calibrate));
             custom.loops = loops;
-            invoke_post_calibrate(post_calibrate);
         }
         auto results = run_benchmark(custom, std::forward<Fn>(fn));
         print_summary(name, results);
@@ -391,24 +393,24 @@ private:
     std::vector<BenchmarkEntry> entries_;
     std::function<void()> post_calibrate_;
 
-    void invoke_post_calibrate(const std::function<void()>& per_call) const {
+    std::function<void()> select_post_calibrate(const std::function<void()>& per_call) const {
         if (per_call) {
-            per_call();
-        } else if (post_calibrate_) {
-            post_calibrate_();
+            return per_call;
         }
+        return post_calibrate_;
     }
 
     template <typename Fn>
     std::uint64_t calibrate_and_warn(
         const std::string& name,
         const Options& calib_opts,
-        Fn&& fn
+        Fn&& fn,
+        const std::function<void()>& post_calibrate
     ) const {
         bool capped = false;
         double last_elapsed = 0.0;
         std::uint64_t loops = calibrate_loops(
-            calib_opts, std::forward<Fn>(fn), &capped, &last_elapsed
+            calib_opts, std::forward<Fn>(fn), post_calibrate, &capped, &last_elapsed
         );
         if (capped) {
             std::cerr << "WARNING: " << name

Contributor

@rwgk rwgk left a comment


This is my last suggestion in this review round.

py_mean = py_stats[0]
cpp_mean = cpp_stats[0]
overhead_ns = (py_mean - cpp_mean) * 1e9
overhead_str = f"+{overhead_ns:.0f} ns"
Contributor


Suggested change:
- overhead_str = f"+{overhead_ns:.0f} ns"
+ overhead_str = f"{overhead_ns:+.0f} ns"

Super easy, and this will also be correct if a benchmark ever comes out with Python faster than C++.
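For reference, a quick check of the two format specs side by side: prepending "+" by hand produces a double sign for negative values, while the "+" flag in the format spec mini-language handles both signs:

```python
# Hand-prepended "+" gives a wrong sign when the value is negative:
print(f"+{-42.0:.0f} ns")   # +-42 ns
# The "+" flag in the format spec is correct for both signs:
print(f"{-42.0:+.0f} ns")   # -42 ns
print(f"{42.0:+.0f} ns")    # +42 ns
```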

