Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs by ntselepidis · Pull Request #987 · MFlowCode/MFC

ntselepidis · 2025-08-17T13:21:47Z

User description

This PR will add some changes to reduce registers and improve performance.
Tested with NVHPC nightly on single Santis node with 4 Grace-Hoppers on 3D_IGR_TaylorGreenVortex_nvidia case.

PR Type

Enhancement

Description

Add GPU register limit optimization for NVIDIA Grace-Hopper
Include GPU memory management directive for Jacobian arrays

Diagram Walkthrough

flowchart LR
  A["CMakeLists.txt"] -- "add maxregcount:165" --> B["GPU Register Limit"]
  C["m_igr.fpp"] -- "add GPU_DECLARE" --> D["GPU Memory Management"]
  B --> E["Performance Optimization"]
  D --> E

File Walkthrough

Relevant files

Configuration changes

CMakeLists.txt `Set GPU register count limit` CMakeLists.txt Add `-gpu=maxregcount:165` compiler flag to limit GPU register usage	+1/-1

Enhancement

m_igr.fpp `Add GPU memory management directive` src/simulation/m_igr.fpp Add `$:GPU_DECLARE` directive for Jacobian arrays (`jac`, `jac_rhs`, `jac_old`)	+1/-0

qodo-code-review · 2025-08-17T13:40:24Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Possible Issue The `$:GPU_DECLARE(create='[jac, jac_rhs, jac_old]')` is introduced for pointer variables `jac`, `jac_rhs`, and `jac_old`. If these pointers are later associated/allocated differently than expected, the directive may not correctly manage their device residency, potentially causing mismatched host/device pointers or lifetimes. Validate that allocation/association and deallocation flows are compatible with this directive. real(wp), pointer, contiguous, dimension(:, :, :) :: jac, jac_rhs, jac_old $:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') real(wp), allocatable, dimension(:, :, :), pinned, target :: jac_host Perf Risk Hard-coding `-gpu=maxregcount:165` may improve occupancy on GH200 but can cause spills and regressions on other NVIDIA GPUs or with different kernels/configurations. Consider guarding by architecture, making it configurable, or validating via ptxinfo to ensure no excessive spilling occurs. `target_compile_options(${a_target} PRIVATE -gpu=keep,ptxinfo,lineinfo -gpu=maxregcount:165 )`

qodo-code-review · 2025-08-17T13:40:55Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Merge GPU flags into one Combine GPU flags into a single `-gpu` option to avoid later flags overriding earlier ones in NVHPC. Using multiple `-gpu` options can lead to only the last one being honored, silently dropping requested features. Merge them with commas to ensure all settings apply. CMakeLists.txt [501-503] target_compile_options(${a_target} - PRIVATE -gpu=keep,ptxinfo,lineinfo -gpu=maxregcount:165 + PRIVATE -gpu=keep,ptxinfo,lineinfo,maxregcount:165 ) Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies that using multiple `-gpu` flags with the NVHPC compiler can lead to earlier flags being overridden, and proposes the correct fix of merging them.	Medium
Possible issue	Fix GPU declare list syntax Ensure the GPU declaration uses the correct list syntax expected by your codegen/macros. If the directive expects a comma-separated list without brackets, the current bracketed form may be ignored at compile-time, leaving `jac` arrays undeclared on GPU.* src/simulation/m_igr.fpp [29-30] real(wp), pointer, contiguous, dimension(:, :, :) :: jac, jac_rhs, jac_old -$:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') +$:GPU_DECLARE(create='jac, jac_rhs, jac_old') Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies a potential syntax issue in the custom `$:GPU_DECLARE` directive, but it is speculative as it depends on the macro's specific implementation which is not visible.	Low
More

codecov · 2025-08-17T15:41:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.91%. Comparing base (1bf4e9a) to head (8551570).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #987   +/-   ##
=======================================
  Coverage   40.91%   40.91%           
=======================================
  Files          70       70           
  Lines       20270    20270           
  Branches     2520     2520           
=======================================
  Hits         8293     8293           
  Misses      10439    10439           
  Partials     1538     1538

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson · 2025-08-17T19:41:09Z

it seems the maxregisters flag should be guarded by the device type (i'm sure some devices don't have that many registers)

sbryngelson · 2025-08-25T15:24:46Z

@ntselepidis is this supposed to be merged or just for fun?

wilfonba · 2025-08-26T19:45:18Z

My understanding is that the change to m_igr is guarranteed to improve performance, but the addition of -gpu=max:regcount is specific to IGR and could mess with performance for WENO and other existing features.

sbryngelson · 2025-09-12T16:16:55Z

@ntselepidis what is the status of this? seems like we need a cmake guard for specific architectures or something

ntselepidis · 2025-10-22T12:40:12Z

As @wilfonba correctly noted, we found that setting -gpu=maxregcount:128 improves performance of m_igr on Grace-Hopper, as it reduces the registers and leads to higher occupancy.
Before merging this we need to:

guard this in the CMakefile.txt so that it takes effect depending on the target architecture, i.e. Grace-Hopper
evaluate the performance impact of this flag on WENO on Grace-Hopper
evaluate the performance impact of this flag on other architectures such as Blackwell or Ampere
double check if the same value is fine for single and double precision arithmetic

sbryngelson · 2025-10-23T14:15:48Z

sounds about right

sbryngelson · 2026-05-04T14:43:11Z

Claude Code Review

PR #987 — Performance tuning for NVIDIA Grace-Hopper (Gordon Bell runs)

Two files changed: CMakeLists.txt (+1/-1) and src/simulation/m_igr.fpp (+1).

Critical Issues

1. -gpu=maxregcount:128 is applied globally without architecture or module scope guard
CMakeLists.txt, NVHPC GPU options block

The flag is added unconditionally inside the NVHPC target_compile_options block that fires for ALL GPU targets and applies to every GPU kernel — WENO reconstruction, FFT, viscous flux, bubble dynamics, etc. — not just IGR.

On A100/H100 (Ampere/Hopper), whether capping at 128 registers is net positive for WENO is untested.
As @wilfonba noted, the register count flag could negatively impact WENO and other features. As @sbryngelson noted, it should be guarded by device type.
No guard on MFC_Unified, no per-file scoping, no CMake opt-in.

Required fix before merge — either gate on MFC_Unified:

if (MFC_Unified)
    target_compile_options(simulation PRIVATE -gpu=maxregcount:128)
endif()

Or scope to just m_igr.fpp.f90 to avoid impacting WENO entirely:

set_source_files_properties(
    "${CMAKE_BINARY_DIR}/fypp/simulation/m_igr.fpp.f90"
    PROPERTIES COMPILE_OPTIONS "-gpu=maxregcount:128"
)

Confidence: 95

Important Issues

2. Verify GPU_DECLARE branch placement in m_igr.fpp

The GitHub diff context reports the $:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') insertion as landing inside the #ifdef __NVCOMPILER_GPU_UNIFIED_MEM block. If true, this is incorrect: in the UVM path, these variables are pointer-typed and managed via @:PREFER_GPU / manual pointer assignment; adding declare create there conflicts with that logic and could cause double-mapping or runtime errors.

The insertion should be in the #else block (non-UVM path, allocatable arrays), consistent with the established pattern for Res_igr, coeff_L, coeff_R in this module.

Please verify the diff against HEAD confirms placement is in the #else branch, not inside #ifdef __NVCOMPILER_GPU_UNIFIED_MEM.

Confidence: 82

Suggestions

Document in a CMake comment why the register cap exists and for which hardware, so it is not silently removed in a future cleanup.
Add a before/after performance comparison for a WENO-heavy case on a non-Debug working for nvhpc on Phoenix? #200 NVIDIA GPU (A100 or H100) before merging.
Single-precision validation (--single) is needed: register pressure differs between real(8) and real(4) kernels.

Strengths

The GPU_DECLARE addition (in the #else / non-UVM path) follows the established pattern used for all other allocatable arrays in this module. No raw !/! directives introduced.
PR is correctly kept as Draft; author has already identified remaining work.
Precision types (wp, stp) are correct throughout — no forbidden patterns (d-literals, real(8), dble) introduced.

sbryngelson · 2026-05-06T15:00:37Z

Automated Code Review

Summary: 1 critical issue, 1 important issue.

Critical Issue

GPU_DECLARE on conditionally-allocated jac_old conflicts with @:ALLOCATE (confidence: 88%)

File: src/simulation/m_igr.fpp

The new directive $:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') is a module-level declaration that makes all three arrays device-resident for the entire module lifetime. However, jac_old is only allocated conditionally:

if (igr_iter_solver == 1) then
    @:ALLOCATE(jac_old(...))
end if

@:ALLOCATE expands to allocate + GPU_ENTER_DATA(create=...). When the module-level declare create is already active, the subsequent GPU_ENTER_DATA on jac_old at runtime attempts a second device allocation for a variable already in the device table — this is an OpenACC runtime error or produces a stale device pointer. The matching @:DEALLOCATE(jac_old) also emits GPU_EXIT_DATA(delete=...) which conflicts with the module-scope declare expecting the variable to remain resident.

jac and jac_rhs are always allocated unconditionally, so they are safe candidates for GPU_DECLARE.

Fix: Split the declare to exclude jac_old:

$:GPU_DECLARE(create='[jac, jac_rhs]')

The existing @:ALLOCATE(jac_old(...)) already handles GPU device memory for the conditional path.

Important Issue

-gpu=maxregcount is a global flag applied to all kernels project-wide (confidence: 80%)

The flag is correctly gated to nvfortran GPU builds, but it applies to every kernel in the simulation target — WENO, Riemann solvers, viscous flux, bubble dynamics, FFT — when it was tuned specifically for IGR Jacobian kernels. A single global register cap will pessimize register usage for every other kernel.

The standard pattern for feature-specific NVHPC flags in MFC is set_source_files_properties in CMakeLists.txt. Recommend scoping to m_igr.fpp.f90 only, or adding a CMake option (e.g., MFC_MaxRegCount) analogous to MFC_Fastmath so users can opt in.

Strengths

CMake scoping is correct: the flag is inside the (MFC_OpenACC AND ARGS_OpenACC) OR (MFC_OpenMP AND ARGS_OpenMP) + NVHPC/PGI block — gfortran, Intel ifx, Cray ftn, and AMD flang are all unaffected
Using GPU_DECLARE (Fypp macro) rather than a raw !$acc declare pragma is correct per project rules

Reduce registers

a1ed1a2

ntselepidis changed the title ~~Do some performance tuning for NVIDIA systems~~ Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs Aug 17, 2025

ntselepidis marked this pull request as ready for review August 17, 2025 13:40

ntselepidis requested review from a team August 17, 2025 13:40

qodo-code-review Bot added the Review effort 2/5 label Aug 17, 2025

sbryngelson previously approved these changes Aug 17, 2025

View reviewed changes

ntselepidis marked this pull request as draft August 18, 2025 18:09

Set maxregcount to 128

227d018

Merge branch 'master' into nvidia_perf_tuning

8551570

ntselepidis dismissed sbryngelson’s stale review via 8551570 December 18, 2025 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs#987

Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs#987
ntselepidis wants to merge 3 commits intoMFlowCode:masterfrom
ntselepidis:nvidia_perf_tuning

ntselepidis commented Aug 17, 2025 •

edited

Loading

Uh oh!

qodo-code-review Bot commented Aug 17, 2025

Uh oh!

qodo-code-review Bot commented Aug 17, 2025

Uh oh!

codecov Bot commented Aug 17, 2025 •

edited

Loading

Uh oh!

sbryngelson commented Aug 17, 2025

Uh oh!

sbryngelson commented Aug 25, 2025

Uh oh!

wilfonba commented Aug 26, 2025

Uh oh!

sbryngelson commented Sep 12, 2025

Uh oh!

ntselepidis commented Oct 22, 2025

Uh oh!

sbryngelson commented Oct 23, 2025

Uh oh!

sbryngelson commented May 4, 2026

Uh oh!

sbryngelson commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

Conversation

ntselepidis commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

qodo-code-review Bot commented Aug 17, 2025

PR Reviewer Guide 🔍

Uh oh!

qodo-code-review Bot commented Aug 17, 2025

PR Code Suggestions ✨

Uh oh!

codecov Bot commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sbryngelson commented Aug 17, 2025

Uh oh!

sbryngelson commented Aug 25, 2025

Uh oh!

wilfonba commented Aug 26, 2025

Uh oh!

sbryngelson commented Sep 12, 2025

Uh oh!

ntselepidis commented Oct 22, 2025

Uh oh!

sbryngelson commented Oct 23, 2025

Uh oh!

sbryngelson commented May 4, 2026

Claude Code Review

Critical Issues

Important Issues

Suggestions

Strengths

Uh oh!

sbryngelson commented May 6, 2026

Automated Code Review

Critical Issue

Important Issue

Strengths

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

ntselepidis commented Aug 17, 2025 •

edited

Loading

codecov Bot commented Aug 17, 2025 •

edited

Loading