Skip to content

Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs#987

Draft
ntselepidis wants to merge 3 commits intoMFlowCode:masterfrom
ntselepidis:nvidia_perf_tuning
Draft

Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs#987
ntselepidis wants to merge 3 commits intoMFlowCode:masterfrom
ntselepidis:nvidia_perf_tuning

Conversation

@ntselepidis
Copy link
Copy Markdown
Contributor

@ntselepidis ntselepidis commented Aug 17, 2025

User description

This PR will add some changes to reduce registers and improve performance.
Tested with NVHPC nightly on single Santis node with 4 Grace-Hoppers on 3D_IGR_TaylorGreenVortex_nvidia case.


PR Type

Enhancement


Description

  • Add GPU register limit optimization for NVIDIA Grace-Hopper

  • Include GPU memory management directive for Jacobian arrays


Diagram Walkthrough

flowchart LR
  A["CMakeLists.txt"] -- "add maxregcount:165" --> B["GPU Register Limit"]
  C["m_igr.fpp"] -- "add GPU_DECLARE" --> D["GPU Memory Management"]
  B --> E["Performance Optimization"]
  D --> E
Loading

File Walkthrough

Relevant files
Configuration changes
CMakeLists.txt
Set GPU register count limit                                                         

CMakeLists.txt

  • Add -gpu=maxregcount:165 compiler flag to limit GPU register usage
+1/-1     
Enhancement
m_igr.fpp
Add GPU memory management directive                                           

src/simulation/m_igr.fpp

  • Add $:GPU_DECLARE directive for Jacobian arrays (jac, jac_rhs,
    jac_old)
+1/-0     

@ntselepidis ntselepidis changed the title Do some performance tuning for NVIDIA systems Performance tuning for NVIDIA Grace-Hopper for the Gordon Bell runs Aug 17, 2025
@ntselepidis ntselepidis marked this pull request as ready for review August 17, 2025 13:40
@ntselepidis ntselepidis requested review from a team August 17, 2025 13:40
@qodo-code-review
Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The $:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') is introduced for pointer variables jac, jac_rhs, and jac_old. If these pointers are later associated/allocated differently than expected, the directive may not correctly manage their device residency, potentially causing mismatched host/device pointers or lifetimes. Validate that allocation/association and deallocation flows are compatible with this directive.

real(wp), pointer, contiguous, dimension(:, :, :) :: jac, jac_rhs, jac_old
$:GPU_DECLARE(create='[jac, jac_rhs, jac_old]')
real(wp), allocatable, dimension(:, :, :), pinned, target :: jac_host
Perf Risk

Hard-coding -gpu=maxregcount:165 may improve occupancy on GH200 but can cause spills and regressions on other NVIDIA GPUs or with different kernels/configurations. Consider guarding by architecture, making it configurable, or validating via ptxinfo to ensure no excessive spilling occurs.

target_compile_options(${a_target}
    PRIVATE -gpu=keep,ptxinfo,lineinfo -gpu=maxregcount:165
)

@qodo-code-review
Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Merge GPU flags into one

Combine GPU flags into a single -gpu option to avoid later flags overriding
earlier ones in NVHPC. Using multiple -gpu options can lead to only the last one
being honored, silently dropping requested features. Merge them with commas to
ensure all settings apply.

CMakeLists.txt [501-503]

 target_compile_options(${a_target}
-    PRIVATE -gpu=keep,ptxinfo,lineinfo -gpu=maxregcount:165
+    PRIVATE -gpu=keep,ptxinfo,lineinfo,maxregcount:165
 )
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that using multiple -gpu flags with the NVHPC compiler can lead to earlier flags being overridden, and proposes the correct fix of merging them.

Medium
Fix GPU declare list syntax

Ensure the GPU declaration uses the correct list syntax expected by your
codegen/macros. If the directive expects a comma-separated list without
brackets, the current bracketed form may be ignored at compile-time, leaving
jac
arrays undeclared on GPU.
*

src/simulation/m_igr.fpp [29-30]

 real(wp), pointer, contiguous, dimension(:, :, :) :: jac, jac_rhs, jac_old
-$:GPU_DECLARE(create='[jac, jac_rhs, jac_old]')
+$:GPU_DECLARE(create='jac, jac_rhs, jac_old')
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential syntax issue in the custom $:GPU_DECLARE directive, but it is speculative as it depends on the macro's specific implementation which is not visible.

Low
  • More

@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.91%. Comparing base (1bf4e9a) to head (8551570).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #987   +/-   ##
=======================================
  Coverage   40.91%   40.91%           
=======================================
  Files          70       70           
  Lines       20270    20270           
  Branches     2520     2520           
=======================================
  Hits         8293     8293           
  Misses      10439    10439           
  Partials     1538     1538           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson
sbryngelson previously approved these changes Aug 17, 2025
@sbryngelson
Copy link
Copy Markdown
Member

it seems the maxregisters flag should be guarded by the device type (i'm sure some devices don't have that many registers)

@ntselepidis ntselepidis marked this pull request as draft August 18, 2025 18:09
@sbryngelson
Copy link
Copy Markdown
Member

@ntselepidis is this supposed to be merged or just for fun?

@wilfonba
Copy link
Copy Markdown
Contributor

My understanding is that the change to m_igr is guarranteed to improve performance, but the addition of -gpu=max:regcount is specific to IGR and could mess with performance for WENO and other existing features.

@sbryngelson
Copy link
Copy Markdown
Member

@ntselepidis what is the status of this? seems like we need a cmake guard for specific architectures or something

@ntselepidis
Copy link
Copy Markdown
Contributor Author

As @wilfonba correctly noted, we found that setting -gpu=maxregcount:128 improves performance of m_igr on Grace-Hopper, as it reduces the registers and leads to higher occupancy.
Before merging this we need to:

  1. guard this in the CMakefile.txt so that it takes effect depending on the target architecture, i.e. Grace-Hopper
  2. evaluate the performance impact of this flag on WENO on Grace-Hopper
  3. evaluate the performance impact of this flag on other architectures such as Blackwell or Ampere
  4. double check if the same value is fine for single and double precision arithmetic

@sbryngelson
Copy link
Copy Markdown
Member

sounds about right

@sbryngelson
Copy link
Copy Markdown
Member

Claude Code Review

PR #987 — Performance tuning for NVIDIA Grace-Hopper (Gordon Bell runs)

Two files changed: CMakeLists.txt (+1/-1) and src/simulation/m_igr.fpp (+1).


Critical Issues

1. -gpu=maxregcount:128 is applied globally without architecture or module scope guard
CMakeLists.txt, NVHPC GPU options block

The flag is added unconditionally inside the NVHPC target_compile_options block that fires for ALL GPU targets and applies to every GPU kernel — WENO reconstruction, FFT, viscous flux, bubble dynamics, etc. — not just IGR.

  • On A100/H100 (Ampere/Hopper), whether capping at 128 registers is net positive for WENO is untested.
  • As @wilfonba noted, the register count flag could negatively impact WENO and other features. As @sbryngelson noted, it should be guarded by device type.
  • No guard on MFC_Unified, no per-file scoping, no CMake opt-in.

Required fix before merge — either gate on MFC_Unified:

if (MFC_Unified)
    target_compile_options(simulation PRIVATE -gpu=maxregcount:128)
endif()

Or scope to just m_igr.fpp.f90 to avoid impacting WENO entirely:

set_source_files_properties(
    "${CMAKE_BINARY_DIR}/fypp/simulation/m_igr.fpp.f90"
    PROPERTIES COMPILE_OPTIONS "-gpu=maxregcount:128"
)

Confidence: 95


Important Issues

2. Verify GPU_DECLARE branch placement in m_igr.fpp

The GitHub diff context reports the $:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') insertion as landing inside the #ifdef __NVCOMPILER_GPU_UNIFIED_MEM block. If true, this is incorrect: in the UVM path, these variables are pointer-typed and managed via @:PREFER_GPU / manual pointer assignment; adding declare create there conflicts with that logic and could cause double-mapping or runtime errors.

The insertion should be in the #else block (non-UVM path, allocatable arrays), consistent with the established pattern for Res_igr, coeff_L, coeff_R in this module.

Please verify the diff against HEAD confirms placement is in the #else branch, not inside #ifdef __NVCOMPILER_GPU_UNIFIED_MEM.

Confidence: 82


Suggestions

  • Document in a CMake comment why the register cap exists and for which hardware, so it is not silently removed in a future cleanup.
  • Add a before/after performance comparison for a WENO-heavy case on a non-Debug working for nvhpc on Phoenix? #200 NVIDIA GPU (A100 or H100) before merging.
  • Single-precision validation (--single) is needed: register pressure differs between real(8) and real(4) kernels.

Strengths

  • The GPU_DECLARE addition (in the #else / non-UVM path) follows the established pattern used for all other allocatable arrays in this module. No raw !/! directives introduced.
  • PR is correctly kept as Draft; author has already identified remaining work.
  • Precision types (wp, stp) are correct throughout — no forbidden patterns (d-literals, real(8), dble) introduced.

@sbryngelson
Copy link
Copy Markdown
Member

Automated Code Review

Summary: 1 critical issue, 1 important issue.

Critical Issue

GPU_DECLARE on conditionally-allocated jac_old conflicts with @:ALLOCATE (confidence: 88%)

File: src/simulation/m_igr.fpp

The new directive $:GPU_DECLARE(create='[jac, jac_rhs, jac_old]') is a module-level declaration that makes all three arrays device-resident for the entire module lifetime. However, jac_old is only allocated conditionally:

if (igr_iter_solver == 1) then
    @:ALLOCATE(jac_old(...))
end if

@:ALLOCATE expands to allocate + GPU_ENTER_DATA(create=...). When the module-level declare create is already active, the subsequent GPU_ENTER_DATA on jac_old at runtime attempts a second device allocation for a variable already in the device table — this is an OpenACC runtime error or produces a stale device pointer. The matching @:DEALLOCATE(jac_old) also emits GPU_EXIT_DATA(delete=...) which conflicts with the module-scope declare expecting the variable to remain resident.

jac and jac_rhs are always allocated unconditionally, so they are safe candidates for GPU_DECLARE.

Fix: Split the declare to exclude jac_old:

$:GPU_DECLARE(create='[jac, jac_rhs]')

The existing @:ALLOCATE(jac_old(...)) already handles GPU device memory for the conditional path.

Important Issue

-gpu=maxregcount is a global flag applied to all kernels project-wide (confidence: 80%)

The flag is correctly gated to nvfortran GPU builds, but it applies to every kernel in the simulation target — WENO, Riemann solvers, viscous flux, bubble dynamics, FFT — when it was tuned specifically for IGR Jacobian kernels. A single global register cap will pessimize register usage for every other kernel.

The standard pattern for feature-specific NVHPC flags in MFC is set_source_files_properties in CMakeLists.txt. Recommend scoping to m_igr.fpp.f90 only, or adding a CMake option (e.g., MFC_MaxRegCount) analogous to MFC_Fastmath so users can opt in.

Strengths

  • CMake scoping is correct: the flag is inside the (MFC_OpenACC AND ARGS_OpenACC) OR (MFC_OpenMP AND ARGS_OpenMP) + NVHPC/PGI block — gfortran, Intel ifx, Cray ftn, and AMD flang are all unaffected
  • Using GPU_DECLARE (Fypp macro) rather than a raw !$acc declare pragma is correct per project rules

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants