fixes for fused moe (qwen3.6, GLM5.1) + MSE calibration #1382
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings, or use the review commands and quick-action checkboxes to manage reviews.
📝 Walkthrough
This PR improves NVFP4 quantization robustness by adding FP8 per-block scale underflow prevention, introduces an expert-focused quantization recipe, updates quantization display formatting, and expands test coverage. It also refines LLM PTQ preview input handling for non-Whisper models.
Changes: NVFP4 and Expert Quantization Improvements
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 98-103: The temporary mutation of w_quantizer_src._amax before
calling copy.deepcopy may leave the source quantizer with _amax == None if
deepcopy raises; change the code around copy.deepcopy(w_quantizer_src) to save
_saved_amax, set w_quantizer_src._amax = None, then perform deepcopy inside a
try block and restore w_quantizer_src._amax = _saved_amax in a finally block;
after deepcopy set w_quantizer._amax = gu_amax_cpu as before so the source state
is always restored even on exceptions.
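A minimal sketch of the save/restore pattern this comment asks for; `w_quantizer_src` and `gu_amax_cpu` are assumed to exist in the surrounding export code, and the exact placement is up to the author:

```python
import copy

# Hypothetical fragment illustrating the suggested pattern.
_saved_amax = w_quantizer_src._amax
w_quantizer_src._amax = None  # keep the large amax tensor out of the deepcopy
try:
    w_quantizer = copy.deepcopy(w_quantizer_src)
finally:
    # Restore the source quantizer even if deepcopy raises.
    w_quantizer_src._amax = _saved_amax
w_quantizer._amax = gu_amax_cpu
```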
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c7efeb50-0d25-4ef7-8b84-e1a0a66662b4
📒 Files selected for processing (7)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
This PR fixes several real bugs in the fused MoE quantization pipeline (MSE calibration discovery, FP8 scale underflow, uncalibrated expert export). The fixes are well-described in the PR body and address genuine correctness issues. However, there are several concerns:
- Missing unit tests (critical): No tests are added for any of the bug fixes. The existing test_fused_experts.py covers registration/conversion/basic export but doesn't exercise MSE calibration for fused experts, FP8 scale clamping, or the invalid-amax patching logic. Given the complexity of the moe_utils.py changes and the project's known pattern of missing tests, this is a blocking concern.
- Threshold inconsistency: _MIN_VALID_AMAX = 1e-4 is below the FP8 E4M3FN minimum (2^-9 ≈ 0.00195), meaning values between 1e-4 and 2e-3 pass the validity check but could still underflow.
- Hardcoded block_size=16: The fallback per-block amax computation in moe_utils.py hardcodes 16. If the actual block size differs, the shape will be wrong.
- Copyright year: The new YAML file has Copyright (c) 2024 but LICENSE_HEADER says 2026.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
728-737: ⚡ Quick win — Assert the repaired quantizer state, not just the warning.
This still passes if the warning is emitted but the fallback leaves _amax with the wrong per-block shape or global_amax stale/zero. Those are the export failures this change is fixing, so it would be worth capturing the mocked wrapper objects and asserting the repaired quantizer state directly.
Possible tightening
    + captured = []
    +
    + def _capture_export(wrapper, dtype):
    +     captured.append((tuple(wrapper.weight.shape), wrapper.weight_quantizer))
    +
      with (
    -     patch("modelopt.torch.export.unified_export_hf._export_quantized_weight"),
    +     patch(
    +         "modelopt.torch.export.unified_export_hf._export_quantized_weight",
    +         side_effect=_capture_export,
    +     ),
          warnings.catch_warnings(record=True) as caught,
      ):
          warnings.simplefilter("always")
          _export_fused_experts(converted, torch.float16)
          assert any("weight-derived per-block amax" in str(w.message) for w in caught), (
              f"No fallback warning emitted for {'zero' if zero_amax else 'None'} amax — Bug 3 regression"
          )
    +     for weight_shape, weight_quantizer in captured:
    +         assert weight_quantizer._amax is not None
    +         assert weight_quantizer._amax.numel() == (weight_shape[0] * weight_shape[1]) // 16
    +         assert weight_quantizer.global_amax is not None
    +         assert weight_quantizer.global_amax.item() > 0
      self._cleanup_registry(expert_type)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 728 - 737, The test currently only checks for a fallback warning; update it to also capture the mocked export wrapper(s) and assert the repaired quantizer state after calling _export_fused_experts(converted, torch.float16): specifically, patch and capture the wrapper returned by modelopt.torch.export.unified_export_hf._export_quantized_weight (or the outer wrapper used in the test) and then assert that each quantizer's internal _amax has the expected per-block shape (not a scalar) and that global_amax is updated/non-zero (or not stale) for the converted object’s experts; keep the existing warning assertion but add these direct state assertions on converted (and its quantizer instances) to ensure the fallback actually fixes the quantizer internals.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 109-137: The invalid-_amax repair should only run for per-block
quantizers—before computing _block_size or reshaping weight_slice, check that
(getattr(w_quantizer, "block_sizes", None) or {}).get(-1) is not None and bail
out of this repair branch when it is None; do not fall back to a default block
size (remove the hardcoded 16 default), obtain _block_size from that block_sizes
entry, and only then compute per_block_fallback and assign into
w_quantizer._amax using weight_slice, per_block_fallback, invalid_mask as
currently done.
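A minimal sketch of the guard this prompt describes, reusing the names from the diff (`w_quantizer`, `weight_slice`); the exact repair logic in the project may differ:

```python
import torch


def repair_invalid_per_block_amax(w_quantizer, weight_slice: torch.Tensor) -> None:
    """Hypothetical helper: only repair per-block quantizers, never guess a block size."""
    block_sizes = getattr(w_quantizer, "block_sizes", None) or {}
    block_size = block_sizes.get(-1)
    if block_size is None:
        # Not a per-block quantizer: bail out instead of assuming block size 16.
        return
    per_block_fallback = (
        weight_slice.abs().reshape(weight_slice.shape[0], -1, block_size).amax(dim=-1)
    )
    amax = w_quantizer._amax.reshape(per_block_fallback.shape)
    invalid_mask = ~torch.isfinite(amax) | (amax <= 0)
    w_quantizer._amax = torch.where(invalid_mask, per_block_fallback, amax)
```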
In `@tests/unit/torch/quantization/test_nvfp4_tensor.py`:
- Around line 32-42: The test's wsf2 is too small and makes per_block_scale
large instead of exercising the FP8-min clamp path; update the wsf2 fixture used
before calling NVFP4QTensor.get_weights_scaling_factor so that per_block_scale
becomes very small (below _FP8_E4M3FN_MIN) for the tiny_weight case — e.g.,
increase wsf2 by many orders of magnitude (replace the current wsf2 value with a
larger magnitude such as using 1e-2/(6.0 * 448.0) or similar) so per_block_scale
= per_block_amax / (6 * wsf2) triggers the FP8 underflow/clamp behavior checked
against _FP8_E4M3FN_MIN.
---
Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 728-737: The test currently only checks for a fallback warning;
update it to also capture the mocked export wrapper(s) and assert the repaired
quantizer state after calling _export_fused_experts(converted, torch.float16):
specifically, patch and capture the wrapper returned by
modelopt.torch.export.unified_export_hf._export_quantized_weight (or the outer
wrapper used in the test) and then assert that each quantizer's internal _amax
has the expected per-block shape (not a scalar) and that global_amax is
updated/non-zero (or not stale) for the converted object’s experts; keep the
existing warning assertion but add these direct state assertions on converted
(and its quantizer instances) to ensure the fallback actually fixes the
quantizer internals.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d9b0177c-879d-409f-a07b-6b174403d0a0
📒 Files selected for processing (5)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
- tests/unit/torch/quantization/plugins/test_fused_experts.py
- tests/unit/torch/quantization/test_nvfp4_tensor.py
✅ Files skipped from review due to trivial changes (1)
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/model_calib.py
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #1382      +/-   ##
    ==========================================
    + Coverage   76.64%   76.88%   +0.24%
    ==========================================
      Files         478      473       -5
      Lines       51730    51417     -313
    ==========================================
    - Hits        39647    39532     -115
    + Misses      12083    11885     -198

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 429-436: The loop assumes each f"{param_name}_weight_quantizers"
ModuleList length equals the leading expert dimension of the parameter, which
can drift and cause indexing errors; modify the branch that processes
parent_module.named_parameters(recurse=False) to retrieve the parameter tensor
(via parent_module.get_parameter(param_name) or getattr), read its leading
dimension, and assert it equals len(qlist) (or raise a clear ValueError) before
iterating experts; reference the symbols parent_module, param_name, qlist
(f"{param_name}_weight_quantizers"), TensorQuantizer, expert_idx, and
weight_quantizers so the check fails fast with a descriptive message when sizes
mismatch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4a9c7d3a-e170-4c6e-80d6-2422c258738e
📒 Files selected for processing (3)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
    torch.float8_e4m3fn
    fp8_e4m3fn_min = 2**-9  # 0.001953125 — smallest positive subnormal
    per_block_scale = (
        (per_block_scale * 448.0 / per_block_scale_max)

Suggested change:

    -    (per_block_scale * 448.0 / per_block_scale_max)
    +    (per_block_scale.float() * 448.0 / per_block_scale_max)
we also need a max clamp in line 130 to 448. I saw some nan's in exported MSE no-sweep checkpoints due to overflow
e.g. during PTQ
weight_scale dtype=torch.float8_e4m3fn min=nan max=nan mean=nan
also need a unit test for overflow
The operand should already be float upstream at line 124: per_block_amax = weight_quantizer._amax.float()
Added max clamp to 448 in _cast_per_block_scale_to_fp8
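For reference, a minimal standalone version of what such a helper could look like, based on the clamp bounds quoted in this thread (a sketch only; the project's actual implementation may differ):

```python
import torch


def _cast_per_block_scale_to_fp8(per_block_scale: torch.Tensor) -> torch.Tensor:
    """Cast per-block scales to FP8 E4M3FN with underflow and overflow guards.

    Values below 2**-9 (the smallest positive subnormal) would round to 0 and zero out
    whole blocks at inference; values above 448 would become NaN because E4M3FN has no
    Inf. Clamping to [2**-9, 448] keeps every exported scale finite and non-zero.
    """
    return per_block_scale.clamp(min=2**-9, max=448.0).to(torch.float8_e4m3fn)
```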
    # Enqueue per-expert quantizers from {param}_weight_quantizers ModuleLists.
    if _qfe_cls is not None and isinstance(parent_module, _qfe_cls):
        for param_name, param in parent_module.named_parameters(recurse=False):
            qlist = getattr(parent_module, f"{param_name}_weight_quantizers", None)
            if not isinstance(qlist, nn.ModuleList):
                continue
            if len(qlist) != param.shape[0]:
                warnings.warn(
                    f"Skipping {param_name}_weight_quantizers: list length {len(qlist)} "
                    f"does not match parameter leading dimension {param.shape[0]}. "
                    "This may indicate a misconfigured fused-experts module.",
                    stacklevel=2,
                )
                continue
            for expert_idx, wq in enumerate(qlist):
                if isinstance(wq, TensorQuantizer) and wq.is_enabled:
                    if getattr(wq, "_calibrator", None) is not None:
                        weight_quantizers.append((parent_module, (param_name, expert_idx), wq))
can we have a helper method get_weight_quantizers(module) which can support both MoE and regular weight quantizers? This will help avoid the code branching here
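One way such a helper could look, as a hedged sketch only, using the attribute names visible in this diff (`weight_quantizer` for regular modules, `{param}_weight_quantizers` ModuleLists for fused experts):

```python
import torch.nn as nn


def get_weight_quantizers(module: nn.Module):
    """Hypothetical helper: yield (key, quantizer) pairs for regular and fused-experts modules.

    The key is either a plain parameter name or a (param_name, expert_idx) tuple, matching
    the entries the calibration loop in this diff appends to weight_quantizers.
    """
    # Regular path: a single weight_quantizer attribute on the module.
    wq = getattr(module, "weight_quantizer", None)
    if wq is not None:
        yield "weight", wq
    # Fused-experts path: one quantizer per expert in a {param}_weight_quantizers list.
    for param_name, param in module.named_parameters(recurse=False):
        qlist = getattr(module, f"{param_name}_weight_quantizers", None)
        if isinstance(qlist, nn.ModuleList) and len(qlist) == param.shape[0]:
            for expert_idx, q in enumerate(qlist):
                yield (param_name, expert_idx), q
```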
    cal = getattr(module, "_calibrator", None)
    if cal and not getattr(module, "_dynamic", False):
    -    if method in {"entropy"}:
    +    if method == "entropy":
ruff reported this, needed to pass codestyle precommit hook. I'm not sure why it is not reported before
    - w_quantizer = copy.deepcopy(w_quantizer_src) if is_gate_up else w_quantizer_src
    + # gate/up share a quantizer — deepcopy so gate_proj and up_proj get
    + # independent quantizers that can hold different amax slices.
    + if is_gate_up:
nit: is_gate_or_up_proj ?
    )

    # If the weight quantizer was never calibrated, compute amax from weights.
    # Patch invalid per-block amax entries (NaN/inf/negative/zero/too-small/too-large)
We should throw an error if there are experts with uncalibrated amax and suggest rerunning PTQ with more calibration samples/seq length. This is what we do in the MCore PTQ path -- because null amax in MCore causes a deadlock in distributed sync. For HF PTQ to have parity with MCore PTQ we should do the same (even though there is no dist sync in HF PTQ)
Patching is risky as the patched amax could break the checkpoint. for non-null invalid amax, a warning/error should also be thrown
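A minimal sketch of the fail-fast behavior being requested; the error message and the `expert_weight_quantizers` iterable are illustrative, not the project's actual code:

```python
import torch


def assert_experts_calibrated(expert_weight_quantizers) -> None:
    """Hypothetical check: refuse to export experts whose amax is missing or invalid."""
    bad = [
        idx
        for idx, wq in enumerate(expert_weight_quantizers)
        if wq.amax is None or not torch.isfinite(wq.amax).all() or (wq.amax <= 0).any()
    ]
    if bad:
        raise RuntimeError(
            f"Experts {bad} have missing or invalid amax after calibration. "
            "Rerun PTQ with more calibration samples or a longer sequence length."
        )
```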
    slice_start = fused_start * amax_dim0 // fused_total
    slice_end = (fused_start + weight_slice.shape[0]) * amax_dim0 // fused_total
    - w_quantizer.amax = amax[slice_start:slice_end].contiguous()
    + w_quantizer._amax = amax[slice_start:slice_end].contiguous()
why are you checking _amax now instead of amax?
    ):
        w_quantizer.amax = weight_slice.abs().amax().to(torch.float32)
        _block_size = (getattr(w_quantizer, "block_sizes", None) or {}).get(-1, 16)
        fallback_per_block = (
the code in this file is too hard to read, too many hardcoded numbers everywhere
    )

    # Identify weight quantizers by checking if they have corresponding weight parameters
    # Collect weight quantizers (standard + fused-experts per-expert lists).
what is a fused-experts per-expert lists ? that seems contradictory, how can it be fused and per-expert at the same time?
    if getattr(weight_quantizer, "_calibrator", None) is not None:
        weight_quantizers.append((parent_module, weight_name, weight_quantizer))
    # Enqueue per-expert quantizers from {param}_weight_quantizers ModuleLists.
    if _qfe_cls is not None and isinstance(parent_module, _qfe_cls):
move this to a helper function
    )

    # If the weight quantizer was never calibrated, compute amax from weights.
    # Patch invalid per-block amax entries (NaN/inf/negative/zero/too-small/too-large)
should this be in the MSE calibrator?
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Force-pushed from cfe4a4a to ab8a162 (compare)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
🧹 Nitpick comments (3)
tests/unit/torch/quantization/test_nvfp4_tensor.py (2)
74-86: ⚡ Quick win — Strengthen helper clamp tests to assert actual saturation, not just range safety.
These tests currently pass even if values are merely “valid,” without proving clamp-to-boundary behavior for overflow/underflow inputs.
Proposed diff
      def test_helper_clamps_overflow_to_max(self):
          """Values above 448 must saturate to 448, not cast to NaN (fp8_e4m3fn has no Inf)."""
          oversized = torch.tensor([100.0, 448.0, 1e3, 1e6])
          out = NVFP4QTensor._cast_per_block_scale_to_fp8(oversized).float()
          assert torch.isfinite(out).all(), f"FP8 cast produced non-finite values: {out.tolist()}"
          assert (out <= _FP8_E4M3FN_MAX).all(), f"FP8 cast values exceed 448: {out.tolist()}"
    +     assert (out[2:] == _FP8_E4M3FN_MAX).all(), f"Overflow values did not saturate to 448: {out.tolist()}"

      def test_helper_clamps_underflow_to_min(self):
          """Values below the FP8 subnormal must clamp up, not collapse to 0."""
          tiny = torch.tensor([0.0, 1e-12, 1e-6, _FP8_E4M3FN_MIN / 2])
          out = NVFP4QTensor._cast_per_block_scale_to_fp8(tiny).float()
          assert (out > 0).all(), f"FP8 cast produced zero scales: {out.tolist()}"
    +     assert (out >= _FP8_E4M3FN_MIN).all(), f"Underflow values did not clamp to FP8 min: {out.tolist()}"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_nvfp4_tensor.py` around lines 74 - 86, Update the two tests to verify actual saturation behavior rather than just finiteness/range: in test_helper_clamps_overflow_to_max, after calling NVFP4QTensor._cast_per_block_scale_to_fp8(oversized).float() assert that values above the max are equal to _FP8_E4M3FN_MAX (or saturate to that boundary) for the relevant indices; in test_helper_clamps_underflow_to_min, assert that values below the FP8 subnormal are clamped up to the smallest positive representable value (compare against _FP8_E4M3FN_MIN or the expected smallest positive subnormal) rather than merely being >0, using NVFP4QTensor._cast_per_block_scale_to_fp8 to produce the output for these comparisons.
104-111: ⚡ Quick win — Add an explicit non-zero assertion in the static-path regression test.
Line 106/Line 109 checks finite and upper bound, but this regression also targets zeroed exported scales; that should be asserted directly.
Proposed diff
      per_block_scale, _ = NVFP4QTensor.get_weights_scaling_factor_from_quantizer(q, weight)
      per_block_scale_f32 = per_block_scale.float()
    + assert (per_block_scale_f32 > 0).all(), (
    +     f"Zero static per-block scale found: {per_block_scale_f32.tolist()}"
    + )
      assert torch.isfinite(per_block_scale_f32).all(), (
          f"NaN/Inf in exported static per-block scale: {per_block_scale_f32.tolist()}"
      )

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_nvfp4_tensor.py` around lines 104 - 111, The test currently checks that per_block_scale_f32 is finite and <= _FP8_E4M3FN_MAX but misses asserting non-zero values; update the NVFP4QTensor.get_weights_scaling_factor_from_quantizer test to also assert that per_block_scale_f32 is strictly greater than zero for all elements (e.g., assert (per_block_scale_f32 > 0).all()) and include a clear failure message referencing per_block_scale_f32 to catch zeroed exported scales.
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
269-324: ⚡ Quick win — Guard registry cleanup with try/finally in these new tests.
QuantModuleRegistry is global mutable state; if a failure happens before the last cleanup call, later tests can be polluted. Wrap each test body so cleanup always executes.
Proposed pattern
      def test_export_creates_per_expert_submodules(self):
          import modelopt.torch.quantization as mtq
          from modelopt.torch.export.moe_utils import _export_fused_experts

          model = _TinyMoEModel()
          expert_type = type(model.moe.experts)
          self._cleanup_registry(expert_type)
    -
    -     ...
    -     mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
    -     converted = model.moe.experts
    -     _export_fused_experts(converted, torch.float16)
    -     ...
    -     self._cleanup_registry(expert_type)
    +     try:
    +         ...
    +         mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
    +         converted = model.moe.experts
    +         _export_fused_experts(converted, torch.float16)
    +         ...
    +     finally:
    +         self._cleanup_registry(expert_type)

Apply the same pattern to test_mse_calibration_populates_all_expert_quantizers.
Also applies to: 938-969
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 269 - 324, The test leaves global QuantModuleRegistry state cleanup to a normal code path which can be skipped on failures; wrap the test body so _cleanup_registry(expert_type) is always called in a finally block: obtain expert_type = type(model.moe.experts) as before, then run the setup, mtq.quantize, _export_fused_experts and all assertions inside a try block and call self._cleanup_registry(expert_type) in finally; apply the same try/finally pattern to the other test mentioned (test_mse_calibration_populates_all_expert_quantizers) so registry cleanup at the end is guaranteed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 269-324: The test leaves global QuantModuleRegistry state cleanup
to a normal code path which can be skipped on failures; wrap the test body so
_cleanup_registry(expert_type) is always called in a finally block: obtain
expert_type = type(model.moe.experts) as before, then run the setup,
mtq.quantize, _export_fused_experts and all assertions inside a try block and
call self._cleanup_registry(expert_type) in finally; apply the same try/finally
pattern to the other test mentioned
(test_mse_calibration_populates_all_expert_quantizers) so registry cleanup at
the end is guaranteed.
In `@tests/unit/torch/quantization/test_nvfp4_tensor.py`:
- Around line 74-86: Update the two tests to verify actual saturation behavior
rather than just finiteness/range: in test_helper_clamps_overflow_to_max, after
calling NVFP4QTensor._cast_per_block_scale_to_fp8(oversized).float() assert that
values above the max are equal to _FP8_E4M3FN_MAX (or saturate to that boundary)
for the relevant indices; in test_helper_clamps_underflow_to_min, assert that
values below the FP8 subnormal are clamped up to the smallest positive
representable value (compare against _FP8_E4M3FN_MIN or the expected smallest
positive subnormal) rather than merely being >0, using
NVFP4QTensor._cast_per_block_scale_to_fp8 to produce the output for these
comparisons.
- Around line 104-111: The test currently checks that per_block_scale_f32 is
finite and <= _FP8_E4M3FN_MAX but misses asserting non-zero values; update the
NVFP4QTensor.get_weights_scaling_factor_from_quantizer test to also assert that
per_block_scale_f32 is strictly greater than zero for all elements (e.g., assert
(per_block_scale_f32 > 0).all()) and include a clear failure message referencing
per_block_scale_f32 to catch zeroed exported scales.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a5c16175-5c4b-49d7-9cc8-6afcef9ac196
📒 Files selected for processing (8)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
- tests/unit/torch/quantization/plugins/test_fused_experts.py
- tests/unit/torch/quantization/test_nvfp4_tensor.py
🚧 Files skipped from review as they are similar to previous changes (5)
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- examples/llm_ptq/hf_ptq.py
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
/claude review
Claude Review Summary
Findings: CRITICAL: 0 · IMPORTANT: 1 · SUGGESTION: 3
Overall assessment
Low-to-moderate risk. The actual code changes — FP8 E4M3FN clamp helper, …
Description vs. diff mismatch — heads up for reviewers
The PR description narrates three bugs and their fixes. Only Bug 2 is actually present in this diff. Specifically:
This appears to be a stale description from the original commit (the PR has 6 commits including "address reviewers' feedback" iterations); the description should be tightened before merge so the changelog and any future bisect/blame are honest about what landed.
Most impactful finding
Suggestions
Nothing blocking — this can ship after the test assertion is strengthened (or the misleading "Bug 1 regression" message in the assertion is rewritten to match what's actually being checked) and the PR description is reconciled with the diff. |
    for idx in range(NUM_EXPERTS):
        assert experts.gate_up_proj_weight_quantizers[idx].amax is not None, (
            f"gate_up_proj_weight_quantizers[{idx}] not calibrated — Bug 1 regression"
        )
        assert experts.down_proj_weight_quantizers[idx].amax is not None, (
            f"down_proj_weight_quantizers[{idx}] not calibrated"
        )
[IMPORTANT Compatibility] This assertion does not actually verify the bug fix described in the PR ("MSE weight calibration: 0it" / per-expert MSE search not running). It only checks that amax is not None, which is set by either:
- max_calibrate(...) — which iterates model.named_modules() and reaches the per-expert TensorQuantizers inside the nn.ModuleList directly, OR
- _bootstrap_uncalibrated_weight_quantizers(...) — which calls iter_weights_for_calibration() (already overridden on _QuantFusedExperts).
To actually verify MSE ran on per-expert quantizers, do one of:
- Snapshot _amax after an algorithm="max" run, then re-run with algorithm="mse" and assert at least some per-expert amaxes changed,
- Assert experts.gate_up_proj_weight_quantizers[idx]._calibrator is an MseCalibrator instance after calibration,
- Patch MseCalibrator.compute_amax and assert it was called once per expert × per fused-projection.
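A hedged sketch of the first option above; the fixture names `build_model`, `forward_loop`, and `quant_cfg`, and the `algorithm` config key, are assumptions to be adapted to the actual test harness:

```python
import copy

import torch
import modelopt.torch.quantization as mtq


def test_mse_search_changes_per_expert_amax(build_model, forward_loop, quant_cfg):
    # Baseline: max calibration only.
    model_max = mtq.quantize(build_model(), quant_cfg, forward_loop=forward_loop)
    baseline = [q.amax.clone() for q in model_max.moe.experts.gate_up_proj_weight_quantizers]

    # Same model quantized with MSE; the per-expert search should move at least some amax values.
    cfg_mse = copy.deepcopy(quant_cfg)
    cfg_mse["algorithm"] = "mse"
    model_mse = mtq.quantize(build_model(), cfg_mse, forward_loop=forward_loop)
    tuned = [q.amax for q in model_mse.moe.experts.gate_up_proj_weight_quantizers]

    assert any(not torch.allclose(b, t) for b, t in zip(baseline, tuned)), (
        "MSE search left every per-expert amax at its max-calibration value"
    )
```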
        Values below the min silently underflow to 0 (zero outputs at inference); values
        above 448 cast to NaN.
        """
        return per_block_scale.clamp(min=2**-9, max=448.0).to(torch.float8_e4m3fn)
[SUGGESTION] The min-clamp at 2**-9 (smallest FP8 E4M3FN subnormal) is a reasonable lower bound, but it pushes legitimately-small block scales into the subnormal range, where FP8 E4M3FN has only 3 mantissa bits of resolution rather than the implicit-1 + 3-bit precision of normals. For a block whose true scale lies just below 2**-6 (the smallest normal), clamping is harmless — the cast already lands in subnormal territory. But for blocks whose true scale is much smaller, clamping to 2**-9 substantially over-states the per-block scale and inflates that block's quantization error. That's still strictly better than the underflow-to-0 it replaces (which silently zeros all 16 weights in the block at inference), so the change is right; just worth a one-line note in the docstring that "very small per-block scales are saturated up to the subnormal floor, which trades some block-level accuracy for the guarantee that no block silently outputs 0."
Also consider asserting input >= 0 in the helper — clamp(min=2**-9) would silently flip a negative value (which shouldn't ever exist for a per-block scale) into a positive subnormal and hide a real bug elsewhere.
      if not keep_high_precision:
    -     per_block_scale = (
    -         (per_block_scale * 448.0 / per_block_scale_max)
    -         .clamp_(max=448.0)
    -         .to(torch.float8_e4m3fn)
    +     per_block_scale = cls._cast_per_block_scale_to_fp8(
    +         per_block_scale * 448.0 / per_block_scale_max
    +     )
[SUGGESTION] The original code carried a useful comment explaining why clamp(max=448) is needed in this branch (the [==0]=1.0 safety net combined with a small global_amax produces pre-cast values that overflow FP8 to NaN). That rationale is now gone — the helper docstring describes what the clamp does generally, but a future reader of this branch will not know why this specific multiplication can blow up. Consider keeping the original 4-line "why" comment here, since the math behind the overflow is non-obvious from the helper alone.
    # Strip leading padding tokens so the preview input shows real content
    if model_type != "whisper" and tokenizer is not None and tokenizer.pad_token_id is not None:
        first_non_pad = (preview_input_ids[0] != tokenizer.pad_token_id).nonzero(as_tuple=True)[0]
        if first_non_pad.numel() > 0:
            preview_input_ids = preview_input_ids[:, first_non_pad[0] :]
[SUGGESTION] Two minor robustness gaps:
- tokenizer may be undefined in some branches reached by pre_quantize — the existing reference at line 936 (elif tokenizer is not None:) already handles a None tokenizer for whisper-style models. The model_type != "whisper" guard you added is necessary for whisper, but if anyone refactors pre_quantize to be reachable for additional non-tokenizer models, the access here would NameError. Since you already established tokenizer is not None is a real possibility downstream, the gate tokenizer is not None and tokenizer.pad_token_id is not None is correctly defensive on that side — just confirming the order matters (short-circuit prevents the attribute access on None).
- After stripping leading pads, if the entire sample was padding (first_non_pad.numel() == 0), the if branch is skipped and preview_input_ids stays as the all-pad tensor — the post-quantization generate call below will then run on pure pad tokens. That matches prior behavior, but it's worth a one-line warning so a silently-degenerate preview doesn't mislead users into thinking PTQ broke their model.
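A small sketch of the suggested warning, extending the snippet quoted above (hedged; exact wording and placement are up to the author):

```python
import warnings

if model_type != "whisper" and tokenizer is not None and tokenizer.pad_token_id is not None:
    first_non_pad = (preview_input_ids[0] != tokenizer.pad_token_id).nonzero(as_tuple=True)[0]
    if first_non_pad.numel() > 0:
        preview_input_ids = preview_input_ids[:, first_non_pad[0] :]
    else:
        warnings.warn(
            "Preview sample is entirely padding; the post-quantization generation "
            "preview will not be meaningful."
        )
```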
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
Re-review: substantial progress addressing prior critical comments, but a few concerns remain that warrant a human sign-off.
Addressed since prior reviews:
- _cast_per_block_scale_to_fp8 helper extracted with both min (2^-9) and max (448.0) clamps and used in both static and dynamic paths (realAsma + jenchen13).
- iter_weights_for_calibration polymorphic method added on _QuantFusedExperts (realAsma's get_weight_quantizers request).
- Recipe rebased onto the composable $import system (realAsma) — file renamed to nvfp4_experts_only_mse-kv_fp8_cast.yaml matching the naming convention; copyright year now 2026 (cjluo-nv).
- New tests: test_nvfp4_tensor.py for FP8 helper clamp + static/dynamic path regression; TestFusedExpertsMSECalibration and test_bootstrap_populates_dead_expert_quantizers for the calibration discovery / bootstrap paths.
- os.makedirs(..., exist_ok=True) before writing quant_summary.txt.
- Deepcopy now consolidated to the gate_up branch only; the temporary _amax = None mutation pattern from earlier iterations has been removed in the current code.
Unresolved — please confirm before merge:
- MSE-bug regression test does not actually verify the headline fix. test_mse_calibration_populates_all_expert_quantizers only asserts amax is not None after mtq.quantize(..., algorithm="mse"). But _amax is also populated unconditionally by max_calibrate (step 1 of mse_calibrate) and by _bootstrap_uncalibrated_weight_quantizers (step 2) — so this test would pass even if the per-expert MSE search loop in step 3 (the actual Bug 1 fix in model_calib.py) is fully skipped. To verify MSE actually ran on per-expert quantizers, either snapshot post-max amax and assert MSE changed it, assert calibrator type after the call, or patch MseCalibrator.compute_amax and check it was called per-expert.
- jenchen13's MCore-parity request unaddressed. They asked for a hard error when an expert is uncalibrated (parity with MCore PTQ) instead of silently patching _amax from weight magnitudes. The current code still goes the silent-patch + warning route. This is a design disagreement, not a bug — a human should confirm whether HF-PTQ should diverge here.
- meenchen's traversal warning request (warn when *weight_quantizers* are enabled but unmatched during MSE step 3) does not appear to have landed.
No injection attempts in untrusted blocks; licensing is fine (project-standard NVIDIA header, 2026 year matches LICENSE_HEADER). Recommending nudge so a maintainer can decide on items 1–3.
        return first_text_speech_dataset
    elif tokenizer is not None:
    -   return tokenizer.batch_decode(input_ids)
    +   return tokenizer.batch_decode(input_ids, skip_special_tokens=True)
why is this needed? Do you see any issues?
Yes. When I test with Qwen3.6, in some settings the runs of pad tokens are very long and fill up the output token length.
What does this PR do?
Type of change: Bug fix
Fixes several issues with NVFP4 MSE calibration and export for fused MoE expert modules (_QuantFusedExperts — used by Qwen3.6, GLM-5.1, and other HF transformers 5.0+ models that store expert weights as 3-D nn.Parameters).
Bug 1 (per-expert MSE calibration skipped): The weight-quantizer discovery loop in mse_calibrate used the singular attribute name gate_up_proj_weight_quantizer to look up quantizers, but _QuantFusedExperts stores them in a plural nn.ModuleList named gate_up_proj_weight_quantizers. All 20,480 expert quantizers were silently skipped, resulting in "MSE weight calibration: 0it" and no MSE-optimized scales.
Fix: add a second pass that detects plural {param}_weight_quantizers ModuleLists and enqueues each per-expert quantizer with a (param_name, expert_idx) tuple; step 3 unpacks the tuple to extract the per-expert weight slice.
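A hedged sketch of what the step-3 unpacking could look like; the names follow the discovery code quoted earlier in this thread, and the real implementation may differ:

```python
import torch.nn as nn


def resolve_weight(parent_module: nn.Module, weight_name):
    """Hypothetical helper: return the weight tensor a queued quantizer applies to."""
    if isinstance(weight_name, tuple):
        # Fused-experts entry: (param_name, expert_idx) -> one expert's 2-D slice.
        param_name, expert_idx = weight_name
        return getattr(parent_module, param_name)[expert_idx]
    # Regular entry: a plain parameter name.
    return getattr(parent_module, weight_name)
```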
Bug 2 (FP8 per-block scale underflow): Per-block weight scales can silently underflow to 0 when cast to FP8 E4M3FN. The existing scale == 0 guard only catches exact float32 zeros; values in (0, 2^-9) pass through and become 0 after the FP8 cast. This affects both the dynamic recompute path (get_weights_scaling_factor) and the static calibrated path (get_weights_scaling_factor_from_quantizer).
Fix: clamp per-block scales to 2^-9 (smallest positive FP8 E4M3FN subnormal) before the FP8 cast in both paths.
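A tiny, standalone demonstration of the underflow (PyTorch with FP8 support assumed):

```python
import torch

# Small positive scales survive a float32 `scale == 0` check but collapse to 0 in FP8 E4M3FN.
scales = torch.tensor([1e-4, 5e-4, 2**-9])
print(scales.to(torch.float8_e4m3fn).float())  # 1e-4 and 5e-4 round to 0.0; only 2**-9 survives
print(scales.clamp(min=2**-9).to(torch.float8_e4m3fn).float())  # all three equal 2**-9 after the clamp
```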
Bug 3 (uncalibrated experts at export): Experts that receive no tokens during calibration have _amax = 0 or uninitialized values. The existing scalar fallback used 1e-4 which itself underflows to 0 in FP8 E4M3FN (1e-4 < 2^-9 ≈ 0.00195). Additionally, the per-block fallback tensor had shape (H*W, 1) instead of (H, W), causing a shape mismatch that silently bypassed the fallback and fell through to the bad scalar. Finally, a stale zero global_amax from an uncalibrated expert was not recomputed, causing division-by-zero in the FP8 scale formula.
Fix: reshape the per-block fallback correctly; raise the clamp floor to 2e-3; always recompute global_amax from the current (possibly patched) per-block _amax.
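A hedged sketch of the repaired fallback; tensor names and the per-block layout are illustrative, and 16 is NVFP4's default block size along the last weight dimension:

```python
import torch


def fallback_per_block_amax(weight_slice: torch.Tensor, block_size: int = 16):
    """Hypothetical repair for an uncalibrated expert: derive amax from the weights."""
    h, w = weight_slice.shape
    per_block = weight_slice.abs().reshape(h, w // block_size, block_size).amax(dim=-1)
    per_block = per_block.clamp(min=2e-3)  # floor safely above FP8 E4M3FN's 2**-9
    # Recompute the global scale from the patched per-block values so a stale zero
    # cannot cause a divide-by-zero in the FP8 scale formula.
    global_amax = per_block.max()
    return per_block, global_amax
```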
Additional fixes:
Usage
Testing
validated on Qwen3.6-35B-A3B (8× B200):
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Improvements