Update fused quant broadcast logic (#20171)

Open
DrJessop wants to merge 1 commit into pytorch:main from DrJessop:export-D108065588

Conversation

DrJessop (Contributor) commented Jun 10, 2026

Summary:

Unifies QuantParamsStruct (sas_compiler's central quant-params abstraction) onto a single affine-quantization representation and drops the axis argument from every fused-quant op interface.

Core change (ops.py): scale/zero_point are now either a singleton (per-tensor, auto-expanded internally) or a full-rank tensor whose shape encodes the affine block layout: block_size[i] = tensor.shape[i] // scale.shape[i]. This one representation covers per-tensor, per-channel, per-group, and blockwise uniformly. quantize/dequantize delegate to torch.ops.torchao.(de)quantize_affine. The axis field is removed from QuantParamsStruct, all ~60 op-schema fields, _make_qp, and the _lib.define strings. is_per_tensor/is_per_channel/is_per_group and a new channel_axis() helper are now derived from scale shape (channel_axis() returns 0 if all dims are unary, the single non-unary dim if there's exactly one, else None).
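The shape rules above can be sketched in plain Python on shape tuples. This is an illustration only, not the code in ops.py: `expand_scale_shape` and `block_size` are hypothetical helper names, and treating any one-element scale as the per-tensor singleton is an assumption. `channel_axis` here follows the rule stated in the description.

```python
from typing import Optional, Sequence, Tuple


def expand_scale_shape(tensor_shape: Sequence[int], scale_shape: Sequence[int]) -> Tuple[int, ...]:
    """Return a full-rank scale shape; a one-element scale expands to all ones (assumed)."""
    numel = 1
    for dim in scale_shape:
        numel *= dim
    if numel == 1:
        return (1,) * len(tensor_shape)
    if len(scale_shape) != len(tensor_shape):
        raise ValueError("a non-singleton scale must have the same rank as the tensor")
    return tuple(scale_shape)


def block_size(tensor_shape: Sequence[int], scale_shape: Sequence[int]) -> Tuple[int, ...]:
    """block_size[i] = tensor.shape[i] // scale.shape[i], as in the PR description."""
    full = expand_scale_shape(tensor_shape, scale_shape)
    for t, s in zip(tensor_shape, full):
        if t % s != 0:
            raise ValueError(f"scale dim {s} does not evenly divide tensor dim {t}")
    return tuple(t // s for t, s in zip(tensor_shape, full))


def channel_axis(scale_shape: Sequence[int]) -> Optional[int]:
    """0 if all dims are unary, the single non-unary dim if exactly one, else None."""
    non_unary = [i for i, dim in enumerate(scale_shape) if dim != 1]
    if not non_unary:
        return 0
    if len(non_unary) == 1:
        return non_unary[0]
    return None


# A [64, 128] weight under the four layouts the description lists.
print(block_size((64, 128), ()))       # per-tensor: one block spanning the tensor
print(block_size((64, 128), (64, 1)))  # per-channel: one block per row
print(block_size((64, 128), (64, 4)))  # per-group: four groups per row
print(block_size((64, 128), (2, 4)))   # blockwise
```

For a per-group or blockwise scale such as (64, 4), `channel_axis` returns None because more than one dim is non-unary.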

Fusion (fusion_pass.py, fusion_passes/utils.py): the qparams flat block shrinks from a 6-tuple to a 5-tuple; the per-channel branch inserts an aten.view to make 1-D scales full-rank [1, …, C, …, 1] so their shape encodes the block layout.
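As a rough sketch of the shape that the inserted view targets, the helper below builds [1, …, C, …, 1] from a tensor rank and a channel axis. The name `full_rank_view_shape` is hypothetical, and it assumes the fusion pass still knows the channel axis from the original per-channel quantize/dequantize node at the point where it inserts the view.

```python
from typing import Tuple


def full_rank_view_shape(tensor_rank: int, axis: int, num_channels: int) -> Tuple[int, ...]:
    """Shape with num_channels at `axis` and 1 everywhere else."""
    if not -tensor_rank <= axis < tensor_rank:
        raise ValueError(f"axis {axis} is out of range for rank {tensor_rank}")
    axis %= tensor_rank  # normalize negative axes
    shape = [1] * tensor_rank
    shape[axis] = num_channels
    return tuple(shape)


# A 1-D [16] scale for a rank-2 linear weight quantized along dim 0.
print(full_rank_view_shape(2, 0, 16))
# A 1-D [8] scale for a rank-4 conv weight quantized along dim 0.
print(full_rank_view_shape(4, 0, 8))
```

Once the scale has this shape, the channel axis no longer needs to travel as a separate argument, since it can be read back from the single non-unary dim.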

Lowering boundary (graph_utils.py, lower_to_turing_linear.py, lower_to_turing_conv_no_nlu_params.py): Helios ParameterExtraction wants the compact scale form ([K] / [K, num_groups]), but the fused op now carries full-rank scales. New compact_scale_node() squeezes size-1 dims at the lowering boundary; the inserted view_copy folds via ConstantPropPass before extraction. Lowering asserts channel_axis() is not None.
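The squeeze at the lowering boundary amounts to dropping size-1 dims from the full-rank scale shape. The sketch below shows only that shape arithmetic; `compact_scale_shape` is a hypothetical name, and the real compact_scale_node() operates on graph nodes rather than tuples.

```python
from typing import Sequence, Tuple


def compact_scale_shape(full_rank_shape: Sequence[int]) -> Tuple[int, ...]:
    """Drop every size-1 dim, leaving the compact form."""
    return tuple(dim for dim in full_rank_shape if dim != 1)


print(compact_scale_shape((64, 1)))        # per-channel linear: [K, 1] to [K]
print(compact_scale_shape((32, 1, 1, 1)))  # per-channel conv: [K, 1, 1, 1] to [K]
print(compact_scale_shape((64, 4)))        # per-group: [K, num_groups] is unchanged
```

In this sketch an all-unary shape squeezes to an empty tuple, and a genuine size-1 channel dim would be dropped as well; how the real helper treats those cases is not stated in the description.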

Broadcast fix (fuse_mul_into_linear.py): channel_scale is an activation-space [K] vector (out-features is the trailing dim of the mul constant), whereas the weight scale is now full-rank [K, 1] (out-features at dim 0). Reshape the channel factor to [-1, 1, …] so it broadcasts along the weight scale's channel axis instead of producing a [K, K] outer product. The 1-D bias multiply is unchanged (bias is already [K]).
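The shape problem behind this fix can be reproduced with the standard right-aligned broadcasting rule. The sketch below works on shape tuples only; `broadcast_shapes` and `channel_factor_shape` are illustrative names written for this example, not functions from the PR.

```python
from itertools import zip_longest
from typing import Sequence, Tuple


def broadcast_shapes(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """Right-aligned broadcasting of two shapes, treating missing dims as 1."""
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {tuple(a)} and {tuple(b)} do not broadcast")
    return tuple(reversed(result))


def channel_factor_shape(k: int, scale_rank: int) -> Tuple[int, ...]:
    """Shape of the channel factor after reshaping to [-1, 1, ...]."""
    return (k,) + (1,) * (scale_rank - 1)


K = 8
weight_scale = (K, 1)  # full-rank weight scale, out-features at dim 0

# Multiplying by the raw [K] vector aligns K with the trailing dim.
print(broadcast_shapes((K,), weight_scale))  # (8, 8): the unwanted outer product

# After the reshape, K lines up with the weight scale's channel axis.
print(broadcast_shapes(channel_factor_shape(K, 2), weight_scale))  # (8, 1)
```

The bias case needs no reshape because both operands are already [K].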

Misc consumers: quant_absorption.py per-tensor check is now out_scale.numel() == 1; BUCK adds the torchao dep.

Reviewed By: ethansfng

Differential Revision: D108065588

pytorch-bot Bot commented Jun 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20171

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label Jun 10, 2026

meta-codesync Bot (Contributor) commented Jun 10, 2026

@DrJessop has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108065588.

@github-actions

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync Bot changed the title from "Update fused quant broadcast logic" to "Update fused quant broadcast logic (#20171)" Jun 10, 2026
DrJessop force-pushed the export-D108065588 branch from d474ec6 to f6c2241 June 10, 2026 17:10
DrJessop added a commit to DrJessop/executorch that referenced this pull request Jun 10, 2026
DrJessop added a commit to DrJessop/executorch that referenced this pull request Jun 10, 2026
DrJessop force-pushed the export-D108065588 branch from f6c2241 to d9034b7 June 10, 2026 17:10
DrJessop added a commit to DrJessop/executorch that referenced this pull request Jun 10, 2026
DrJessop force-pushed the export-D108065588 branch 2 times, most recently from 968c574 to 4674570 June 10, 2026 17:20

Labels

CLA Signed, meta-exported
