Update fused quant broadcast logic (#20171)
@DrJessop has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108065588.
Summary:
Unifies QuantParamsStruct (sas_compiler's central quant-params abstraction) onto a single affine-quantization representation and drops the axis argument from every fused-quant op interface.
Core change (ops.py): scale/zero_point are now either a singleton (per-tensor, auto-expanded internally) or a full-rank tensor whose shape encodes the affine block layout: block_size[i] = tensor.shape[i] // scale.shape[i]. This single representation covers per-tensor, per-channel, per-group, and blockwise quantization uniformly. quantize/dequantize delegate to torch.ops.torchao.quantize_affine and torch.ops.torchao.dequantize_affine. The axis field is removed from QuantParamsStruct, all ~60 op-schema fields, _make_qp, and the _lib.define strings. is_per_tensor/is_per_channel/is_per_group and a new channel_axis() helper are now derived from the scale shape: channel_axis() returns 0 if every dim has size 1, the single non-unit dim if there is exactly one, and None otherwise.
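The shape rules above can be sketched in plain Python. This is an illustration only, not the code in ops.py: the function names and signatures here are hypothetical, shapes are passed as plain tuples, and a per-tensor singleton is assumed to have already been expanded to a full-rank shape of all ones.

```python
from typing import Optional, Sequence, Tuple


def block_size(tensor_shape: Sequence[int], scale_shape: Sequence[int]) -> Tuple[int, ...]:
    """Derive the affine block size from the tensor and scale shapes.

    Implements block_size[i] = tensor.shape[i] // scale.shape[i].
    """
    if len(tensor_shape) != len(scale_shape):
        raise ValueError("scale must be full-rank (same number of dims as the tensor)")
    for t, s in zip(tensor_shape, scale_shape):
        if t % s != 0:
            raise ValueError(f"scale dim {s} does not evenly divide tensor dim {t}")
    return tuple(t // s for t, s in zip(tensor_shape, scale_shape))


def channel_axis(scale_shape: Sequence[int]) -> Optional[int]:
    """Infer the channel axis from a full-rank scale shape.

    Returns 0 if every dim has size 1, the single non-unit dim if there is
    exactly one, and None otherwise (per-group or blockwise layouts).
    """
    non_unit = [i for i, d in enumerate(scale_shape) if d != 1]
    if not non_unit:
        return 0
    if len(non_unit) == 1:
        return non_unit[0]
    return None
```

For an [8, 64] weight, a scale of shape [1, 1] gives one block covering the whole tensor (per-tensor), [8, 1] gives one block per row (per-channel), and [8, 4] gives four groups of 16 per row (per-group), for which channel_axis returns None.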
Fusion (fusion_pass.py, fusion_passes/utils.py): the flat qparams block shrinks from a 6-tuple to a 5-tuple; the per-channel branch inserts an aten.view that makes 1-D scales full-rank [1, …, C, …, 1], so their shape encodes the block layout.
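As a rough sketch of the target shape for that view, assuming the channel axis and tensor rank are known at fusion time (the helper name below is hypothetical and is not taken from fusion_pass.py):

```python
from typing import List


def full_rank_scale_shape(tensor_rank: int, axis: int, num_channels: int) -> List[int]:
    """Shape that a 1-D per-channel scale [C] is viewed as: [1, ..., C, ..., 1].

    All dims are 1 except `axis`, which keeps the C per-channel entries.
    """
    if not 0 <= axis < tensor_rank:
        raise ValueError(f"axis {axis} is out of range for rank {tensor_rank}")
    shape = [1] * tensor_rank
    shape[axis] = num_channels
    return shape
```

For example, a per-channel scale of 16 entries on axis 0 of a rank-4 conv weight would be viewed as [16, 1, 1, 1], and an 8-entry scale on axis 0 of a rank-2 linear weight as [8, 1].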
Lowering boundary (graph_utils.py, lower_to_turing_linear.py, lower_to_turing_conv_no_nlu_params.py): Helios ParameterExtraction wants the compact scale form ([K] / [K, num_groups]), but the fused op now carries full-rank scales. New compact_scale_node() squeezes size-1 dims at the lowering boundary; the inserted view_copy folds via ConstantPropPass before extraction. Lowering asserts channel_axis() is not None.
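The squeeze at the lowering boundary can be pictured as a pure shape transformation. The sketch below is not compact_scale_node() itself, which operates on graph nodes; the function name is hypothetical, and keeping a single element for an all-ones (per-tensor) shape is an assumption.

```python
from typing import List, Sequence


def compact_scale_shape(scale_shape: Sequence[int]) -> List[int]:
    """Drop size-1 dims from a full-rank scale shape.

    [K, 1] -> [K]; [K, num_groups] is already compact and is left unchanged.
    """
    compact = [d for d in scale_shape if d != 1]
    # Assumption: a per-tensor scale (all dims of size 1) stays a 1-element vector.
    return compact or [1]
```

Under these assumptions, [16, 1, 1, 1] compacts to [16] and [8, 4] is returned unchanged.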
Broadcast fix (fuse_mul_into_linear.py): channel_scale is an activation-space [K] vector (out-features is the trailing dim of the mul constant), whereas the weight scale is now full-rank [K, 1] (out-features at dim 0). Reshape the channel factor to [-1, 1, …] so it broadcasts along the weight scale's channel axis instead of producing a [K, K] outer product. The 1-D bias multiply is unchanged (bias is already [K]).
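The shape arithmetic behind this fix can be checked without tensors. The sketch below reimplements the standard right-aligned broadcasting rule used by PyTorch and NumPy; both helper names are hypothetical and are not taken from fuse_mul_into_linear.py.

```python
from itertools import zip_longest
from typing import Sequence, Tuple


def broadcast_shape(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """Result shape of broadcasting `a` against `b` (dims aligned from the right)."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {tuple(a)} and {tuple(b)} do not broadcast")
        out.append(max(x, y))
    return tuple(reversed(out))


def reshape_to_channel_axis(channel_scale_shape: Sequence[int], rank: int) -> Tuple[int, ...]:
    """Shape of a [K] channel factor after reshaping to [-1, 1, ...] for a rank-`rank` scale."""
    (k,) = channel_scale_shape
    return (k,) + (1,) * (rank - 1)


K = 4
weight_scale = (K, 1)   # full-rank weight scale, out-features at dim 0
channel_scale = (K,)    # activation-space vector, out-features at the trailing dim

# Unfixed multiply: [K, 1] * [K] broadcasts to a [K, K] outer product.
unfixed = broadcast_shape(weight_scale, channel_scale)

# Fixed multiply: reshape the factor to [K, 1] first, so the result stays [K, 1].
fixed = broadcast_shape(weight_scale, reshape_to_channel_axis(channel_scale, rank=2))
```

Here unfixed is (4, 4) and fixed is (4, 1), which matches the outer-product failure mode described above and the shape the reshape restores.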
Misc consumers: quant_absorption.py per-tensor check is now out_scale.numel() == 1; BUCK adds the torchao dep.
Reviewed By: ethansfng
Differential Revision: D108065588