JIT: Fix arm64 shifted-register containment for long compares#129188
Conversation
In Lowering::IsContainableUnaryOrBinaryOp, the maximum shift amount
permitted for a contained shift (LSH/RSH/RSZ) was computed from the
parent node's type. For comparison parents (GT_EQ/GT_NE/GT_LT/etc. and
GT_TEST_EQ/GT_TEST_NE) the parent's type is TYP_INT regardless of
operand width, so any shift amount in 32..63 fed into a 64-bit
comparison would fail containment. As a result, we emitted
lsl x1, x1, dotnet#55
cmp x0, x1
instead of the optimized
cmp x0, x1, LSL dotnet#55
and similarly for ANDS-feeding patterns like (a & (b >> 38)) != 0.
Use the shift node's own type to derive maxShift; for binary parents
where parent and child types agree this is unchanged, and for compares
it gives the correct operand width (31 for int, 63 for long).
Update the existing FileCheck tests in Cmp.cs and And.cs that
documented the old buggy codegen, and add CmpLargeShift64BitSwap to
cover the swapped-operand long-compare path.
Fixes dotnet#111889
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@dhartglassMSFT PTAL Small handful of diffs. |
There was a problem hiding this comment.
Pull request overview
This PR updates ARM64 lowering containment logic so shifted-register forms can be used for 64-bit comparisons/tests when the shift amount is in the valid 0..63 range for 64-bit operands (rather than incorrectly limiting based on the comparison node’s TYP_INT result type). This improves codegen by enabling combined cmp/tst with shifted-register operands instead of materializing the shift into a temporary register first.
Changes:
- Fix
Lowering::IsContainableUnaryOrBinaryOpto compute the maximum allowed contained-shift amount from the shift node’s type (operand width) rather than the parent node’s type. - Update existing ARM64 FileCheck expectations in JIT instruction-combining tests to assert the optimized
cmp/tst ... (LSL/LSR) #immforms. - Add a new swap-operand long-compare test case (
CmpLargeShift64BitSwap) to cover the “shift initially on op1, then swapped to op2 for containment” path.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/coreclr/jit/lowerarmarch.cpp | Fixes shift containment max-shift computation to use the shift node’s operand width, enabling containment for long compares/tests. |
| src/tests/JIT/opt/InstructionCombining/Cmp.cs | Updates ARM64 FileCheck for long-compare large shift and adds swapped-operand coverage expecting contained-shift cmp. |
| src/tests/JIT/opt/InstructionCombining/And.cs | Updates ARM64 FileCheck for 64-bit large shift AND/test to expect contained-shift tst. |
|
diffs show there are some bad interactions here. Let me look deeper. |
TryLowerAndOrToCCMP previously required the CCMP-side relop's operands to be free of (non-const, non-memory) contained nodes and bailed otherwise. After dotnet#129188 made arm64 contain long-compare shifts (cmp x, x, ASR dotnet#63), patterns like `(a == (b >> 63)) | (a == (c >> 63))` now have contained shifts on both relops feeding the OR, and CCMP bails entirely. The cset/orr/cbnz fallback is strictly worse than letting CCMP form with one un-contained operand. ARM64 ccmp (register) only accepts a register or imm5 second operand (no shifted-register form); xarch APX ccmp likewise only re-contains immediates in ContainCheckConditionalCompare. The existing post- conversion ClearContained() calls already un-contain the CCMP-side operands, so allowing the conversion in this case just yields the intended (cmp shifted + uncontained-op + ccmp + branch) sequence. The leading cmp on the other side keeps its contained operand (cmp supports shifted-register on arm64 and memory on xarch). To preserve memory containment on the leading side, the new "clean operands" check excludes memory ops from being considered clean for the CCMP *side*: a memory load that would have to be un-contained for ccmp is better left on the leading cmp. Effect on SPMI (osx.arm64): - libraries.pmi: 93 contexts, 93 size improvements, 0 regressions, -880 bytes, 93 PerfScore improvements, 0 PerfScore regressions - coreclr_tests: 25,106 contexts, 25,056 size improvements, 17 size regressions (reg-alloc noise, +128 bytes), -102,948 bytes overall, 25,106 PerfScore improvements, 0 PerfScore regressions Without this change PR dotnet#129188's containment fix regresses coreclr_tests by +99,892 bytes due to widely-replicated TimeSpan-like overflow check patterns (`cmp x, x, ASR dotnet#63 ; ccmp x, x, z, ne`) falling back to cset/orr chains. With the change those same patterns become (asr ; cmp x, x, ASR dotnet#63 ; ccmp ; branch), one instruction better than the original code. Adds compareAnd2Chains.Eq_long_shifted_2_consume to lock in the sequence. Co-authored-by: Andy Ayers <andya@microsoft.com>
|
new diffs look good; no regressions. Improvements though are mostly in tests. |
|
Failure is unrelated, this change is arm64 only. |
In Lowering::IsContainableUnaryOrBinaryOp, the maximum shift amount permitted for a contained shift (LSH/RSH/RSZ) was computed from the parent node's type. For comparison parents (GT_EQ/GT_NE/GT_LT/etc. and GT_TEST_EQ/GT_TEST_NE) the parent's type is TYP_INT regardless of operand width, so any shift amount in 32..63 fed into a 64-bit comparison would fail containment. As a result, we emitted
instead of the optimized
and similarly for ANDS-feeding patterns like (a & (b >> 38)) != 0.
Use the shift node's own type to derive maxShift; for binary parents where parent and child types agree this is unchanged, and for compares it gives the correct operand width (31 for int, 63 for long).
Update the existing FileCheck tests in Cmp.cs and And.cs that documented the old buggy codegen, and add CmpLargeShift64BitSwap to cover the swapped-operand long-compare path.
Fixes #111889