
Fix Word Salad (Tensor Corruption) on ARM Devices for i2_s Quantization #551

Open
uno-km wants to merge 5 commits into microsoft:main from uno-km:main

Conversation


uno-km commented Apr 26, 2026

Hello,
The code has been updated to resolve the ARM-specific failures described below. I identified the root causes, applied the fixes, and ran final tests to confirm stability.
While Gemini was used for the initial draft, the final solution is based on my own technical validation and experience. For more details on how these ARM-specific issues were tackled, please check my blog post:
https://uno-kim.tistory.com/462
Best regards,

🐛 The Problem

When running models quantized to i2_s (1.58-bit) on non-Apple ARM devices (e.g., Android devices with Exynos/Snapdragon via Termux/PRoot), the model produces severe "Word Salad" (meaningless token generation).

🔍 Root Cause Analysis

The root cause was a memory layout mismatch between the x86 packing phase and the ARM NEON unpacking phase (a minimal sketch of the mismatch follows this list):

  1. QK_I2_S was hardcoded to 64 for __ARM_NEON in the macro definition, while the GGUF models packed on x86 strictly enforce QK=128.
  2. The packing logic in quantize_i2_s for NEON used a 16-stride jump, whereas AVX2 uses a 32-stride.
  3. The decoding kernels (ggml_vec_dot_i2_i8_s_*) had hardcoded loop unrolling that assumed a 64-block size, causing out-of-bounds padding reads and complete accumulator corruption.
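
For reference, a minimal sketch of this mismatch. The macro shape is simplified from the PR description; the real definitions live in the i2_s kernel sources.

#if defined(__ARM_NEON)
#define QK_I2_S 64      // old ARM value: 64 weights per block, 16-stride packing
#else
#define QK_I2_S 128     // x86 value: 128 weights per block, 32-stride packing
#endif

// A GGUF file packed on x86 stores n/128 blocks of 32 packed bytes each, but the
// old NEON path iterated n/64 blocks with a 16-stride, so the unpacker drifted
// off the real payload and read padding/garbage.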

🛠️ Changes Implemented

  • Macro Standardization: Enforced #define QK_I2_S 128 across all architectures.
  • Packing Sync: Updated quantize_i2_s for NEON to use the exact same 32-stride packing layout as AVX2.
  • Kernel Rewrite: Completely refactored the 1x1, 1xN, and Nx1 NEON kernels. Removed the legacy QK=64 loop unrolling and replaced it with a dynamic block loop (nb = n / 128); the loop shape is sketched after this list.
  • Bit Extraction: Aligned the 2-bit MSB-to-LSB extraction sequence to be bit-identical to the _mm256_srli_epi16 logic.
  • Safety Upgrade: Changed the final horizontal sum to vaddlvq_s32 (64-bit) to guard against 32-bit accumulator overflow during long multi-threaded prompt evaluations.
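
A minimal sketch of the new loop shape follows. It is illustrative only, not the PR's actual kernel: it assumes the 2-bit weights have already been expanded to int8 and omits the per-block scale handling, so only the dynamic block count and the 64-bit horizontal sum are shown. The non-DotProd fallback here uses widening multiplies purely as an illustration; the PR's own fallback path may differ.

#include <arm_neon.h>
#include <stdint.h>

// Sketch only: weights assumed pre-expanded to int8, scales omitted.
static int64_t dot_i2_i8_block_sketch(int n, const int8_t * w, const int8_t * y) {
    const int qk = 128;          // unified QK_I2_S
    const int nb = n / qk;       // dynamic block count replaces the fixed QK=64 unroll
    int64_t sum = 0;
    for (int b = 0; b < nb; ++b) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < qk; i += 16) {
            const int8x16_t vw = vld1q_s8(w + b*qk + i);
            const int8x16_t vy = vld1q_s8(y + b*qk + i);
#if defined(__ARM_FEATURE_DOTPROD)
            acc = vdotq_s32(acc, vw, vy);                       // sdot hardware path
#else
            const int16x8_t lo = vmull_s8(vget_low_s8(vw),  vget_low_s8(vy));
            const int16x8_t hi = vmull_s8(vget_high_s8(vw), vget_high_s8(vy));
            acc = vaddq_s32(acc, vaddq_s32(vpaddlq_s16(lo), vpaddlq_s16(hi)));
#endif
        }
        // vaddlvq_s32 reduces the four lanes into a 64-bit value, so the running
        // sum across many blocks cannot overflow 32 bits.
        sum += vaddlvq_s32(acc);
    }
    return sum;
}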

🧪 Testing

  • Device: Samsung Galaxy A35 (Exynos 1380, ARMv8.2-A).
  • Environment: Ubuntu PRoot via Termux.
  • Result: Successfully generates coherent text with full multi-threading (-t 8). The DotProd (__ARM_FEATURE_DOTPROD) hardware acceleration and FMA fallbacks both work flawlessly.

uno-km added 5 commits April 24, 2026 17:02
Changes:

  • Unified QK Standard: Strictly enforced QK_I2_S = 128 across NEON and scalar paths to match the standard GGUF packing layout.
  • Refactored Loop Logic: Removed the legacy group32_num and la_num chunks and replaced them with a clean, block-level loop to prevent pointer corruption.
  • NEON Optimization: Implemented a dual 16-byte chunk load strategy within the 32-byte weight block to maximize SIMD register utilization.
  • Mathematical Alignment:
      • Synchronized the bit-unpacking order (MSB to LSB) with the AVX2 reference (sketched after this commit message).
      • Implemented 32-stride interleaved memory fetching for activations (y).
      • Removed the redundant (-1) offset mapping to leverage zero-mean distribution properties, matching the high-performance AVX2 kernel behavior.

Result:

  • Completely resolved the word salad issue on Exynos/Snapdragon chips.
  • Validated logical consistency across AVX2, NEON, and pure C++ scalar fallback paths.
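
As referenced in the Mathematical Alignment item, a hedged sketch of the MSB-to-LSB field order and the dual 16-byte chunk loads per 32-byte weight block. The exact output interleaving and the sign/scale mapping are simplified assumptions here, not the verified kernel layout; only the field order (bits 7..6 first, bits 1..0 last) reflects the alignment described in this commit.

#include <arm_neon.h>
#include <stdint.h>

// Sketch only: unpacks one 32-byte block (128 x 2-bit codes) MSB-to-LSB,
// using two 16-byte chunk loads. The output interleaving is a simplification.
static void unpack_i2_block_sketch(const uint8_t * packed /* 32 bytes */,
                                   uint8_t * out          /* 128 codes, 0..3 */) {
    const uint8x16_t mask = vdupq_n_u8(0x03);
    for (int half = 0; half < 2; ++half) {
        const uint8x16_t v = vld1q_u8(packed + 16*half);
        vst1q_u8(out + 64*half +  0, vandq_u8(vshrq_n_u8(v, 6), mask));  // bits 7..6 first
        vst1q_u8(out + 64*half + 16, vandq_u8(vshrq_n_u8(v, 4), mask));  // bits 5..4
        vst1q_u8(out + 64*half + 32, vandq_u8(vshrq_n_u8(v, 2), mask));  // bits 3..2
        vst1q_u8(out + 64*half + 48, vandq_u8(v, mask));                 // bits 1..0 last
    }
}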
…g QK layout to 128

- Standardized `QK_I2_S` to 128 for `__ARM_NEON` to match the x86 GGUF packing standard.
- Fixed memory misalignment in `quantize_i2_s` by updating the packing stride to 32 (a packing sketch follows this commit message).
- Refactored `ggml_vec_dot_i2_i8_s` NEON kernels (1x1, 1xN, Nx1) to use a dynamic block-level loop (`nb = n / QK`) instead of hardcoded 64-stride loop unrolling.
- Aligned interleaved memory fetching (`vld1q_s8`) with the AVX2 logic.
- Upgraded accumulator horizontal sum to `vaddlvq_s32` (64-bit) to prevent potential 32-bit integer overflow in extended context scenarios.

Tested on Exynos 1380 (Android PRoot) with `-t 8`. Output generation is now 100% stable without word salad.
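
As referenced above, a scalar sketch of the packing side (`quantize_i2_s` itself also computes scales and walks the full tensor). Each group of 128 2-bit codes occupies exactly 32 bytes, four codes per byte written MSB-first so the kernels read them back in the same order; the grouping mirrors the unpacking sketch earlier and is an assumption rather than the verified on-disk layout.

#include <stdint.h>

// Sketch only: packs 128 raw 2-bit codes (values 0..3) into 32 bytes,
// MSB-first within each byte; the inverse of the unpacking sketch above.
static void pack_i2_block_sketch(const uint8_t * q /* 128 codes */,
                                 uint8_t * dst     /* 32 bytes  */) {
    for (int half = 0; half < 2; ++half) {
        const uint8_t * g = q + 64*half;
        for (int i = 0; i < 16; ++i) {
            dst[16*half + i] = (uint8_t)((g[i +  0] << 6)   // first code  -> bits 7..6
                                       | (g[i + 16] << 4)   // second code -> bits 5..4
                                       | (g[i + 32] << 2)   // third code  -> bits 3..2
                                       |  g[i + 48]);       // fourth code -> bits 1..0
        }
    }
}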

uno-km commented Apr 26, 2026

@microsoft-github-policy-service agree

@betovildoza

Hey,
We'll keep an eye on this PR. We experienced very similar "word salad" / tensor corruption issues on Oracle Cloud ARM Ampere A1 with the official i2_s model (see #468 and #470).
Looking forward to trying it once it's ready.


uno-km commented Apr 27, 2026

Hi @betovildoza,

Thank you for reviewing! After reading through Issue #468, I can confidently say that this PR is the exact antidote for Bug 3 (GGGG garbage) and Bug 4 (Word Salad) you mentioned.

I noticed you suspected act_sums or scale offsets for Bug 4. However, the true root cause sits much deeper, in the memory layout phase.
The x86 packing phase enforces QK=128 (32-stride), but the __ARM_NEON macros and kernel loop unrolling were strictly hardcoded to QK=64 (16-stride). Because of this mismatch, the ARM NEON kernels were unpacking out-of-bounds memory and reading complete garbage, which naturally led to the semantic incoherence (Word Salad) regardless of the decoding path.

By synchronizing the NEON packing logic to a 32-stride and replacing the hardcoded 64-block loop with a dynamic 128-block logic, both the GGGG issue and the Word Salad are completely resolved. Also, changing the horizontal sum to vaddlvq_s32 prevented the accumulator overflow.

It's running perfectly on my Exynos environment now. I'm highly confident this will bring your Oracle Ampere A1 back to life. Let me know if you need any help testing it on your end!


Jozeh commented May 3, 2026

Tested PR #551 on old x86_64 Ivy Bridge CPU

I tested PR #551 on an old x86_64 Ivy Bridge CPU and it appears to fix the repeated GGGG generation failure on my machine.

Hardware

  • Intel(R) Core(TM) i5-3335S CPU @ 2.70GHz
  • x86_64 / Ivy Bridge
  • AVX = 1
  • AVX2 = 0
  • FMA = 0
  • F16C = 1
  • SSE3 = 1
  • SSSE3 = 1
  • MATMUL_INT8 = 0

Runtime system_info reports:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Environment

  • Ubuntu 22.04.5
  • Clang 20.1.8
  • CMake 3.22
  • Python via Poetry

Model

  • microsoft/bitnet-b1.58-2B-4T-gguf
  • ggml-model-i2_s.gguf
  • Quant: i2_s

Command used on both current main and PR #551

./build/bin/llama-cli \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 128 \
  -t 4 \
  -p "Explain Microsoft's BitNet b1.58 1-bit large language model in one short paragraph." \
  -ngl 0 \
  -c 2048 \
  --temp 0.1 \
  -b 1

Result on current main

The model loads successfully but generation collapses into repeated G characters:

Explain Microsoft's BitNet b1.58 1-bit large language model in one short paragraph.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Result on PR #551

With the same model, same prompt, same flags, and same machine, PR #551 no longer collapses into repeated G output:

Explain Microsoft's BitNet b1.58 1-bit large language model in one short paragraph. Microsoft's BitNet b1-58 is a 1-bit large language model that uses a novel approach to generate text. It employs a unique method of generating text by using a single bit as a random seed for each output token. This approach allows the model to produce a diverse range of outputs, making it suitable for various applications. The model's simplicity and efficiency make it an attractive option for developers and researchers looking for a versatile and powerful language model.

Answer: Microsoft's BitNet b1-58 is a 1-bit large language model that uses a single bit as a random seed for each output token, enabling diverse text generation.

Notes

The generated answer is not factually perfect, but the important part is that the repeated GGGG corruption is gone on PR #551.

This suggests PR #551 may also help the x86 AVX1/no-AVX2 fallback path, not just ARM NEON.

My CPU is a useful repro case because it has AVX/F16C but no AVX2 or FMA:

Intel(R) Core(TM) i5-3335S CPU @ 2.70GHz
AVX = 1
AVX2 = 0
FMA = 0

Happy to test any follow-up patches on this Ivy Bridge machine.

Benchmark on PR #551

The official benchmark also runs successfully on the PR branch:

poetry run python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 64 \
  -p 128 \
  -t 4

Result:

| model                                | size     | params | backend | threads | n_batch | test  | t/s         |
| ------------------------------------ | -------- | ------ | ------- | ------- | ------- | ----- | ----------- |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU     | 4       | 1       | pp128 | 5.81 ± 0.08 |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU     | 4       | 1       | tg64  | 5.45 ± 0.31 |

So PR #551 not only avoids the repeated GGGG output on this AVX1/no-AVX2 CPU, but also completes the official benchmark path with n_batch = 1.
