🚀[0.3.37] Release Note: MoE CPU Offloading, N-Gram Speculative Decoding, Thread-Safe Abort & New LLM Wiki #121
JamePeng announced in Announcements
🚀 Release v0.3.37: MoE CPU Offloading, N-Gram Speculative Decoding, Thread-Safe Abort & New LLM Wiki
Hi everyone,
Here are the release notes for v0.3.37. In this update, we focused on improving memory routing for large MoE models, speeding up speculative decoding, and giving developers more precise control over the generation loop. I've also started laying the groundwork for a new documentation system.
Here are the key updates:
🧠 Fine-grained MoE CPU Offloading
Running massive Mixture of Experts (MoE) models on consumer hardware usually leads to VRAM bottlenecks. To solve this, I've exposed the underlying ggml buffer APIs to allow precise memory routing.
- New `cpu_moe=True` and `n_cpu_moe=N` arguments in `Llama.__init__` route MoE expert weights to CPU buffers, either fully or partially (see the usage sketch below).
- `n_gpu_layers`: you can now pass string literals like `"auto"` (equivalent to -1) and `"all"` (equivalent to -2).
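Here is a minimal usage sketch of these options. The model path and the commented `n_cpu_moe` value are placeholders, not recommendations:

```python
from llama_cpp import Llama

# Minimal sketch: the model path below is a placeholder, adjust for your setup.
llm = Llama(
    model_path="./models/my-moe-model.Q4_K_M.gguf",
    n_gpu_layers="all",   # string literals now accepted: "auto" == -1, "all" == -2
    cpu_moe=True,         # keep MoE expert weights in CPU buffers
    # n_cpu_moe=16,       # ...or route only part of the expert tensors to CPU
)

print(llm.create_completion("Hello", max_tokens=8)["choices"][0]["text"])
```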
🛑 Thread-Safe Abort & Assistant Prefill Integration
We now have a proper, thread-safe generation abort mechanism. Calling `Llama.abort()` from an external thread (like a UI button or a timeout handler) will safely interrupt the C++ generation loop, preserve the partially generated tokens, and return `finish_reason="abort"`.

A very practical use case is combining this with the `assistant_prefill` feature (and the recent `add_generation_prompt` updates to the `MTMDChatHandler`). For example, if a reasoning model gets stuck in an infinite `<think>` loop, you can:
1. Interrupt the generation with `abort()` via a timeout.
2. Append a wrap-up phrase to the preserved partial output (e.g. `"...Wait, I've thought long enough, let's answer.</think>"`).
3. Resend it with `assistant_prefill=True`. The model will seamlessly pick up from the injected text and output the final answer, without wasting compute context. A rough sketch of this abort-and-resume loop follows below.
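The sketch below shows the workflow under a couple of assumptions on my part: that `assistant_prefill` is passed to `create_chat_completion` and that the aborted response still carries the partial assistant content. Check the wiki for the exact signatures.

```python
import threading
from llama_cpp import Llama

llm = Llama(model_path="./models/my-reasoning-model.gguf")  # placeholder path
messages = [{"role": "user", "content": "Solve the riddle step by step."}]

# Abort the C++ generation loop from another thread after 30 seconds.
timer = threading.Timer(30.0, llm.abort)
timer.start()
result = llm.create_chat_completion(messages=messages, max_tokens=4096)
timer.cancel()

if result["choices"][0]["finish_reason"] == "abort":
    partial = result["choices"][0]["message"]["content"] or ""
    # Close the runaway <think> block with a wrap-up phrase...
    partial += "...Wait, I've thought long enough, let's answer.</think>"
    messages.append({"role": "assistant", "content": partial})
    # ...and let the model continue from the injected text instead of restarting.
    result = llm.create_chat_completion(messages=messages, assistant_prefill=True)

print(result["choices"][0]["message"]["content"])
```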
⚡ O(1) N-Gram Speculative Decoding
The legacy NumPy-based sliding-window approach to speculative decoding had too much CPU overhead for long contexts. I've replaced it with `LlamaNGramMapDecoding`, which uses a hashed inverted index and incremental updates to achieve O(1) time complexity for draft token generation. This is now the recommended default and is significantly faster.
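To make the idea concrete, here is a toy sketch of a hashed n-gram inverted index with incremental updates. It only illustrates the O(1) lookup pattern; the class and method names are mine and do not mirror the actual `LlamaNGramMapDecoding` API.

```python
from collections import Counter, defaultdict

class ToyNGramIndex:
    """Toy illustration of the inverted-index idea (not the library's API)."""

    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(Counter)  # n-gram key -> next-token counts
        self.history = []

    def push(self, token):
        # Incremental update: register the new token under the preceding
        # n-gram key, so no sliding-window rescan of the context is needed.
        if len(self.history) >= self.n:
            key = tuple(self.history[-self.n:])
            self.index[key][token] += 1
        self.history.append(token)

    def propose(self, max_draft=4):
        # Each drafted token costs one hash lookup, i.e. O(1) per token.
        draft, context = [], list(self.history)
        for _ in range(max_draft):
            key = tuple(context[-self.n:])
            if key not in self.index:
                break
            token = self.index[key].most_common(1)[0][0]
            draft.append(token)
            context.append(token)
        return draft
```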
📚 The New LLM-Wiki
I am building a new documentation system under `/docs/wiki/`. The goal is to move away from manually maintaining static docs and instead use a structured `SCHEMA.md` that lets AI agents automatically learn the codebase and keep the Markdown documentation up to date. It's still a work in progress, but the core pages (`Llama.md`, `LlamaCache.md`, etc.) have been established.
The GitHub Wiki is now online: https://github.com/JamePeng/llama-cpp-python/wiki
🛠 Other Improvements & Syncs
- Added `MCPTool`, `MCPToolCall`, `MCPListTools`, connector IDs, and approval filters to support remote server tool calling. Aligned the `finish_reason` and `service_tier` fields with the latest OpenAI specs.
- Updated `astral-sh/setup-uv` to v7 and `Jimver/cuda-toolkit@v0.2.35` (Node 24 runtime).

Thanks to everyone who submitted issues and feedback to help shape these features.
— JamePeng