🚀[0.3.37] Release Note: MoE CPU Offloading, N-Gram Speculative Decoding, Thread-Safe Abort & New LLM Wiki #121
JamePeng announced in Announcements
🚀 Release v0.3.37: MoE CPU Offloading, N-Gram Speculative Decoding, Thread-Safe Abort & New LLM Wiki
Hi everyone,
Here are the release notes for v0.3.37. In this update, we focused on improving memory routing for large MoE models, speeding up speculative decoding, and giving developers more precise control over the generation loop. I've also started laying the groundwork for a new documentation system.
Here are the key updates:
🧠 Fine-grained MoE CPU Offloading
Running massive Mixture of Experts (MoE) models on consumer hardware usually leads to VRAM bottlenecks. To solve this, I've exposed the underlying ggml buffer APIs to allow precise memory routing.
- New `cpu_moe=True` and `n_cpu_moe=N` arguments in `Llama.__init__` route MoE expert weights to CPU buffers, either fully or partially (see the usage sketch below).
- `n_gpu_layers`: you can now pass string literals like `"auto"` (equivalent to -1) and `"all"` (equivalent to -2).
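Here is a minimal usage sketch of these options. The model path and the commented `n_cpu_moe` value are placeholders, not recommendations:

```python
from llama_cpp import Llama

# Minimal sketch: the model path below is a placeholder, adjust for your setup.
llm = Llama(
    model_path="./models/my-moe-model.Q4_K_M.gguf",
    n_gpu_layers="all",   # string literals now accepted: "auto" == -1, "all" == -2
    cpu_moe=True,         # keep MoE expert weights in CPU buffers
    # n_cpu_moe=16,       # ...or route only part of the expert tensors to CPU
)

print(llm.create_completion("Hello", max_tokens=8)["choices"][0]["text"])
```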
🛑 Thread-Safe Abort & Assistant Prefill Integration
We now have a proper, thread-safe generation abort mechanism. Calling `Llama.abort()` from an external thread (like a UI button or a timeout handler) will safely interrupt the C++ generation loop, preserve the partially generated tokens, and return `finish_reason="abort"`.

A very practical use case is combining this with the `assistant_prefill` feature (and the recent `add_generation_prompt` updates to the `MTMDChatHandler`). For example, if a reasoning model gets stuck in an infinite `<think>` loop, you can:
1. Interrupt the generation with `abort()` via a timeout.
2. Append a wrap-up phrase to the preserved partial output (e.g. `"...Wait, I've thought long enough, let's answer.</think>"`).
3. Resend it with `assistant_prefill=True`. The model will seamlessly pick up from the injected text and output the final answer, without wasting compute context. A rough sketch of this abort-and-resume loop follows below.
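The sketch below shows the workflow under a couple of assumptions on my part: that `assistant_prefill` is passed to `create_chat_completion` and that the aborted response still carries the partial assistant content. Check the wiki for the exact signatures.

```python
import threading
from llama_cpp import Llama

llm = Llama(model_path="./models/my-reasoning-model.gguf")  # placeholder path
messages = [{"role": "user", "content": "Solve the riddle step by step."}]

# Abort the C++ generation loop from another thread after 30 seconds.
timer = threading.Timer(30.0, llm.abort)
timer.start()
result = llm.create_chat_completion(messages=messages, max_tokens=4096)
timer.cancel()

if result["choices"][0]["finish_reason"] == "abort":
    partial = result["choices"][0]["message"]["content"] or ""
    # Close the runaway <think> block with a wrap-up phrase...
    partial += "...Wait, I've thought long enough, let's answer.</think>"
    messages.append({"role": "assistant", "content": partial})
    # ...and let the model continue from the injected text instead of restarting.
    result = llm.create_chat_completion(messages=messages, assistant_prefill=True)

print(result["choices"][0]["message"]["content"])
```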
⚡ O(1) N-Gram Speculative Decoding
The legacy NumPy-based sliding-window approach to speculative decoding had too much CPU overhead for long contexts. I've replaced it with `LlamaNGramMapDecoding`, which uses a hashed inverted index and incremental updates to achieve O(1) time complexity for draft token generation. This is now the recommended default and is significantly faster.
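To make the idea concrete, here is a toy sketch of a hashed n-gram inverted index with incremental updates. It only illustrates the O(1) lookup pattern; the class and method names are mine and do not mirror the actual `LlamaNGramMapDecoding` API.

```python
from collections import Counter, defaultdict

class ToyNGramIndex:
    """Toy illustration of the inverted-index idea (not the library's API)."""

    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(Counter)  # n-gram key -> next-token counts
        self.history = []

    def push(self, token):
        # Incremental update: register the new token under the preceding
        # n-gram key, so no sliding-window rescan of the context is needed.
        if len(self.history) >= self.n:
            key = tuple(self.history[-self.n:])
            self.index[key][token] += 1
        self.history.append(token)

    def propose(self, max_draft=4):
        # Each drafted token costs one hash lookup, i.e. O(1) per token.
        draft, context = [], list(self.history)
        for _ in range(max_draft):
            key = tuple(context[-self.n:])
            if key not in self.index:
                break
            token = self.index[key].most_common(1)[0][0]
            draft.append(token)
            context.append(token)
        return draft
```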
📚 The New LLM-Wiki
I am building a new documentation system under `/docs/wiki/`. The goal is to move away from manually maintaining static docs and instead use a structured `SCHEMA.md` that lets AI agents automatically learn the codebase and keep the Markdown documentation up to date. It's still a work in progress, but the core pages (`Llama.md`, `LlamaCache.md`, etc.) have been established.
The GitHub Wiki is now online: https://github.com/JamePeng/llama-cpp-python/wiki
🛠 Other Improvements & Syncs
- Added `MCPTool`, `MCPToolCall`, `MCPListTools`, connector IDs, and approval filters to support remote server tool calling. Aligned the `finish_reason` and `service_tier` fields with the latest OpenAI specs.
- Updated `astral-sh/setup-uv` to v7 and `Jimver/cuda-toolkit@v0.2.35` (Node 24 runtime).

Thanks to everyone who submitted issues and feedback to help shape these features.
— JamePeng