fix(_internals): use n_tokens0 offset when enabling last-token logits in add_sequence #2205
Open

Anai-Guo wants to merge 1 commit into abetlen:main from
Problem
Fixes #2199.
llm.embed(["test", "test"])produces wrong results (or fails downstream) for sequence indices ≥ 1, while[llm.embed(x) for x in ["test", "test"]]works.Root cause
In `llama_cpp/_internals.py`, `LlamaBatch.add_sequence`: every per-token write inside the loop correctly uses the offset `j = n_tokens0 + i`, but the trailing "force-enable last-token logits" line uses the local `n_tokens - 1` instead of `n_tokens0 + n_tokens - 1` (sketched below).

Effect with `embed(["test", "test"])` (each string tokenized to 3 tokens, no pooling):

- First sequence: `logits[2] = True` ✓
- Second sequence: `logits[2] = True` ✗ — already set; the real last token (index 5) is left at `logits_all` (False for pooled embeddings)

For the second sequence, the last token's `logits` flag is never enabled, so `llama_get_embeddings_seq(ctx, 1)` returns garbage / unset data depending on the pooling type. The first sequence's embedding looks fine, which is exactly the asymmetric "first works, batched fails" behavior reported in the issue. The stale write also overlaps with seq 0's last-token flag, so even when both flags happen to coincide for trivial inputs, any non-trivial batch where sequence sizes differ exposes the bug immediately.
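For context, a sketch of the method as described above; this is paraphrased from `_internals.py`, so field names and surrounding details may differ slightly:

```python
from typing import Sequence

class LlamaBatch:
    # ctypes-backed wrapper around llama_batch (fields paraphrased).

    def add_sequence(self, batch: Sequence[int], seq_id: int, logits_all: bool):
        n_tokens = len(batch)
        n_tokens0 = self.batch.n_tokens  # tokens already queued by earlier calls
        self.batch.n_tokens += n_tokens
        for i in range(n_tokens):
            j = n_tokens0 + i  # every per-token write uses the offset index
            self.batch.token[j] = batch[i]
            self.batch.pos[j] = i
            self.batch.seq_id[j][0] = seq_id
            self.batch.n_seq_id[j] = 1
            self.batch.logits[j] = logits_all
        # BUG: local index, not the n_tokens0 offset; for the second and later
        # sequences this re-flags an earlier sequence's token instead of this
        # sequence's last token.
        self.batch.logits[n_tokens - 1] = True
```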
Fix
Use the same offset `n_tokens0` that the per-token writes already use:
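A sketch of the change, reconstructed from the description above (surrounding code and indentation may differ):

```diff
-        self.batch.logits[n_tokens - 1] = True
+        self.batch.logits[n_tokens0 + n_tokens - 1] = True
```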
This is a one-line change in `LlamaBatch.add_sequence`. The sibling method `LlamaBatch.set_batch` is unaffected because it always starts from index 0 (`self.batch.n_tokens = n_tokens`, not `+=`).

Verification
After the fix, the failing reproduction from the issue:
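A minimal sketch of that reproduction, assuming an embedding-capable model; the model path and `verbose` flag here are placeholders, and the exact snippet in #2199 may differ:

```python
from llama_cpp import Llama

# Placeholder path: any embedding-capable GGUF model reproduces the issue.
llm = Llama(model_path="./model.gguf", embedding=True, verbose=False)

# Batched call: previously wrong for sequence indices >= 1.
batched = llm.embed(["test", "test"])

# Per-string calls: always produced correct embeddings.
per_string = [llm.embed(x) for x in ["test", "test"]]
```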
returns two embeddings whose values match the per-string call:
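One way to check that, assuming a pooling type that yields one vector per input string (`numpy` is used here only for the comparison):

```python
import numpy as np

# Both call styles should now agree element-wise for every sequence index.
for b, s in zip(batched, per_string):
    assert np.allclose(np.asarray(b), np.asarray(s))
```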
🤖 Generated with Claude Code