
fix(_internals): use n_tokens0 offset when enabling last-token logits in add_sequence#2205

Open
Anai-Guo wants to merge 1 commit into abetlen:main from Anai-Guo:fix-add-sequence-logits-index

Conversation


@Anai-Guo Anai-Guo commented May 3, 2026

Problem

Fixes #2199.

llm.embed(["test", "test"]) produces wrong results (or fails downstream) for sequence indices ≥ 1, while [llm.embed(x) for x in ["test", "test"]] works.

Root cause

In llama_cpp/_internals.py LlamaBatch.add_sequence:

def add_sequence(self, batch: Sequence[int], seq_id: int, logits_all: bool):
    n_tokens = len(batch)
    n_tokens0 = self.batch.n_tokens   # offset of THIS sequence inside the running batch
    self.batch.n_tokens += n_tokens
    for i in range(n_tokens):
        j = n_tokens0 + i             # correctly offsets every per-token write
        self.batch.token[j] = batch[i]
        ...
        self.batch.logits[j] = logits_all
    self.batch.logits[n_tokens - 1] = True   # <-- BUG: missing n_tokens0

Every per-token write inside the loop correctly uses the offset j = n_tokens0 + i, but the trailing "force-enable last-token logits" line uses the local n_tokens - 1 instead of n_tokens0 + n_tokens - 1.

Effect with embed(["test", "test"]) (each tokenized to 3 tokens, no pooling):

| call | n_tokens0 | last-token index of THIS seq | what the buggy line writes |
| --- | --- | --- | --- |
| 1st (seq_id=0) | 0 | 2 | logits[2] = True |
| 2nd (seq_id=1) | 3 | 5 | logits[2] = True ✗ (already set; the real last token, index 5, is left at logits_all, which is False for pooled embeddings) |

For the second sequence, the last token's logits flag is never enabled, so llama_get_embeddings_seq(ctx, 1) returns garbage / unset data depending on the pooling type. The first sequence's embedding looks fine, which is exactly the asymmetric "first works, batched fails" behavior reported in the issue.

The misdirected write also lands on seq 0's last-token slot: the two indices only coincide for trivial equal-length inputs like this one, and any batch whose sequences differ in length exposes the bug immediately.
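
To make the effect concrete, here is a minimal standalone sketch (plain Python lists instead of the llama.cpp batch struct; the helper name is illustrative) that mimics the buggy flag writes for the two 3-token sequences above:

def add_sequence_flags_buggy(flags, n_tokens0, n_tokens, logits_all):
    # per-token writes land at offset n_tokens0 + i (modelled here by append)
    for _ in range(n_tokens):
        flags.append(logits_all)
    # buggy trailing write: local index, missing the n_tokens0 offset
    flags[n_tokens - 1] = True
    return n_tokens0 + n_tokens

flags = []
off = add_sequence_flags_buggy(flags, 0, 3, False)    # seq 0
off = add_sequence_flags_buggy(flags, off, 3, False)  # seq 1
print(flags)  # [False, False, True, False, False, False]; seq 1's last token is never flagged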

Fix

Use the same offset n_tokens0 that the per-token writes already use:

self.batch.logits[n_tokens0 + n_tokens - 1] = True

This is a one-line change in LlamaBatch.add_sequence. The sibling method LlamaBatch.set_batch is unaffected because it always starts from index 0 (self.batch.n_tokens = n_tokens, not +=).
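
With the offset applied, the same standalone sketch (again a model of the flag logic, not the library code itself) yields one enabled flag per sequence, at each sequence's real last token:

def add_sequence_flags_fixed(flags, n_tokens0, n_tokens, logits_all):
    for _ in range(n_tokens):
        flags.append(logits_all)
    # fixed trailing write: offset by n_tokens0, consistent with the per-token writes
    flags[n_tokens0 + n_tokens - 1] = True
    return n_tokens0 + n_tokens

flags = []
off = add_sequence_flags_fixed(flags, 0, 3, False)    # seq 0
off = add_sequence_flags_fixed(flags, off, 3, False)  # seq 1
assert flags == [False, False, True, False, False, True]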

Verification

After the fix, the failing reproduction from the issue:

llm = Llama(model_path="all-MiniLM-L6-v2-Q8_0.gguf", embedding=True)
llm.embed(["test", "test"])

returns two embeddings whose values match the per-string call:

[llm.embed(x) for x in ["test", "test"]]
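
A quick numeric comparison (a hedged sketch; the model path comes from the issue reproduction and the tolerance is an arbitrary choice, not part of this PR) can confirm that the two call styles agree after the fix:

import numpy as np
from llama_cpp import Llama

llm = Llama(model_path="all-MiniLM-L6-v2-Q8_0.gguf", embedding=True)

batched = llm.embed(["test", "test"])
single = [llm.embed(x) for x in ["test", "test"]]

for b, s in zip(batched, single):
    # tolerance is illustrative; the values should be essentially identical
    assert np.allclose(np.asarray(b), np.asarray(s), atol=1e-5)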

🤖 Generated with Claude Code

