
fix(_internals): use n_tokens0 offset when enabling last-token logits in add_sequence#2205

Open
Anai-Guo wants to merge 1 commit into abetlen:main from Anai-Guo:fix-add-sequence-logits-index

Conversation


@Anai-Guo Anai-Guo commented May 3, 2026

Problem

Fixes #2199.

llm.embed(["test", "test"]) produces wrong results (or fails downstream) for sequence indices ≥ 1, while [llm.embed(x) for x in ["test", "test"]] works.

Root cause

In llama_cpp/_internals.py LlamaBatch.add_sequence:

def add_sequence(self, batch: Sequence[int], seq_id: int, logits_all: bool):
    n_tokens = len(batch)
    n_tokens0 = self.batch.n_tokens   # offset of THIS sequence inside the running batch
    self.batch.n_tokens += n_tokens
    for i in range(n_tokens):
        j = n_tokens0 + i             # correctly offsets every per-token write
        self.batch.token[j] = batch[i]
        ...
        self.batch.logits[j] = logits_all
    self.batch.logits[n_tokens - 1] = True   # <-- BUG: missing n_tokens0

Every per-token write inside the loop correctly uses the offset j = n_tokens0 + i, but the trailing "force-enable last-token logits" line uses the local n_tokens - 1 instead of n_tokens0 + n_tokens - 1.

Effect with embed(["test", "test"]) (each tokenized to 3 tokens, no pooling):

| call | n_tokens0 | last-token index of THIS seq | what the buggy line writes |
| --- | --- | --- | --- |
| 1st (seq_id=0) | 0 | 2 | logits[2] = True |
| 2nd (seq_id=1) | 3 | 5 | logits[2] = True ✗ (already set; the real last token, index 5, is left at logits_all, which is False for pooled embeddings) |

For the second sequence, the last token's logits flag is never enabled, so llama_get_embeddings_seq(ctx, 1) returns garbage / unset data depending on the pooling type. The first sequence's embedding looks fine, which is exactly the asymmetric "first works, batched fails" behavior reported in the issue.

The misdirected write also lands on seq 0's last-token slot: the two indices only coincide for trivial equal-length inputs like this one, and any batch whose sequences differ in length exposes the bug immediately.
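
To make the effect concrete, here is a minimal standalone sketch (plain Python lists instead of the llama.cpp batch struct; the helper name is illustrative) that mimics the buggy flag writes for the two 3-token sequences above:

def add_sequence_flags_buggy(flags, n_tokens0, n_tokens, logits_all):
    # per-token writes land at offset n_tokens0 + i (modelled here by append)
    for _ in range(n_tokens):
        flags.append(logits_all)
    # buggy trailing write: local index, missing the n_tokens0 offset
    flags[n_tokens - 1] = True
    return n_tokens0 + n_tokens

flags = []
off = add_sequence_flags_buggy(flags, 0, 3, False)    # seq 0
off = add_sequence_flags_buggy(flags, off, 3, False)  # seq 1
print(flags)  # [False, False, True, False, False, False]; seq 1's last token is never flagged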

Fix

Use the same offset n_tokens0 that the per-token writes already use:

self.batch.logits[n_tokens0 + n_tokens - 1] = True

This is a one-line change in LlamaBatch.add_sequence. The sibling method LlamaBatch.set_batch is unaffected because it always starts from index 0 (self.batch.n_tokens = n_tokens, not +=).
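
With the offset applied, the same standalone sketch (again a model of the flag logic, not the library code itself) yields one enabled flag per sequence, at each sequence's real last token:

def add_sequence_flags_fixed(flags, n_tokens0, n_tokens, logits_all):
    for _ in range(n_tokens):
        flags.append(logits_all)
    # fixed trailing write: offset by n_tokens0, consistent with the per-token writes
    flags[n_tokens0 + n_tokens - 1] = True
    return n_tokens0 + n_tokens

flags = []
off = add_sequence_flags_fixed(flags, 0, 3, False)    # seq 0
off = add_sequence_flags_fixed(flags, off, 3, False)  # seq 1
assert flags == [False, False, True, False, False, True]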

Verification

After the fix, the failing reproduction from the issue:

llm = Llama(model_path="all-MiniLM-L6-v2-Q8_0.gguf", embedding=True)
llm.embed(["test", "test"])

returns two embeddings whose values match the per-string call:

[llm.embed(x) for x in ["test", "test"]]
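
A quick numeric comparison (a hedged sketch; the model path comes from the issue reproduction and the tolerance is an arbitrary choice, not part of this PR) can confirm that the two call styles agree after the fix:

import numpy as np
from llama_cpp import Llama

llm = Llama(model_path="all-MiniLM-L6-v2-Q8_0.gguf", embedding=True)

batched = llm.embed(["test", "test"])
single = [llm.embed(x) for x in ["test", "test"]]

for b, s in zip(batched, single):
    # tolerance is illustrative; the values should be essentially identical
    assert np.allclose(np.asarray(b), np.asarray(s), atol=1e-5)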

🤖 Generated with Claude Code

