extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5)#20161
Open
mergennachin wants to merge 3 commits into
Open
extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5)#20161mergennachin wants to merge 3 commits into
mergennachin wants to merge 3 commits into
Conversation
[ghstack-poisoned]
Contributor
Author
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20161
Note: Links to docs will display an error until the docs builds have been completed. ❌ 4 New Failures, 1 Cancelled Job, 2 Unclassified FailuresAs of commit a120263 with merge base eeb0646 ( NEW FAILURES - The following jobs have failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
CANCELLED JOB - The following job was cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This was referenced Jun 9, 2026
[ghstack-poisoned]
mergennachin
added a commit
that referenced
this pull request
Jun 9, 2026
…2b.1.5)
Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.
Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on `done` and accepts a `prompt_segments` form of
the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain `prompt` string); the
WorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids}) and, on the next request, rebuilds the prompt as segments -- each prior
assistant turn is replaced with a unique sentinel, the conversation is rendered
once, and the rendered text is split on the sentinels with the stored ids spliced
back in. Tool results stay text (they re-tokenize deterministically). This logic
is the OpenAI adapter's concern, not the runtime's: SessionRuntime only sees a
PromptInput (text or segments).
Splicing is guarded so stale ids are never injected: a turn is substituted only
when the incoming assistant message fingerprint-matches the response we returned
(an edited or branched history, or a session reused for another conversation ->
text fallback; splicing stops at the first divergence and the now-stale tail is
pruned, and a regenerated turn is recorded at its position so it replaces the
stale record instead of shadowing later hits), and only when its ids faithfully
decode to what the client saw -- a stop-string trim kept post-stop tokens
resident but dropped them from the output, so the worker omits the ids and the
turn is re-rendered as text. Sentinel collisions / dropped sentinels also fall
back to text, and the worker's exact-token prefix check backstops the rest.
The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.
On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); the single-turn AST
suite is unchanged (no prior assistant turn -> plain text prompt).
Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering, and the serving_chat wiring that builds the segments and
counts them for the context preflight); then tests and docs.
Part of #20001
ghstack-source-id: 8f15d6a
ghstack-comment-id: 4661784137
Pull-Request: #20161
[ghstack-poisoned]
mergennachin
added a commit
that referenced
this pull request
Jun 9, 2026
…2b.1.5)
Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.
Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on `done` and accepts a `prompt_segments` form of
the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain `prompt` string); the
WorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids, generation preamble}) and, on the next request, rebuilds the prompt as
segments -- each prior assistant turn is replaced with a unique sentinel, the
conversation is rendered once, and the rendered text is split on the sentinels
with the stored ids spliced back in. Tool results stay text (they re-tokenize
deterministically). This logic is the OpenAI adapter's concern, not the
runtime's: SessionRuntime only sees a PromptInput (text or segments).
The splice also reproduces the deterministic generation scaffold the worker
prefills into resident KV. Qwen3's template appends a scaffold after the
assistant header (no-think: `<think>\n\n</think>\n\n`; thinking: `<think>\n`),
then strips it when re-rendering an assistant turn that precedes the last user
message -- so without this, the resident state carried scaffold tokens the next
prompt lacked and ordinary multi-turn chat reset (only tool-call turns, whose
think the template preserves, ever hit exact_prefix). ChatTemplate.
generation_preamble derives that scaffold for the request's mode; it is recorded
per turn (so a mid-session enable_thinking switch still reproduces each turn's
resident scaffold), and the segment assembly normalizes the scaffold region
before each spliced run to the recorded preamble: it inserts the scaffold where
history stripped it and replaces it where history preserved a different form
(after the last user the template keeps the empty block, which a naive append
would double-insert), falling back to text on an unrecognized region. Ordinary
multi-turn chat now warm-resumes too. This is adapter-only -- no worker, runtime,
or protocol change.
Splicing is guarded so stale ids are never injected: a turn is substituted only
when the incoming assistant message fingerprint-matches the response we returned
(the fingerprint canonicalizes each tool call's JSON arguments before hashing, so
a client that reserializes them with different whitespace or key order -- the
same value -- still matches and resumes, rather than looking like an edited turn;
an edited or branched history, or a session reused for another conversation ->
text fallback; splicing stops at the first divergence and the now-stale tail is
pruned, and a regenerated turn is recorded at its position so it replaces the
stale record instead of shadowing later hits), and only when its ids faithfully
decode to what the client saw -- a stop-string trim kept post-stop tokens
resident but dropped them from the output, so the worker omits the ids and the
turn is re-rendered as text. Sentinel collisions / dropped sentinels also fall
back to text, and the worker's exact-token prefix check backstops the rest.
The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.
On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); with the scaffold
reproduction, ordinary multi-turn chat reaches exact_prefix on every append
turn rather than re-prefilling the whole prompt. The single-turn AST suite is
unchanged (no prior assistant turn -> plain text prompt).
Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering + per-turn generation-scaffold normalization, chat_template.
generation_preamble, and the serving_chat wiring that builds the segments,
threads the preamble, and counts segments for the context preflight); then tests
and docs.
This change was authored with Claude Code.
Part of #20001
ghstack-source-id: 861cb67
ghstack-comment-id: 4661784137
Pull-Request: #20161
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.
Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on
doneand accepts aprompt_segmentsform ofthe prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain
promptstring); theWorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids}) and, on the next request, rebuilds the prompt as segments -- each prior
assistant turn is replaced with a unique sentinel, the conversation is rendered
once, and the rendered text is split on the sentinels with the stored ids spliced
back in. Tool results stay text (they re-tokenize deterministically). This logic
is the OpenAI adapter's concern, not the runtime's: SessionRuntime only sees a
PromptInput (text or segments).
Splicing is guarded so stale ids are never injected: a turn is substituted only
when the incoming assistant message fingerprint-matches the response we returned
(an edited or branched history, or a session reused for another conversation ->
text fallback; splicing stops at the first divergence and the now-stale tail is
pruned, and a regenerated turn is recorded at its position so it replaces the
stale record instead of shadowing later hits), and only when its ids faithfully
decode to what the client saw -- a stop-string trim kept post-stop tokens
resident but dropped them from the output, so the worker omits the ids and the
turn is re-rendered as text. Sentinel collisions / dropped sentinels also fall
back to text, and the worker's exact-token prefix check backstops the rest.
The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.
On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); the single-turn AST
suite is unchanged (no prior assistant turn -> plain text prompt).
Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering, and the serving_chat wiring that builds the segments and
counts them for the context preflight); then tests and docs.
Part of #20001