extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5) by mergennachin · Pull Request #20161 · pytorch/executorch

mergennachin · 2026-06-09T16:23:19Z

Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.

Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on done and accepts a prompt_segments form of
the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain prompt string); the
WorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids}) and, on the next request, rebuilds the prompt as segments -- each prior
assistant turn is replaced with a unique sentinel, the conversation is rendered
once, and the rendered text is split on the sentinels with the stored ids spliced
back in. Tool results stay text (they re-tokenize deterministically). This logic
is the OpenAI adapter's concern, not the runtime's: SessionRuntime only sees a
PromptInput (text or segments).
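
The split-and-splice step above can be sketched as follows. This is a minimal illustration, not the code in openai_transcript.py: the function name `split_on_sentinels`, the sentinel strings, and the `(sentinel, ids)` pairing are all assumptions made for the example. Only the `{"text"}` / `{"ids"}` segment shape comes from the description.

```python
from typing import Optional

# A segment is either {"text": str} or {"ids": list[int]}, as described above.
Segment = dict


def split_on_sentinels(
    rendered: str, turns: list[tuple[str, list[int]]]
) -> Optional[list[Segment]]:
    """Split rendered text on per-turn sentinels and splice stored ids in.

    `turns` holds (sentinel, stored ids) pairs in conversation order.
    Returns None to signal "fall back to plain text" when a sentinel is
    missing, duplicated, or out of order.
    """
    segments: list[Segment] = []
    cursor = 0
    for sentinel, ids in turns:
        # A collision (count > 1) or a dropped sentinel (count == 0) is unsafe.
        if rendered.count(sentinel) != 1:
            return None
        pos = rendered.find(sentinel, cursor)
        if pos < 0:
            return None  # sentinel appears before the cursor: out of order
        if pos > cursor:
            segments.append({"text": rendered[cursor:pos]})
        segments.append({"ids": list(ids)})
        cursor = pos + len(sentinel)
    if cursor < len(rendered):
        segments.append({"text": rendered[cursor:]})
    return segments


# Hypothetical rendered conversation with two prior assistant turns.
rendered = "<user>hi</user>@@S0@@<tool>42</tool>@@S1@@<assistant>"
segments = split_on_sentinels(rendered, [("@@S0@@", [11, 12]), ("@@S1@@", [13])])
```

Returning `None` rather than raising keeps the caller's fallback path simple: any unexpected sentinel layout just means the plain `prompt` string is sent.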

Splicing is guarded so stale ids are never injected. A turn is substituted only
when two conditions hold. First, the incoming assistant message must
fingerprint-match the response we returned; an edited or branched history, or a
session reused for another conversation, falls back to text. Splicing stops at
the first divergence and the now-stale tail is pruned, and a regenerated turn is
recorded at its position so it replaces the stale record instead of shadowing
later hits. Second, the stored ids must faithfully decode to what the client
saw: when a stop-string trim keeps post-stop tokens resident but drops them from
the output, the worker omits the ids and the turn is re-rendered as text.
Sentinel collisions and dropped sentinels also fall back to text, and the
worker's exact-token prefix check backstops the rest.
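
The guard can be pictured with a small store like the one below. It is a sketch under assumptions: `TurnRecord`, `TranscriptStore`, `reusable_ids`, and `record` are hypothetical names, and the real OpenAITranscriptState may be organized differently. It shows only the three behaviors named above: match by fingerprint, stop and prune at the first divergence, and overwrite a regenerated turn at its position.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnRecord:
    fingerprint: str
    ids: Optional[list[int]]  # None when the worker omitted the ids


@dataclass
class TranscriptStore:
    records: list[TurnRecord] = field(default_factory=list)

    def reusable_ids(self, incoming: list[str]) -> list[Optional[list[int]]]:
        """Return stored ids per incoming assistant turn, or None for text.

        Matching stops at the first fingerprint mismatch; the stale tail of
        the store is pruned and every later turn falls back to text.
        """
        out: list[Optional[list[int]]] = []
        for i, fp in enumerate(incoming):
            if i < len(self.records) and self.records[i].fingerprint == fp:
                out.append(self.records[i].ids)
            else:
                del self.records[i:]  # prune the now-stale tail
                out.extend([None] * (len(incoming) - i))
                break
        return out

    def record(self, position: int, fingerprint: str, ids: Optional[list[int]]) -> None:
        """Record a new turn at `position`, replacing any stale record there."""
        del self.records[position:]
        # Pad with never-matching placeholders if earlier turns are unknown.
        while len(self.records) < position:
            self.records.append(TurnRecord("", None))
        self.records.append(TurnRecord(fingerprint, ids))


store = TranscriptStore(
    [TurnRecord("a", [1, 2]), TurnRecord("b", [3]), TurnRecord("c", [4])]
)
# The client edited the second assistant turn, so only the first is reused.
plan = store.reusable_ids(["a", "x", "c"])
store.record(1, "x2", [9])
```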

The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.
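
A minimal sketch of that counting rule, assuming a `tokenize` callable that maps text to a list of token ids. The helper name `count_prompt_tokens` and the whitespace tokenizer used in the demonstration are stand-ins for illustration, not the server's actual tokenizer or API.

```python
from typing import Callable


def count_prompt_tokens(
    segments: list[dict], tokenize: Callable[[str], list[int]]
) -> int:
    """Count tokens the way the worker would assemble a segment prompt.

    Literal {"ids"} runs contribute their length directly; {"text"} chunks
    are tokenized one chunk at a time instead of tokenizing the joined string.
    """
    total = 0
    for seg in segments:
        if "ids" in seg:
            total += len(seg["ids"])
        else:
            total += len(tokenize(seg["text"]))
    return total


def toy_tokenize(text: str) -> list[int]:
    # Stand-in tokenizer for the example: one token per whitespace-split word.
    return list(range(len(text.split())))


segments = [{"text": "system prompt here"}, {"ids": [5, 6]}, {"text": "next"}]
total = count_prompt_tokens(segments, toy_tokenize)
```

Counting per chunk matters because the spliced ids need not equal what tokenizing their decoded text would produce, which is the same mismatch that motivated segments in the first place.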

On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); the single-turn AST
suite is unchanged (no prior assistant turn -> plain text prompt).

Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering, and the serving_chat wiring that builds the segments and
counts them for the context preflight); then tests and docs.

Part of #20001

[ghstack-poisoned]

mergennachin · 2026-06-09T16:23:20Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-09T16:23:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20161

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Cancelled Job, 2 Unclassified Failures

As of commit a120263 with merge base eeb0646:

NEW FAILURES - The following jobs have failed:

pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 3fd5c072632a7ed640f8068a1bdd6c4452ced799e5534b55d618b40bf9e31508 /exec failed with exit code 1
pull / unittest / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 1ce5288a847fb4f855ccc5df7dec145f534a812b70dd48af67cff3d19d2af347 /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Windows Wheels / pytorch/executorch / build-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Process completed with exit code 1.
Build Windows Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_x64

CANCELLED JOB - The following job was cancelled. Please retry:

Test Metal Backend / test-model-metal-e2e (nvidia, parakeet-tdt, quantized-int4-metal) / macos-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]


[ghstack-poisoned]

…2b.1.5)

Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.

Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on `done` and accepts a `prompt_segments` form of
the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain `prompt` string); the
WorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids, generation preamble}) and, on the next request, rebuilds the prompt as
segments -- each prior assistant turn is replaced with a unique sentinel, the
conversation is rendered once, and the rendered text is split on the sentinels
with the stored ids spliced back in. Tool results stay text (they re-tokenize
deterministically). This logic is the OpenAI adapter's concern, not the
runtime's: SessionRuntime only sees a PromptInput (text or segments).

The splice also reproduces the deterministic generation scaffold the worker
prefills into resident KV. Qwen3's template appends a scaffold after the
assistant header (no-think: `<think>\n\n</think>\n\n`; thinking: `<think>\n`),
then strips it when re-rendering an assistant turn that precedes the last user
message -- so without this, the resident state carried scaffold tokens the next
prompt lacked and ordinary multi-turn chat reset (only tool-call turns, whose
think the template preserves, ever hit exact_prefix).
ChatTemplate.generation_preamble derives that scaffold for the request's mode;
it is recorded per turn (so a mid-session enable_thinking switch still
reproduces each turn's resident scaffold), and the segment assembly normalizes
the scaffold region before each spliced run to the recorded preamble: it inserts
the scaffold where history stripped it and replaces it where history preserved a
different form (after the last user the template keeps the empty block, which a
naive append would double-insert), falling back to text on an unrecognized
region. Ordinary multi-turn chat now warm-resumes too. This is adapter-only --
no worker, runtime, or protocol change.

Splicing is guarded so stale ids are never injected: a turn is substituted only
when the incoming assistant message fingerprint-matches the response we returned
(the fingerprint canonicalizes each tool call's JSON arguments before hashing,
so a client that reserializes them with different whitespace or key order -- the
same value -- still matches and resumes, rather than looking like an edited
turn; an edited or branched history, or a session reused for another
conversation -> text fallback; splicing stops at the first divergence and the
now-stale tail is pruned, and a regenerated turn is recorded at its position so
it replaces the stale record instead of shadowing later hits), and only when its
ids faithfully decode to what the client saw -- a stop-string trim kept
post-stop tokens resident but dropped them from the output, so the worker omits
the ids and the turn is re-rendered as text. Sentinel collisions / dropped
sentinels also fall back to text, and the worker's exact-token prefix check
backstops the rest.

The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.

On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); with the scaffold
reproduction, ordinary multi-turn chat reaches exact_prefix on every append turn
rather than re-prefilling the whole prompt. The single-turn AST suite is
unchanged (no prior assistant turn -> plain text prompt).

Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering + per-turn generation-scaffold normalization,
chat_template.generation_preamble, and the serving_chat wiring that builds the
segments, threads the preamble, and counts segments for the context preflight);
then tests and docs.

This change was authored with Claude Code.

Part of #20001

ghstack-source-id: 861cb67
ghstack-comment-id: 4661784137
Pull-Request: #20161
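
The argument canonicalization mentioned above could look roughly like this. It is a sketch, not the module's code: the function name `fingerprint`, the `{"name", "arguments"}` tool-call shape, and the choice of SHA-256 are assumptions for the example.

```python
import hashlib
import json


def fingerprint(content: str, tool_calls: list[dict]) -> str:
    """Hash an assistant turn so reserialized tool arguments still match.

    Each tool call's JSON `arguments` string is parsed and re-dumped with
    sorted keys and fixed separators, so whitespace and key order do not
    affect the hash. Arguments that are not valid JSON are hashed as-is.
    """
    canonical_calls = []
    for call in tool_calls:
        args = call["arguments"]
        try:
            args = json.dumps(json.loads(args), sort_keys=True, separators=(",", ":"))
        except (TypeError, ValueError):
            pass  # keep the raw value when it cannot be parsed as JSON
        canonical_calls.append({"name": call["name"], "arguments": args})
    payload = json.dumps(
        {"content": content, "tool_calls": canonical_calls},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same value, different whitespace and key order.
returned = fingerprint("", [{"name": "get_weather", "arguments": '{"city": "Oslo", "days": 2}'}])
echoed = fingerprint("", [{"name": "get_weather", "arguments": '{"days":2,"city":"Oslo"}'}])
# A genuinely edited turn.
edited = fingerprint("", [{"name": "get_weather", "arguments": '{"city": "Oslo", "days": 3}'}])
```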

[INITIAL] Update

a4851cf

[ghstack-poisoned]

mergennachin requested a review from larryliu0820 as a code owner June 9, 2026 16:23

meta-cla Bot added the CLA Signed label Jun 9, 2026

[UPDATE] Update

0350bfc

[ghstack-poisoned]

[UPDATE] Update

a120263

[ghstack-poisoned]

mergennachin requested a review from kirklandsign as a code owner June 9, 2026 22:43

mergennachin wants to merge 3 commits into gh/mergennachin/10/head from gh/mergennachin/11/head
