extension/llm/server: serving docs and comment cleanup by mergennachin · Pull Request #20193 · pytorch/executorch

mergennachin · 2026-06-10T18:03:06Z

Documentation/comment-only hardening (no control-flow change), in two parts.

First, document the limitations that matter for local-agent / subagent use, which
were previously only implied. Cancellation is best-effort and generation is
subject to head-of-line blocking: WorkerClient.stop() is a no-op and the worker
holds the single in-flight slot to completion, so a disconnected client doesn't
interrupt it and a long generation blocks other sessions until it finishes; real
interruption needs a protocol change. Warm resume requires true turn terminators
surfaced as terminal/EOS token ids -- a string-only terminator marks every turn
dirty and never resumes. These limitations are now stated in worker_loop.h,
python/README.md, spec/README.md, and the WorkerClient.stop / SessionRuntime
cancel-path comments.
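
For readers unfamiliar with the head-of-line limitation, here is a minimal sketch of the behavior described above. It is a toy model, not the real worker: `ToyWorker`, its `generate` / `stop` methods, and the `demo` helper are hypothetical names invented for illustration, and the single in-flight slot is modeled with a plain lock.

```python
import threading
import time


class ToyWorker:
    """Hypothetical worker with one in-flight slot (not the real worker)."""

    def __init__(self):
        self._slot = threading.Lock()  # the single in-flight slot
        self.completed = []            # session ids, in completion order

    def generate(self, session_id, num_tokens, started=None):
        with self._slot:               # held until the generation finishes
            if started is not None:
                started.set()          # signal that the slot is now occupied
            for _ in range(num_tokens):
                time.sleep(0.005)      # stand-in for one decode step
            self.completed.append(session_id)
        return num_tokens

    def stop(self):
        """Best-effort cancel: a no-op, mirroring the limitation above."""
        return None


def demo():
    worker = ToyWorker()
    started = threading.Event()
    long_turn = threading.Thread(
        target=worker.generate, args=("long", 20, started)
    )
    long_turn.start()
    started.wait()                 # the long generation now owns the slot
    worker.stop()                  # client "disconnects"; nothing is interrupted
    worker.generate("short", 1)    # blocks until the long generation completes
    long_turn.join()
    return worker.completed
```

In this model the short request always completes after the long one, because `stop()` does nothing and the slot is released only when the long generation ends.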

Second, make the stack read as a stable architecture rather than a migration
diary: remove the intermediate phase labels (V1 / V2 / V2a / V2b.1 / V2b.1.5 and
work-item tags) from all serving and Qwen-worker-example comments, docstrings,
READMEs, and tests; shorten the worker_loop.h top narrative and the
openai_transcript.py module and helper docstrings to their durable contracts;
tighten the hot-path splice and stop-handling comments; de-duplicate the JSONL
protocol (cpp/worker_loop.h is the canonical reference, with worker_client.py and
the READMEs pointing to it) and replace the stale protocol snippet in
python/README.md; clarify prefix/KV reuse in spec/README.md (no global
cross-session prefix cache, but per-session append-only warm resume is
implemented worker-side); and trim the Qwen README session section to
user-facing facts.

Kept: the JSONL/wire protocol contract, the exact-token warm-resume invariant
(mismatch resets), stop-string-trim non-resumability, generated_token_ids
excluding the terminal EOS, the resident_token_ids == session.position()
invariant, the CUDA mutable-state rationale, and the user-visible cancellation /
head-of-line and terminator-vs-stop limitations.
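
The warm-resume rules kept here (exact-token prefix match, reset on mismatch, no resume after a stop-string trim) can be illustrated with a small sketch. This is an assumption-laden toy, not the worker's code: `ToySession`, `plan_prefill`, `finish_turn`, and the `dirty` flag are hypothetical names, and `position()` is simply defined as the length of `resident_token_ids`, so the sketch models the stated invariant as a length equality.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySession:
    """Hypothetical per-session state; not the real session type."""
    resident_token_ids: List[int] = field(default_factory=list)
    dirty: bool = False  # assumed flag: set when a turn cannot be resumed

    def position(self) -> int:
        # In this toy the position is the resident token count.
        return len(self.resident_token_ids)


def plan_prefill(session: ToySession, prompt_token_ids: List[int]) -> List[int]:
    """Return the tokens that still need prefill, resetting on any mismatch."""
    resident = session.resident_token_ids
    n = len(resident)
    # Warm resume only if the session is clean and the new prompt extends the
    # resident tokens exactly (append-only).
    warm = (not session.dirty) and prompt_token_ids[:n] == resident
    if not warm:
        session.resident_token_ids = []  # mismatch or dirty turn: full reset
        session.dirty = False
        n = 0
    to_prefill = list(prompt_token_ids[n:])
    session.resident_token_ids = session.resident_token_ids + to_prefill
    return to_prefill


def finish_turn(session: ToySession, generated_token_ids: List[int],
                trimmed_by_stop_string: bool) -> None:
    """Record a turn's output; a stop-string trim marks the session dirty."""
    session.resident_token_ids = session.resident_token_ids + list(generated_token_ids)
    if trimmed_by_stop_string:
        session.dirty = True
```

Under these assumptions, a follow-up prompt that extends the resident tokens exactly needs only its new suffix prefilled, while a diverging prompt or a turn ended by a stop string forces a full prefill.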

Behavior-preserving: the full Python serving suite passes; the only non-comment
edits are two diagnostic strings (an error message and a CLI help description).

Part of #20001

[ghstack-poisoned]

mergennachin · 2026-06-10T18:03:07Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-10T18:03:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20193

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Pending, 4 Unclassified Failures

As of commit ad34283 with merge base eeb0646:

NEW FAILURE - The following job has failed:

MLX / test-mlx-llm (unsloth/gemma-3-1b-it, gemma3-1b, true, nvfp4, macos-14-xlarge) / test-mlx-llm-gemma3-1b-custom-nvfp4 (gh)
The process '/opt/homebrew/bin/git' failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Windows Wheels / pytorch/executorch / build-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Process completed with exit code 1.
Build Windows Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_x64
pull (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Test CoreML Backend (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

ghstack-source-id: f98f407
ghstack-comment-id: 4672992038
Pull-Request: #20193

[INITIAL] Update

22dd42e

[ghstack-poisoned]

mergennachin requested a review from larryliu0820 as a code owner June 10, 2026 18:03

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 10, 2026

[UPDATE] Update

ad34283

[ghstack-poisoned]

mergennachin wants to merge 2 commits into gh/mergennachin/11/head from gh/mergennachin/12/head
