extension/llm/server: serving docs and comment cleanup #20193

Open
mergennachin wants to merge 2 commits into gh/mergennachin/11/head from gh/mergennachin/12/head

@mergennachin

Contributor

Documentation/comment-only hardening (no control-flow change), in two parts.

First, document the limitations that matter for local-agent / subagent use, which
were previously only implied. Cancellation is best-effort, and requests are
subject to head-of-line blocking: WorkerClient.stop() is a no-op and the worker
holds the single in-flight slot to completion, so a disconnected client doesn't
interrupt it and a long generation blocks other sessions until it finishes; real
interruption needs a protocol change. Warm resume requires true turn terminators
surfaced as terminal/EOS token ids; a string-only terminator marks every turn
dirty and never resumes. These limitations are now stated in worker_loop.h,
python/README.md, spec/README.md, and the WorkerClient.stop / SessionRuntime
cancel-path comments.
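
To make the head-of-line limitation concrete, here is a minimal toy model, not the actual worker. `ToyRequest`, `SingleSlotWorker`, and the step counts are hypothetical names and numbers chosen for illustration; the only things taken from the description above are the single in-flight slot, run-to-completion behavior, and a `stop()` that does nothing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical request: a session id plus the decode steps it needs."""
    session_id: str
    steps: int


@dataclass
class SingleSlotWorker:
    """Toy model: one in-flight generation at a time, always run to completion."""
    clock: int = 0
    finished_at: dict = field(default_factory=dict)

    def stop(self, request: ToyRequest) -> None:
        # Assumption mirroring the documented limitation: stop is a no-op,
        # so nothing the caller does here shortens the generation.
        return None

    def run(self, queue: list) -> None:
        for request in queue:            # FIFO order, single slot
            self.clock += request.steps  # the slot is held until the last step
            self.finished_at[request.session_id] = self.clock


long_req = ToyRequest("agent-a", steps=1000)
short_req = ToyRequest("agent-b", steps=5)

worker = SingleSlotWorker()
worker.stop(long_req)                # client disconnects; nothing changes
worker.run([long_req, short_req])
print(worker.finished_at)            # {'agent-a': 1000, 'agent-b': 1005}
```

In this model the five-step request finishes at tick 1005 rather than tick 5, which is the blocking behavior a subagent caller has to plan around until the protocol supports real interruption.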

Second, make the stack read as a stable architecture rather than a migration
diary. Remove the intermediate phase labels (V1 / V2 / V2a / V2b.1 / V2b.1.5 and
work-item tags) from all serving and Qwen-worker-example comments, docstrings,
READMEs, and tests. Shorten the worker_loop.h top narrative and the
openai_transcript.py module and helper docstrings to their durable contracts.
Tighten the hot-path splice and stop-handling comments. De-duplicate the JSONL
protocol (cpp/worker_loop.h is the canonical reference, with worker_client.py and
the READMEs pointing to it) and replace the stale protocol snippet in
python/README.md. Clarify prefix/KV reuse in spec/README.md (no global
cross-session prefix cache, but per-session append-only warm resume is
implemented worker-side). Trim the Qwen README session section to user-facing
facts.

Kept: the JSONL/wire protocol contract, the exact-token warm-resume invariant
(mismatch resets), stop-string-trim non-resumability, generated_token_ids
excluding the terminal EOS, the resident_token_ids == session.position()
invariant, the CUDA mutable-state rationale, and the user-visible cancellation /
head-of-line and terminator-vs-stop limitations.
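
The warm-resume rules listed above can be sketched in a few lines. This is a toy under stated assumptions, not the worker's implementation: `ToySession`, `prefill_plan`, `commit`, `ended_on_terminal_id`, and the token id 99 standing in for EOS are all hypothetical, and the invariant is read here as "the number of resident tokens equals `position()`".

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical per-session state for append-only warm resume."""
    resident_token_ids: list = field(default_factory=list)
    dirty: bool = False

    def position(self) -> int:
        # Assumed reading of the invariant: position tracks resident tokens.
        return len(self.resident_token_ids)

    def prefill_plan(self, prompt_token_ids: list) -> list:
        """Return the tokens that still need prefill for this turn."""
        resident = self.resident_token_ids
        can_resume = (
            not self.dirty
            and len(prompt_token_ids) >= len(resident)
            and prompt_token_ids[: len(resident)] == resident
        )
        if not can_resume:
            # Any exact-token mismatch (or a dirty turn) resets the session.
            self.resident_token_ids = []
            self.dirty = False
            return list(prompt_token_ids)
        return list(prompt_token_ids[len(resident):])

    def commit(self, prompt_token_ids, generated_token_ids, ended_on_terminal_id):
        # generated_token_ids excludes the terminal EOS, as in the description.
        self.resident_token_ids = list(prompt_token_ids) + list(generated_token_ids)
        # A turn ended by a stop string (text was trimmed) cannot be resumed.
        self.dirty = not ended_on_terminal_id


EOS = 99  # hypothetical terminal token id
session = ToySession()

first_prompt = [1, 2, 3]
first_prefill = session.prefill_plan(first_prompt)      # cold: full prompt
session.commit(first_prompt, [10, 11], ended_on_terminal_id=True)

second_prompt = first_prompt + [10, 11, EOS, 4, 5]
second_prefill = session.prefill_plan(second_prompt)    # warm: suffix only
session.commit(second_prompt, [12], ended_on_terminal_id=False)  # stop string

third_prompt = second_prompt + [12, EOS, 6]
third_prefill = session.prefill_plan(third_prompt)      # dirty: full prefill
```

The second turn prefills only the suffix `[99, 4, 5]`, while the third turn falls back to a full prefill because the previous turn ended on a stop string rather than a terminal token id. That is the reason a string-only terminator never resumes.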

Behavior-preserving: the full Python serving suite passes; the only non-comment
edits are two diagnostic strings (an error message and a CLI help description).

Part of #20001

@pytorch-bot

pytorch-bot Bot commented Jun 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20193

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Pending, 4 Unclassified Failures

As of commit ad34283 with merge base eeb0646:

NEW FAILURE - The following job has failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mergennachin added a commit that referenced this pull request Jun 10, 2026
ghstack-source-id: f98f407
ghstack-comment-id: 4672992038
Pull-Request: #20193
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
