extension/llm/server: serving docs and comment cleanup #20193
Open
mergennachin wants to merge 2 commits into
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20193
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Pending, 4 Unclassified Failures
As of commit ad34283 with merge base eeb0646:
NEW FAILURE - The following job has failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This was referenced Jun 10, 2026
[ghstack-poisoned]
mergennachin added a commit that referenced this pull request Jun 10, 2026
ghstack-source-id: f98f407
ghstack-comment-id: 4672992038
Pull-Request: #20193
Documentation/comment-only hardening (no control-flow change), in two parts.
First, document the limitations that matter for local-agent / subagent use, which
were previously only implied. Cancellation is best-effort and subject to
head-of-line blocking: WorkerClient.stop() is a no-op and the worker holds the
single in-flight slot to completion, so a disconnected client doesn't interrupt
it and a long generation blocks other sessions until it finishes; real
interruption needs a protocol change. Warm resume requires true turn terminators
surfaced as terminal/EOS token ids; a string-only terminator marks every turn
dirty and never resumes. These limitations are stated in worker_loop.h,
python/README.md, spec/README.md, and the WorkerClient.stop / SessionRuntime
cancel-path comments.
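
For reviewers who want a concrete picture of the head-of-line limitation, here is a
minimal toy simulation. It is a sketch under stated assumptions, not the real
worker: `ToyWorker`, `submit`, `run`, and the step counter are hypothetical names
invented for illustration. Only the two documented properties are modeled: stop is
a no-op, and one request holds the single in-flight slot until it completes.

```python
from collections import deque


class ToyWorker:
    """Hypothetical single-slot worker (illustration only, not the real worker)."""

    def __init__(self):
        self.queue = deque()  # pending (session_id, num_tokens) requests
        self.log = []         # (session_id, step at which the request finished)

    def submit(self, session_id, num_tokens):
        self.queue.append((session_id, num_tokens))

    def stop(self, session_id):
        # Models the documented limitation: stop does nothing, so neither a
        # queued nor a running generation is interrupted.
        return None

    def run(self):
        step = 0
        while self.queue:
            session_id, num_tokens = self.queue.popleft()
            # The request holds the only in-flight slot for its whole length.
            for _ in range(num_tokens):
                step += 1
            self.log.append((session_id, step))
        return self.log


worker = ToyWorker()
worker.submit("long-session", 1000)
worker.submit("short-session", 5)
worker.stop("long-session")  # e.g. the client disconnected; has no effect
finish_steps = worker.run()
```

In this toy model the short request finishes at step 1005, after the full long
generation, which is the behavior the new comments warn about.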
Second, make the stack read as a stable architecture rather than a migration
diary: remove the intermediate phase labels (V1 / V2 / V2a / V2b.1 / V2b.1.5 and
work-item tags) from all serving and Qwen-worker-example comments, docstrings,
READMEs, and tests; shorten the worker_loop.h top narrative and the
openai_transcript.py module and helper docstrings to their durable contracts;
tighten the hot-path splice and stop-handling comments; de-duplicate the JSONL
protocol (cpp/worker_loop.h is the canonical reference, with worker_client.py and
the READMEs pointing to it) and replace the stale protocol snippet in
python/README.md; clarify prefix/KV reuse in spec/README.md (no global
cross-session prefix cache, but per-session append-only warm resume is
implemented worker-side); and trim the Qwen README session section to
user-facing facts.
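
The prefix/KV reuse point can be illustrated with a small sketch. This is an
assumption-laden toy, not the worker-side code: `ToySession` and `plan_prefill`
are hypothetical names, and the KV cache is modeled as a plain list of token ids.
It shows per-session append-only reuse (only the new suffix is prefilled), a reset
on any token mismatch, and no sharing between sessions.

```python
class ToySession:
    """Hypothetical per-session state (illustration only)."""

    def __init__(self):
        # Token ids assumed to be resident in this session's KV cache.
        self.resident_token_ids = []

    def position(self):
        # In this toy, position is simply the number of resident tokens.
        return len(self.resident_token_ids)

    def plan_prefill(self, prompt_token_ids):
        """Return the token ids that still need prefill for this request."""
        n = len(self.resident_token_ids)
        if prompt_token_ids[:n] == self.resident_token_ids:
            # Warm resume: the prompt extends the resident tokens exactly,
            # so only the appended suffix needs prefill.
            suffix = list(prompt_token_ids[n:])
        else:
            # Exact-token mismatch: drop the resident state and start cold.
            self.resident_token_ids = []
            suffix = list(prompt_token_ids)
        self.resident_token_ids.extend(suffix)
        return suffix


session_a = ToySession()
first = session_a.plan_prefill([1, 2, 3])          # cold start
second = session_a.plan_prefill([1, 2, 3, 4, 5])   # warm resume, suffix only
third = session_a.plan_prefill([1, 9])             # mismatch, reset

session_b = ToySession()
other = session_b.plan_prefill([1, 2, 3])          # no cross-session reuse
```

A real implementation has to handle cases this sketch ignores, such as a prompt
that adds no new tokens.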
Kept: the JSONL/wire protocol contract, the exact-token warm-resume invariant
(mismatch resets), stop-string-trim non-resumability, generated_token_ids
excluding the terminal EOS, the resident_token_ids == session.position()
invariant, the CUDA mutable-state rationale, and the user-visible cancellation /
head-of-line and terminator-vs-stop limitations.
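
As a rough illustration of the terminator-vs-stop contract listed above, the
sketch below post-processes one generated turn. `finish_turn` and its parameters
are hypothetical, and the rule that any turn not ending on a terminal token id is
dirty is an assumption of this sketch rather than a statement about the actual
worker. It models two of the kept invariants: generated_token_ids excludes the
terminal EOS, and a stop-string trim makes the turn non-resumable.

```python
def finish_turn(sampled_token_ids, eos_token_ids, stop_string_trimmed=False):
    """Hypothetical end-of-turn bookkeeping (illustration only).

    Returns (generated_token_ids, resumable).
    """
    ended_on_terminal = (
        bool(sampled_token_ids) and sampled_token_ids[-1] in eos_token_ids
    )
    if ended_on_terminal:
        # The terminal EOS token is not reported as generated output.
        generated_token_ids = list(sampled_token_ids[:-1])
    else:
        generated_token_ids = list(sampled_token_ids)
    # Assumption: only a turn that ends on a terminal token id, with no
    # stop-string trimming of the text, is clean enough to warm-resume.
    resumable = ended_on_terminal and not stop_string_trimmed
    return generated_token_ids, resumable


clean_turn = finish_turn([5, 6, 2], eos_token_ids={2})
trimmed_turn = finish_turn([5, 6, 7], eos_token_ids={2}, stop_string_trimmed=True)
string_only = finish_turn([5, 6, 7], eos_token_ids=set())
```

With an empty `eos_token_ids` set, which stands in for a model whose turn
terminator is known only as a string, no turn is ever resumable in this sketch.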
Behavior-preserving: the full Python serving suite passes; the only non-comment
edits are two diagnostic strings (an error message and a CLI help description).
Part of #20001