extension/llm/server: worker-based OpenAI-compatible HTTP server by mergennachin · Pull Request #19994 · pytorch/executorch

mergennachin · 2026-06-03T22:32:43Z

Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and
non-streaming), /v1/models, /health. Request validation rejects parameters the
server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties,
top_k, logit_bias, logprobs, response_format other than text, non-positive
max_tokens, tool_choice = required / specific function) instead of silently
ignoring them; stop sequences are applied before tool parsing; usage is reported.
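
The reject-instead-of-ignore rule above can be sketched as a plain function over the request body. This is an illustration only, not the PR's validation code: `unsupported_params` is a hypothetical name, and treating a penalty of 0 or an explicit `null` as "not set" is an assumption.

```python
from typing import Any


def unsupported_params(body: dict[str, Any]) -> list[str]:
    """Return the request fields this sketch would reject (hypothetical helper)."""
    bad: list[str] = []
    top_p = body.get("top_p")
    if top_p is not None and top_p != 1:
        bad.append("top_p")
    if (body.get("n") or 1) > 1:
        bad.append("n")
    for key in ("frequency_penalty", "presence_penalty"):
        if body.get(key):  # assumption: 0 and null count as "not set"
            bad.append(key)
    for key in ("seed", "top_k", "logit_bias"):
        if body.get(key) is not None:
            bad.append(key)
    if body.get("logprobs"):
        bad.append("logprobs")
    fmt = body.get("response_format")
    if fmt is not None and fmt.get("type") != "text":
        bad.append("response_format")
    max_tokens = body.get("max_tokens")
    if max_tokens is not None and max_tokens <= 0:
        bad.append("max_tokens")
    choice = body.get("tool_choice")
    # "required" or a specific function object cannot be honored.
    if choice == "required" or isinstance(choice, dict):
        bad.append("tool_choice")
    return bad
```

A caller would presumably turn a non-empty result into a 4xx error naming the offending fields, so clients learn that a parameter was not applied rather than getting silently different sampling.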

The Python process is control plane only: it loads no model and imports no
runtime pybind. Model execution runs in a separate C++ worker process
(cpp/text_llm_worker.cpp, over TextLLMEngine/TextLLMSession) that the control
plane spawns and drives over a small JSONL protocol (worker_client.py). The
protocol and the decode loop (reset, encode, context clamp, prefill, decode,
UTF-8 assembly, stop handling, stats, finish_reason) live in a shared header,
cpp/worker_loop.h, so model-specific workers reuse them; text_llm_worker only
constructs the engine/session and runs the loop.
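
For readers unfamiliar with JSONL framing, here is a minimal sketch of one-message-per-line transport. The message shapes (`type`, `text`, `finish_reason` as top-level keys) and the `send`/`receive` helpers are assumptions for illustration; the actual wire format is defined in cpp/worker_loop.h and worker_client.py and may differ. An in-memory stream stands in for the worker's stdout.

```python
import io
import json
from typing import Any, Iterator, TextIO


def send(stream: TextIO, message: dict[str, Any]) -> None:
    """Write one message as a single JSON line and flush it."""
    # json.dumps escapes embedded newlines, so each message stays on one line.
    stream.write(json.dumps(message, separators=(",", ":")) + "\n")
    stream.flush()


def receive(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield messages until a terminal 'done' record or end of stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        yield message
        if message.get("type") == "done":  # hypothetical terminal record
            return


# In-memory stand-in for the worker's stdout pipe.
worker_stdout = io.StringIO()
send(worker_stdout, {"type": "token", "text": "Hel"})
send(worker_stdout, {"type": "token", "text": "lo"})
send(worker_stdout, {"type": "done", "finish_reason": "stop"})
worker_stdout.seek(0)

replies = list(receive(worker_stdout))
text = "".join(m["text"] for m in replies if m["type"] == "token")
```

Line-delimited JSON keeps the control plane free of any runtime binding: either side can be replaced (for example by a fake worker in tests) as long as it reads and writes the same lines.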

The Python execution boundary is ServingChat -> SessionRuntime -> WorkerClient
-> C++ worker. ServingChat is a thin OpenAI adapter (protocol, templating, tool
parsing, streaming/SSE). SessionRuntime is the stateful runtime over a single
WorkerClient: it serializes the worker (one in-flight request) and bridges the
worker's blocking generate() into an async token stream. WorkerClient is raw
JSONL transport. There is no RunnerPool and no multi-worker scheduling/affinity
in this milestone; concurrent requests queue.
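
The serialize-and-bridge role described above can be sketched with an `asyncio.Lock` and a pump thread. `SerializedRuntime` and `fake_generate` are hypothetical stand-ins, not the PR's SessionRuntime; error propagation and early client disconnects are deliberately left out.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterator

_DONE = object()  # end-of-stream sentinel


class SerializedRuntime:
    """Hypothetical stand-in: one in-flight request, blocking -> async stream."""

    def __init__(self, generate: Callable[[str], Iterator[str]]) -> None:
        self._generate = generate  # blocking generator, stands in for the worker call
        self._lock = asyncio.Lock()

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        async with self._lock:  # concurrent callers queue here
            loop = asyncio.get_running_loop()
            queue: asyncio.Queue = asyncio.Queue()

            def pump() -> None:
                # Runs on a helper thread and hands tokens to the event loop.
                try:
                    for token in self._generate(prompt):
                        loop.call_soon_threadsafe(queue.put_nowait, token)
                finally:
                    loop.call_soon_threadsafe(queue.put_nowait, _DONE)

            threading.Thread(target=pump, daemon=True).start()
            while True:
                item = await queue.get()
                if item is _DONE:
                    break
                yield item


def fake_generate(prompt: str) -> Iterator[str]:
    for word in prompt.split():
        yield word.upper()


async def main() -> list[list[str]]:
    runtime = SerializedRuntime(fake_generate)

    async def collect(prompt: str) -> list[str]:
        return [tok async for tok in runtime.stream(prompt)]

    # Two concurrent requests; the second waits for the first to finish.
    return await asyncio.gather(collect("a b"), collect("c d e"))


results = asyncio.run(main())
```

Because the lock is held for the whole stream, tokens from the two requests never interleave, which matches the "concurrent requests queue" behavior stated for this milestone.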

SessionRuntime is introduced here as the stable control-plane boundary for the
rest of the stack: its method/field surface (session_id routing, reset, warm-
resume stats on GenStats, token-ID prompt_segments on PromptInput/_WorkerRequest)
is defined once, but the behavior and tests that activate those features land in
their natural later commits -- named-session routing/admission (V2a), warm
append-only resume (V2b.1), and token-ID prompt segments (V2b.1.5). This keeps
the boundary stable for whole-stack review instead of re-shaping it every commit.

There is no prefix cache and no Python-side KV state; cancellation is
best-effort (the control plane stops consuming, the worker finishes the
in-flight request). Hermetic tests (a FakeRunner worker) cover the contract,
templating, sampling params, tool calls, the runtime, and the worker protocol;
conformance/ is a black-box suite runnable against any live OpenAI server.
READMEs document the flags and scope.
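
Best-effort cancellation, as described above, means the consumer walks away while the producer runs to completion. A toy sketch with a thread and a queue, where the `worker` function and the five-token request are made up for illustration:

```python
import queue
import threading


def worker(out: queue.Queue, produced: list[str]) -> None:
    # Stand-in for the worker: it always runs the request to completion.
    for i in range(5):
        token = f"t{i}"
        produced.append(token)
        out.put(token)
    out.put(None)  # end-of-stream marker


out: queue.Queue = queue.Queue()
produced: list[str] = []
thread = threading.Thread(target=worker, args=(out, produced))
thread.start()

consumed: list[str] = []
while len(consumed) < 2:  # the client "disconnects" after two tokens
    consumed.append(out.get())

thread.join()  # the worker still finishes its in-flight request
```

After the join, `consumed` holds two tokens while `produced` holds all five: the cost of a cancelled request is the remaining decode time, during which queued requests keep waiting.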

Depends on the serving foundations.

Part of #20001

[ghstack-poisoned]

mergennachin · 2026-06-03T22:32:44Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-03T22:32:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19994

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit 1703518 with merge base f0dff03:

NEW FAILURES - The following jobs have failed:

pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 7b061a330021a0c23048e4b71c5e2bcbf065a09f0043a686b1ac0183982a9fc4 /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / unittest / macos / macos-job (gh) (detected as infra flaky with no log or failing log classifier)
pull / unittest-editable / linux / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and non-streaming), /v1/models, /health. Request validation rejects parameters the server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties, top_k, logit_bias, logprobs, response_format other than text, non-positive max_tokens, tool_choice = required / specific function) instead of silently ignoring them; stop sequences are applied before tool parsing; client cancellation calls runner.stop(); usage is reported.

runner_pool admits physical sessions per the engine's serving_capacity() (single-slot on XNNPACK, with concurrent requests queueing on the resident session) and routes by prefix affinity.

Hermetic tests (FakeRunner via dependency injection) cover the contract, templating, sampling params, tool calls and the pool; conformance/ is a black-box suite runnable against any live OpenAI server. READMEs document the flags and scope.

Last of four stacked commits; depends on the bindings and serving foundations.

ghstack-source-id: acef8e6
ghstack-comment-id: 4617263008
Pull-Request: #19994

[ghstack-poisoned]

[INITIAL] Update

b644ddd

[ghstack-poisoned]

mergennachin requested a review from larryliu0820 as a code owner June 3, 2026 22:32

This was referenced Jun 3, 2026

extension/llm/runner: Engine/Session C++ core + token-step primitives #19991

Open

extension/llm/server: serving foundations (schemas, errors, templating, tools) #19993

Open

extension/llm/runner: Python bindings for the Engine/Session API #19992

Closed

meta-cla Bot added the CLA Signed label Jun 3, 2026

mergennachin requested review from Gasoonjia, GregoryComer, digantdesai, kirklandsign and psiddh June 3, 2026 22:35

[UPDATE] Update

bc4cdc4

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 3, 2026

extension/llm/server: document pi integration #19999

Open

mergennachin added 3 commits June 3, 2026 16:14

[UPDATE] Update

c537c76

[ghstack-poisoned]

[UPDATE] Update

a2a707c

[ghstack-poisoned]

[UPDATE] Update

d2ed65f

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 4, 2026

examples/models/qwen3_5_moe: CUDA Engine/Session adapter + OpenAI serving #20043

Open

mergennachin marked this pull request as draft June 4, 2026 18:51

mergennachin added 3 commits June 4, 2026 15:14

[UPDATE] Update

6777e50

[ghstack-poisoned]

[UPDATE] Update

22e0fdb

[ghstack-poisoned]

[UPDATE] Update

921d819

[ghstack-poisoned]

mergennachin marked this pull request as ready for review June 5, 2026 18:59

[UPDATE] Update

2dae19c

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 8, 2026

Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions #20117

Open

mergennachin changed the title ~~extension/llm/server: OpenAI-compatible HTTP server~~ extension/llm/server: worker-based OpenAI-compatible HTTP server Jun 8, 2026

[UPDATE] Update

b433c1b

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 9, 2026

extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5) #20161

Open

This was referenced Jun 9, 2026

extension/llm/server: isolated multi-session serving (V2a) #20159

Open

extension/llm/server: warm append-only session resume (V2b.1) #20160

Open

[UPDATE] Update

0b7e448

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 10, 2026

extension/llm/server: serving docs and comment cleanup #20193

Open

[UPDATE] Update

1703518

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 10, 2026

examples/models/gemma4_31b: CUDA Engine/Session adapter + OpenAI serving #20207

Open

mergennachin wants to merge 12 commits into gh/mergennachin/4/head from gh/mergennachin/5/head
