From e773487a6ce2d29d93bf9d11c4b0a49a5de6dd04 Mon Sep 17 00:00:00 2001
From: Jonathan Irvin
Date: Sun, 3 May 2026 22:10:34 -0500
Subject: [PATCH] docs(CLAUDE.md): refresh LLM provider routing notes for new src/llm/ architecture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The April-16 gotchas section was stale after the upstream sync (PR #4):

- Legacy 'cf' / 'custom' provider tags removed — replaced by ModelTransport
  literal (anthropic/openai/gemini) and per-component MODEL_CONFIG__* env vars
- 'deriver/summary must stay on cf' rule no longer applies — native gemini
  backend (src/llm/backends/gemini.py) honors response_format=json_schema
- thoughtSignature multi-iteration workaround obsolete — preserved natively in
  src/llm/history_adapters.py + src/llm/executor.py
- LM Studio section: env var names switched from LLM_OPENAI_COMPATIBLE_*
  globals to MODEL_CONFIG__OVERRIDES__BASE_URL / __API_KEY per component

Adds a note that CF Gateway integration is now app-level (cf-aig-authorization
auto-injected by src/llm/registry.py and src/embedding_client.py based on
base_url pattern matching) rather than deployment-level URL routing.

Ollama Cloud structured-output limitation kept — that's still a real upstream
constraint, just rephrased for the new transport model.
---
 CLAUDE.md | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 27cde420..f96b9a89 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -84,21 +84,25 @@ All API routes follow the pattern: `/v1/{resource}/{id}/{action}`
 - Typechecking: `uv run basedpyright`
 - Format code: `uv run ruff format src/`
 
-### LLM provider gotchas (learned 2026-04-16 in k8s deploy)
+### LLM provider routing (current as of 2026-05-04 upstream sync)
 
-- **Structured outputs (`response_format={"type": "json_schema"}`) only work on providers whose upstream API natively honors them.** Google Gemini does (route via `cf` provider with base_url ending in `/openai`). Ollama Cloud (reached via the `custom` provider + `custom-ollama` CF gateway endpoint, or any direct Ollama endpoint) does **not** translate `response_format` into Ollama's native JSON-mode — every Ollama Cloud model (GLM-5.1, nemotron-3-nano, qwen3.5, devstral-small-2 confirmed) returns free-form text/markdown when a schema is requested, and `honcho_llm_call` bubbles a `ValidationError: Invalid JSON` out of pydantic parsing.
-- **Therefore: deriver (`src/deriver/deriver.py:126`) and summary (`src/utils/summarizer.py`) must stay on a Gemini-backed `cf` provider.** Dream, dialectic, and any free-form / tool-call path is free to use the `custom` provider.
-- **Gemini `thoughtSignature` round-tripping breaks on the CF `openai`-compat route.** Any call with `maxToolIterations > 1` AND `thinkingBudgetTokens > 0` will return `400 Function call is missing a thought_signature` on iteration 2+. If you need thinking on a multi-iteration tool loop, use the native Gemini provider, not the OpenAI-compat route — or set `thinkingBudgetTokens=0`.
-- **None of this is Cloudflare's fault.** CF AI Gateway is a transparent proxy in both the `openai` and `custom-ollama` routes. The limitations live at the upstream provider (Ollama Cloud's OpenAI-compat layer).
+The legacy `cf` and `custom` provider tags are gone. Transport is `Literal["anthropic", "openai", "gemini"]` only — see `src/llm/registry.py`. Per-component routing happens via `MODEL_CONFIG__*` env vars nested under each component's settings prefix (Pydantic settings with `env_nested_delimiter="__"`).
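+
+A minimal sketch of the resulting env shape (the component prefixes `DERIVER_` / `SUMMARY_`, the model ids, and the gateway path below are illustrative assumptions, not canonical values; check each component's settings class and the CF Gateway bullets below for the real names):
+
+```bash
+# Hypothetical per-component routing via nested settings (env_nested_delimiter="__").
+DERIVER_MODEL_CONFIG__TRANSPORT=gemini
+DERIVER_MODEL_CONFIG__MODEL=gemini-2.5-flash            # illustrative model id
+SUMMARY_MODEL_CONFIG__TRANSPORT=anthropic
+SUMMARY_MODEL_CONFIG__MODEL=claude-sonnet-4-5           # illustrative model id
+
+# Optional per-component base_url override; the CF Gateway auth header is
+# auto-injected when the URL matches gateway.ai.cloudflare.com (see below).
+LLM_CF_GATEWAY_AUTH_TOKEN=...
+DERIVER_MODEL_CONFIG__OVERRIDES__BASE_URL=https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/google-ai-studio
+```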
+
+- **CF Gateway integration is app-level now**, not deployment-level. `src/llm/registry.py` and `src/embedding_client.py` auto-inject `cf-aig-authorization: Bearer $LLM_CF_GATEWAY_AUTH_TOKEN` on any override client whose `base_url` contains `gateway.ai.cloudflare.com`. Set `LLM_CF_GATEWAY_AUTH_TOKEN` once globally; everything else is configured per component via `OVERRIDES__BASE_URL`.
+- **Native Gemini works for json_schema.** The new `GeminiBackend` (`src/llm/backends/gemini.py`) talks Gemini's native protocol — `response_format=json_schema` is honored server-side. Route through CF Gateway with `base_url: https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/google-ai-studio` (note: no `/openai` suffix — that path was the old OpenAI-compat shim that silently dropped `json_schema`; deriver/summary used to need workarounds for it).
+- **Native Gemini also fixes `thoughtSignature` round-tripping** — `src/llm/history_adapters.py:77-78` and `src/llm/executor.py:43-44` preserve it across tool iterations. The old "set `thinkingBudgetTokens=0` for multi-iteration tool loops" workaround is no longer needed.
+- **Ollama Cloud routing**: `transport: openai` + `base_url: https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/custom-ollama`. Point `MODEL_CONFIG__OVERRIDES__API_KEY_ENV` at the env var that holds the Ollama Cloud key so the secret is referenced, not duplicated. Note that `_uses_max_completion_tokens()` in `src/llm/backends/openai.py:21` only fires for gpt-5/o-series models — Ollama Cloud chat models stay on `max_tokens`.
+- **`response_format=json_schema` still doesn't work over Ollama Cloud's OpenAI-compat layer.** Free-form / tool-call paths are fine; structured-output paths must use a transport whose upstream honors schemas (anthropic, openai/gpt-5+, or gemini-native).
+- **CF AI Gateway** remains a transparent proxy. Limitations are upstream-side; the `cf-aig-authorization` header is the only CF-specific concern in app code.
 
 ### Local LM Studio Setup
 
-- Honcho can use LM Studio for generation through the `custom` provider path.
+- Honcho can use LM Studio via `transport: openai` + `MODEL_CONFIG__OVERRIDES__BASE_URL: http://localhost:1234/v1`.
 - Keep `LLM_OPENAI_API_KEY` configured for embeddings unless embedding support is added for local models.
-- For Docker Compose, `LLM_OPENAI_COMPATIBLE_BASE_URL` must be `http://host.docker.internal:1234/v1`, not `http://localhost:1234/v1`.
-- `LLM_OPENAI_COMPATIBLE_API_KEY=lm-studio` is sufficient for local use.
+- For Docker Compose, the per-component `MODEL_CONFIG__OVERRIDES__BASE_URL` must be `http://host.docker.internal:1234/v1`, not `http://localhost:1234/v1` (see the env sketch at the end of this section).
+- Pass `MODEL_CONFIG__OVERRIDES__API_KEY: lm-studio` (or any non-empty placeholder); LM Studio doesn't validate it.
 - Current local default model is `qwen2.5-14b-instruct`.
-- When overriding `DIALECTIC_LEVELS__*` via env vars, each level needs its full required settings, not just `PROVIDER` and `MODEL`. Include `THINKING_BUDGET_TOKENS` and `MAX_TOOL_ITERATIONS`, and optionally `MAX_OUTPUT_TOKENS`.
+- When overriding `DIALECTIC_LEVELS__*` via env vars, each level needs its full required settings, not just `MODEL_CONFIG__TRANSPORT` and `__MODEL`. Include `__THINKING_BUDGET_TOKENS` and `MAX_TOOL_ITERATIONS`, and optionally `MAX_OUTPUT_TOKENS`. For fallback models, use the nested `__MODEL_CONFIG__FALLBACK__TRANSPORT` / `__MODEL` shape (a full example appears at the end of this section).
 - Docker should own the runtime environment completely. Do not mount the repo onto `/app` and do not mount a named volume onto `/app/.venv`, or the image-built environment can be hidden and replaced with incompatible artifacts.
 - If Docker services fail with missing Python modules or incompatible native extensions, rebuild the image instead of trying to repair the environment in-place:
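+
+  One plausible recovery sequence (the exact commands are an assumption; adjust to this repo's actual compose setup):
+
+  ```bash
+  # Rebuild the image and recreate containers rather than patching .venv in place
+  docker compose build --no-cache
+  docker compose up -d --force-recreate
+  ```
+
+For reference, a hedged dotenv-style sketch of the LM Studio override and a full dialectic-level override mentioned above. The component prefix (`DIALECTIC_`), the level key (`HIGH`), and all values are illustrative assumptions; mirror the names the settings classes actually define:
+
+```bash
+# LM Studio from inside Docker Compose: host.docker.internal, not localhost.
+DIALECTIC_MODEL_CONFIG__OVERRIDES__BASE_URL=http://host.docker.internal:1234/v1
+DIALECTIC_MODEL_CONFIG__OVERRIDES__API_KEY=lm-studio    # placeholder; LM Studio ignores it
+
+# A full DIALECTIC_LEVELS override: every required field, not just transport/model.
+DIALECTIC_LEVELS__HIGH__MODEL_CONFIG__TRANSPORT=openai
+DIALECTIC_LEVELS__HIGH__MODEL_CONFIG__MODEL=qwen2.5-14b-instruct
+DIALECTIC_LEVELS__HIGH__MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
+DIALECTIC_LEVELS__HIGH__MAX_TOOL_ITERATIONS=1
+DIALECTIC_LEVELS__HIGH__MAX_OUTPUT_TOKENS=4096
+DIALECTIC_LEVELS__HIGH__MODEL_CONFIG__FALLBACK__TRANSPORT=anthropic
+DIALECTIC_LEVELS__HIGH__MODEL_CONFIG__FALLBACK__MODEL=claude-sonnet-4-5   # illustrative fallback
+```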