Add comprehensive queue health monitoring and observability#3
Merged
offendingcommit merged 11 commits into `upstream-sync` on Apr 25, 2026
Conversation
The surprisal observation fetch passed a list directly as the filter
value ({"level": [...]}), which generated invalid SQL (level = ARRAY)
instead of level IN (...). Use the {"in": [...]} operator syntax.
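The shape of the fix can be sketched as follows. `filter_to_sql` is a hypothetical helper for illustration, not the project's actual query builder:

```python
def filter_to_sql(column: str, value: object) -> str:
    """Hypothetical sketch of how a filter value maps to a SQL predicate."""
    if isinstance(value, dict) and "in" in value:
        # Operator form: {"in": [...]} produces a membership test.
        placeholders = ", ".join(repr(v) for v in value["in"])
        return f"{column} IN ({placeholders})"
    if isinstance(value, list):
        # The buggy path this commit avoids: a bare list becomes an ARRAY
        # equality (level = ARRAY[...]), which is not a membership test.
        return f"{column} = ARRAY[{', '.join(repr(v) for v in value)}]"
    return f"{column} = {value!r}"
```
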
… and Grafana dashboards

Add Docker infrastructure for local development with LM Studio as LLM provider, Prometheus metrics collection with custom histograms, Traefik reverse proxy configuration, and Grafana dashboard provisioning. Update SDK session handling and deriver queue management for improved reliability.
…hinking models

- Add `cf` provider (Cloudflare AI Gateway) to SupportedProviders and initialize an AsyncOpenAI client pointed at CF_GATEWAY_BASE_URL
- Route OpenAI embeddings through CF Gateway when LLM_OPENAI_BASE_URL is set
- Convert tools to OpenAI format for the `cf` provider (was missing from the provider list)
- Extract thought_signature from OpenAI-compat tool call responses and re-include it when formatting assistant messages for multi-turn replay; fixes 400 INVALID_ARGUMENT from Gemini thinking models via CF Gateway
- Preserve thought_signature in the `_format_assistant_tool_message` else branch
- Increase the DERIVER_MAX_INPUT_TOKENS upper bound (23000 → 200000) to allow higher limits via config
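The thought_signature round-trip can be sketched like this. Field names and the message shape are assumptions based on the commit description, not the fork's actual code:

```python
from typing import Any


def format_assistant_tool_message(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical sketch: preserve thought_signature across multi-turn replay.

    Gemini thinking models behind an OpenAI-compatible gateway reject replayed
    tool calls whose thought_signature is missing, so the signature captured
    from the original response must be re-included verbatim.
    """
    message: dict[str, Any] = {
        "role": "assistant",
        "tool_calls": [
            {
                "id": tool_call["id"],
                "type": "function",
                "function": tool_call["function"],
            }
        ],
    }
    # Re-include the signature extracted from the original response, if any.
    signature = tool_call.get("thought_signature")
    if signature is not None:
        message["tool_calls"][0]["thought_signature"] = signature
    return message
```
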
When CF_GATEWAY_AUTH_TOKEN is set, inject cf-aig-authorization header into the custom client so CF Gateway-proxied custom providers (e.g. custom-ollama) authenticate correctly at the gateway layer.
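The injection amounts to adding one header when the token is configured; Cloudflare's AI Gateway expects `cf-aig-authorization: Bearer <token>` on proxied requests. A minimal sketch (the helper name is hypothetical):

```python
def gateway_headers(env: dict[str, str]) -> dict[str, str]:
    """Sketch: extra headers for a CF Gateway-proxied custom client.

    Returns the cf-aig-authorization header only when CF_GATEWAY_AUTH_TOKEN
    is set, so non-gateway deployments are unaffected.
    """
    token = env.get("CF_GATEWAY_AUTH_TOKEN")
    if not token:
        return {}
    return {"cf-aig-authorization": f"Bearer {token}"}
```
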
…rides

Adds DEDUCTION_PROVIDER/INDUCTION_PROVIDER and matching THINKING_BUDGET_TOKENS settings so the deduction and induction specialists can route to a different provider than the main DREAM config. Also propagates thinking_budget_tokens into the LLM call and documents the CF Gateway / Gemini thought_signature gotchas in CLAUDE.md.
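The override-with-fallback pattern can be sketched as below; the helper and its lookup scheme are illustrative assumptions, not the fork's settings code:

```python
def specialist_provider(env: dict[str, str], specialist: str, default: str) -> str:
    """Hypothetical sketch: per-specialist provider override with fallback.

    Looks up e.g. DEDUCTION_PROVIDER for the "deduction" specialist and falls
    back to the main config's provider when the override is unset.
    """
    return env.get(f"{specialist.upper()}_PROVIDER", default)
```
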
Allows deployments (e.g. the infra chart) to configure CORS origins via a comma-separated CORS_ORIGINS env var instead of relying on the hardcoded list. Falls back to the previous defaults when unset.
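The parsing described above is roughly the following; the default list here is a placeholder, not the project's actual hardcoded origins:

```python
DEFAULT_CORS_ORIGINS = ["http://localhost:3000"]  # placeholder for the previous hardcoded list


def cors_origins(env: dict[str, str]) -> list[str]:
    """Sketch: parse a comma-separated CORS_ORIGINS env var.

    Whitespace around entries is tolerated; an unset or empty variable
    falls back to the previous defaults.
    """
    raw = env.get("CORS_ORIGINS", "")
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    return origins or DEFAULT_CORS_ORIGINS
```
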
Resolve conflicts between fork-only commits (CF Gateway auth, Gemini thought_signature fix, LM Studio/Prometheus/Traefik stack, dreamer specialist overrides) and upstream's new src/llm/ transport-based abstraction that replaces src/utils/clients.py.

Port decisions:

- Dropped fork's cf / custom / vllm / groq providers; superseded by the new ModelConfig base_url/api_key override mechanism.
- Kept OPENAI_BASE_URL and CF_GATEWAY_AUTH_TOKEN on LLMSettings and wired them into src/llm/registry (default + override OpenAI clients) and src/embedding_client so CF AI Gateway routing survives the refactor.
- Ported thought_signature extraction into OpenAIBackend and replay into OpenAIHistoryAdapter so Gemini thinking models via the CF OpenAI-compat route can do multi-turn tool loops without 400ing.
- Dropped fork's DEDUCTION_PROVIDER / INDUCTION_PROVIDER and matching THINKING_BUDGET_TOKENS fields; upstream's per-specialist DEDUCTION_MODEL_CONFIG / INDUCTION_MODEL_CONFIG (full ConfiguredModelSettings) is a strict superset.
- Kept fork's traefik+prometheus+grafana docker-compose stack; kept upstream's broader docker/ COPY in the Dockerfile.
basedpyright with reportMissingTypeArgument rejected the bare `dict` types in the mock fake_post used by the SDK message-batching test, failing Static Analysis on PR #3. Add `dict[str, Any]` annotations and an explicit return type so CI stays green.
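The annotation change looks roughly like this; `fake_post` here is a simplified stand-in for the test's mock, not the real one:

```python
from typing import Any


# A bare `dict` parameter trips basedpyright's reportMissingTypeArgument rule;
# parameterizing it (and annotating the return type) keeps Static Analysis green.
def fake_post(url: str, json: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the SDK message-batching test's mock."""
    return {"url": url, "received": len(json.get("messages", []))}
```
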
basedpyright's default exit code is non-zero whenever any diagnostics are reported, so the 8 warnings introduced by the fork-only commits were failing the Static Analysis job on PR #3 even though there were no errors.

- src/deriver/queue_manager.py: drop the `item.created_at is not None` guards. created_at is `Mapped[datetime.datetime]` (non-nullable), so the checks were always True and basedpyright flagged them as reportUnnecessaryComparison.
- tests/sdk/test_session.py: factor out the shared mock-response body into a single helper and give the per-branch closures distinct names. This clears reportRedeclaration on `calls` / `fake_post` and lets the `# pyright: ignore` comments target the actual warning (reportPrivateUsage on `_http` / `_async_http_client`) instead of the irrelevant reportAttributeAccessIssue that was flagged as an unnecessary ignore.
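The queue_manager change can be illustrated with a stand-in model; the dataclass and function here are hypothetical, only the non-nullable `created_at` mirrors the commit:

```python
import datetime
from dataclasses import dataclass


@dataclass
class QueueItem:
    """Stand-in for the ORM model; created_at is non-nullable."""
    created_at: datetime.datetime


def item_age_seconds(item: QueueItem, now: datetime.datetime) -> float:
    # Before the fix this arithmetic sat behind `if item.created_at is not None:`.
    # For a non-nullable column that guard is always True, which basedpyright
    # reports as reportUnnecessaryComparison, so the guard was dropped.
    return (now - item.created_at).total_seconds()
```
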
Summary
This PR adds comprehensive Prometheus metrics and Grafana dashboards for monitoring queue health, session activity, and API performance. It also includes infrastructure improvements for local development with Traefik routing and message batching optimizations.
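The message-batching optimization amounts to chunking oversized lists before sending; a minimal sketch (only the `MAX_MESSAGES_PER_BATCH = 100` constant is from the PR, the helper is illustrative):

```python
MAX_MESSAGES_PER_BATCH = 100  # constant this PR adds to the SDK


def batch_messages(messages: list[dict]) -> list[list[dict]]:
    """Sketch of the batching add_messages() now performs: split a large
    message list into chunks no bigger than the per-request limit."""
    return [
        messages[i : i + MAX_MESSAGES_PER_BATCH]
        for i in range(0, len(messages), MAX_MESSAGES_PER_BATCH)
    ]
```
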
Key Changes
Observability & Metrics
New Prometheus metrics for queue monitoring:
- `deriver_queue_depth` - Queue depth by workspace, task type, and state (pending/in_progress)
- `deriver_queue_oldest_age_seconds` - Age of the oldest pending/in_progress items
- `deriver_queue_error_backlog` - Count of errored items retained in the queue
- `deriver_queue_errors_total` - Total queue processing errors
- `deriver_queue_item_latency_seconds` - Histogram of item latency from enqueue to terminal state
- `deriver_active_workers` - Current active worker count
- `api_request_duration_seconds` - API request latency histogram
- `sessions_active`, `session_last_message_age_seconds`, `session_queue_depth`, `session_queue_oldest_age_seconds`
- `deriver_queue_items_enqueued`, `session_context_requests`, `session_search_requests`

Queue health refresh loop in `QueueManager.refresh_queue_health_metrics()`
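The refresh loop's general shape is a periodic poll-and-set; this sketch is an assumption about the pattern, not the QueueManager's actual implementation (it is bounded here only so the example terminates):

```python
import asyncio
from typing import Awaitable, Callable


async def refresh_queue_health_metrics(
    get_depth: Callable[[], Awaitable[int]],
    set_gauge: Callable[[int], None],
    interval_s: float,
    max_iterations: int,
) -> None:
    """Hypothetical sketch: poll queue depth and update a gauge on an interval."""
    for _ in range(max_iterations):
        set_gauge(await get_depth())  # e.g. feed deriver_queue_depth
        await asyncio.sleep(interval_s)
```
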
Two new Grafana dashboards:

- `honcho-overview.json` - High-level system metrics (API requests, throughput, message creation)
- `honcho-queue-health.json` - Detailed queue monitoring (depth, latency, error rates, worker status)

Infrastructure & Development
Traefik reverse proxy integration:
- `traefik` service added to docker-compose for request routing
- `docker/traefik/dynamic.yml` routing configuration

Docker improvements:
SDK & API Enhancements
Message batching in Python SDK:
- `MAX_MESSAGES_PER_BATCH = 100` constant
- `add_messages()` methods now batch large message lists

Metrics recording in API routes:
- `record_api_request_duration()` for latency tracking
- `record_messages_created()` now includes a `session_name` label

Configuration & Documentation
Implementation Details
- `NamespacedGauge` and `NamespacedHistogram` classes to automatically inject namespace labels

https://claude.ai/code/session_01RUjFSXFxCVzym2GV5Ydkx9
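The namespace-injection idea can be sketched with a stdlib-only stand-in; the real classes presumably wrap prometheus_client metrics, and this structure is an assumption:

```python
class NamespacedGauge:
    """Hypothetical sketch of a gauge that injects a namespace label.

    Callers set values with their own labels; the namespace label is merged
    in automatically so no call site can forget it.
    """

    def __init__(self, name: str, namespace: str) -> None:
        self.name = name
        self.namespace = namespace
        self.samples: dict[tuple[tuple[str, str], ...], float] = {}

    def set(self, value: float, **labels: str) -> None:
        # Merge the namespace label into whatever the caller passed.
        merged = {"namespace": self.namespace, **labels}
        self.samples[tuple(sorted(merged.items()))] = value
```
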