Add comprehensive queue health monitoring and observability#3

Merged
offendingcommit merged 11 commits into upstream-sync from claude/resolve-pr-conflicts-J6GYN on Apr 25, 2026

Conversation

@offendingcommit (Owner)

Summary

This PR adds comprehensive Prometheus metrics and Grafana dashboards for monitoring queue health, session activity, and API performance. It also includes infrastructure improvements for local development with Traefik routing and message batching optimizations.

Key Changes

Observability & Metrics

  • New Prometheus metrics for queue monitoring:

    • deriver_queue_depth - Queue depth by workspace, task type, and state (pending/in_progress)
    • deriver_queue_oldest_age_seconds - Age of oldest pending/in_progress items
    • deriver_queue_error_backlog - Count of errored items retained in queue
    • deriver_queue_errors_total - Total queue processing errors
    • deriver_queue_item_latency_seconds - Histogram of item latency from enqueue to terminal state
    • deriver_active_workers - Current active worker count
    • api_request_duration_seconds - API request latency histogram
    • Session metrics: sessions_active, session_last_message_age_seconds, session_queue_depth, session_queue_oldest_age_seconds
    • Additional counters: deriver_queue_items_enqueued, session_context_requests, session_search_requests
  • Queue health refresh loop in QueueManager.refresh_queue_health_metrics():

    • Runs on a 5-second interval to collect queue statistics
    • Tracks pending, in-progress, and errored items by workspace and task type
    • Calculates oldest item ages for SLA monitoring
    • Maintains label sets to ensure metrics are properly cleaned up
  • Two new Grafana dashboards:

    • honcho-overview.json - High-level system metrics (API requests, throughput, message creation)
    • honcho-queue-health.json - Detailed queue monitoring (depth, latency, error rates, worker status)
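The refresh loop's bookkeeping (per-label gauge updates plus cleanup of label sets that disappear) can be sketched roughly as below. `LabelledGauge`, `refresh_queue_health`, and the row shape are illustrative stand-ins, not Honcho's actual classes; in the real code the gauges are prometheus_client metrics:

```python
class LabelledGauge:
    """Minimal stand-in for a labelled Prometheus gauge: a dict keyed by
    label tuples is enough to show the bookkeeping pattern."""

    def __init__(self) -> None:
        self.values: dict[tuple[str, str, str], float] = {}

    def set(self, labels: tuple[str, str, str], value: float) -> None:
        self.values[labels] = value

    def remove(self, labels: tuple[str, str, str]) -> None:
        self.values.pop(labels, None)


def refresh_queue_health(rows, depth, oldest_age, seen, now):
    """One refresh pass, as run on the ~5-second interval.

    rows: (workspace, task_type, state, count, oldest_enqueued_at) tuples.
    Returns the label sets seen this pass; anything in `seen` but absent
    now is removed, so deleted workspaces don't leave stale series behind.
    """
    current: set[tuple[str, str, str]] = set()
    for workspace, task_type, state, count, oldest_ts in rows:
        labels = (workspace, task_type, state)
        current.add(labels)
        depth.set(labels, count)                 # deriver_queue_depth
        oldest_age.set(labels, now - oldest_ts)  # deriver_queue_oldest_age_seconds
    for stale in seen - current:
        depth.remove(stale)
        oldest_age.remove(stale)
    return current
```

The returned label set is carried into the next pass, which is what lets the loop prune series for removed workspaces instead of letting gauges report stale values forever.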

Infrastructure & Development

  • Traefik reverse proxy integration:

    • Added traefik service to docker-compose for request routing
    • Dynamic routing configuration in docker/traefik/dynamic.yml
    • Handles path-based routing, middleware, and health checks
    • Exposes dashboard on port 8080
  • Docker improvements:

    • Made entrypoint script executable in Dockerfile
    • Simplified entrypoint to use system Python instead of venv
    • Updated docker-compose to use Traefik for port binding

SDK & API Enhancements

  • Message batching in Python SDK:

    • Added MAX_MESSAGES_PER_BATCH = 100 constant
    • Both sync and async add_messages() methods now batch large message lists
    • Prevents overwhelming the API with single large requests
  • Metrics recording in API routes:

    • record_api_request_duration() for latency tracking
    • record_messages_created() now includes session_name label
    • Queue enqueue operations record metrics
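The SDK batching described above can be sketched as follows; `post`, the path, and the payload shape are hypothetical stand-ins for the SDK's HTTP call, while `MAX_MESSAGES_PER_BATCH` mirrors the constant named in this PR:

```python
from typing import Any, Callable, Iterator

MAX_MESSAGES_PER_BATCH = 100  # mirrors the SDK constant added in this PR


def chunked(items: list[Any], size: int) -> Iterator[list[Any]]:
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_messages(
    post: Callable[[str, dict[str, Any]], Any],
    session_id: str,
    messages: list[dict[str, Any]],
) -> int:
    """Split a large message list into batches so no single request
    carries more than MAX_MESSAGES_PER_BATCH items."""
    sent = 0
    for batch in chunked(messages, MAX_MESSAGES_PER_BATCH):
        post(f"/sessions/{session_id}/messages", {"messages": batch})
        sent += len(batch)
    return sent
```

A 250-message list thus becomes three requests of 100, 100, and 50 messages, rather than one oversized call.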

Configuration & Documentation

  • LLM provider documentation in CLAUDE.md with gotchas learned from k8s deployment
  • CF AI Gateway support for routing OpenAI traffic through Cloudflare proxy
  • Gitignore updates for Codex runtime directories

Implementation Details

  • Metrics use NamespacedGauge and NamespacedHistogram classes to automatically inject namespace labels
  • Queue metrics refresh is throttled to 5-second intervals to avoid excessive database queries
  • Metric label sets are tracked to ensure proper cleanup when workspaces/sessions are removed
  • Grafana dashboards use Prometheus queries with appropriate thresholds for alerting (yellow/red states)
  • Traefik configuration includes middleware for path rewriting and documentation routing
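The namespace-injection pattern behind `NamespacedGauge` / `NamespacedHistogram` can be sketched as below; `FakeMetric` is a test stand-in for a prometheus_client metric, and only the injection mechanism (not Honcho's real class) is shown:

```python
class FakeMetric:
    """Records labels() calls; stands in for a prometheus_client Gauge."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, ...]] = []
        self.value: float | None = None

    def labels(self, *values: str) -> "FakeMetric":
        self.calls.append(values)
        return self

    def set(self, value: float) -> None:
        self.value = value


class NamespacedGauge:
    """Prepend a fixed namespace label on every labels() call so call
    sites never have to pass it themselves."""

    def __init__(self, inner: FakeMetric, namespace: str) -> None:
        self._inner = inner
        self._namespace = namespace

    def labels(self, *values: str) -> FakeMetric:
        return self._inner.labels(self._namespace, *values)
```

With this wrapper, `NamespacedGauge(metric, "honcho").labels("w1", "pending").set(7)` records the label tuple `("honcho", "w1", "pending")` without the call site ever mentioning the namespace.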

https://claude.ai/code/session_01RUjFSXFxCVzym2GV5Ydkx9

offendingcommit and others added 11 commits on April 3, 2026 at 17:48
The surprisal observation fetch passed a list directly as the filter
value ({"level": [...]}), which generated invalid SQL (level = ARRAY)
instead of level IN (...). Use the {"in": [...]} operator syntax.
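The bug pattern this commit fixes can be illustrated with a toy filter renderer; `render_condition` is not Honcho's actual query builder, only a sketch of why a bare list needs the explicit `{"in": [...]}` operator:

```python
def render_condition(column: str, value) -> str:
    """Toy filter-to-SQL renderer: a dict with an "in" key becomes an IN
    clause, while any other value is rendered as a naive equality, which
    is exactly what produced `level = ARRAY` for a bare list."""
    if isinstance(value, dict) and "in" in value:
        placeholders = ", ".join(repr(v) for v in value["in"])
        return f"{column} IN ({placeholders})"
    return f"{column} = {value!r}"
```

So `render_condition("level", [1, 2])` yields the invalid `level = [1, 2]`, while the fixed call shape `render_condition("level", {"in": [1, 2]})` yields `level IN (1, 2)`.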
… and Grafana dashboards

Add Docker infrastructure for local development with LM Studio as LLM provider,
Prometheus metrics collection with custom histograms, Traefik reverse proxy
configuration, and Grafana dashboard provisioning. Update SDK session handling
and deriver queue management for improved reliability.
…hinking models

- Add `cf` provider (Cloudflare AI Gateway) to SupportedProviders and initialize
  AsyncOpenAI client pointed at CF_GATEWAY_BASE_URL
- Route OpenAI embeddings through CF Gateway when LLM_OPENAI_BASE_URL is set
- Convert tools to OpenAI format for `cf` provider (was missing from provider list)
- Extract thought_signature from OpenAI-compat tool call responses and re-include
  it when formatting assistant messages for multi-turn replay — fixes 400
  INVALID_ARGUMENT from Gemini thinking models via CF Gateway
- Preserve thought_signature in _format_assistant_tool_message else branch
- Increase DERIVER_MAX_INPUT_TOKENS upper bound (23000 → 200000) to allow
  higher limits via config
When CF_GATEWAY_AUTH_TOKEN is set, inject cf-aig-authorization header
into the custom client so CF Gateway-proxied custom providers (e.g.
custom-ollama) authenticate correctly at the gateway layer.
…rides

Adds DEDUCTION_PROVIDER/INDUCTION_PROVIDER and matching THINKING_BUDGET_TOKENS
settings so deduction and induction specialists can route to a different
provider than the main DREAM config. Also propagates thinking_budget_tokens
into the LLM call and documents the CF gateway / Gemini thought_signature
gotchas in CLAUDE.md.
Allows deployments (e.g. the infra chart) to configure CORS origins
via a comma-separated CORS_ORIGINS env var instead of relying on the
hardcoded list. Falls back to the previous defaults when unset.
Resolve conflicts between fork-only commits (CF Gateway auth, Gemini
thought_signature fix, LM Studio/Prometheus/Traefik stack, dreamer
specialist overrides) and upstream's new src/llm/ transport-based
abstraction that replaces src/utils/clients.py.

Port decisions:
- Dropped fork's cf / custom / vllm / groq providers — superseded by the
  new ModelConfig base_url/api_key override mechanism.
- Kept OPENAI_BASE_URL and CF_GATEWAY_AUTH_TOKEN on LLMSettings and wired
  them into src/llm/registry (default + override OpenAI clients) and
  src/embedding_client so CF AI Gateway routing survives the refactor.
- Ported thought_signature extraction into OpenAIBackend and replay into
  OpenAIHistoryAdapter so Gemini thinking models via the CF OpenAI-compat
  route can do multi-turn tool loops without 400ing.
- Dropped fork's DEDUCTION_PROVIDER / INDUCTION_PROVIDER and matching
  THINKING_BUDGET_TOKENS fields — upstream's per-specialist
  DEDUCTION_MODEL_CONFIG / INDUCTION_MODEL_CONFIG (full
  ConfiguredModelSettings) is a strict superset.
- Kept fork's traefik+prometheus+grafana docker-compose stack; kept
  upstream's broader docker/ COPY in the Dockerfile.
basedpyright with reportMissingTypeArgument rejected the bare `dict`
types in the mock fake_post used by the SDK message-batching test,
failing Static Analysis on PR #3. Add `dict[str, Any]` annotations and
an explicit return type so CI stays green.
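The shape of the annotation fix looks roughly like this; the mock's path and payload shapes are illustrative, not the actual test body:

```python
from typing import Any

# basedpyright's reportMissingTypeArgument rejects bare `dict` in
# annotations, so the mock gets explicit type parameters plus an
# explicit return type.
def fake_post(path: str, json: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for the SDK's HTTP POST used in the batching test."""
    return {"path": path, "count": len(json.get("messages", []))}
```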
basedpyright's default exit code is non-zero whenever any diagnostics
are reported, so the 8 warnings introduced by the fork-only commits
were failing the Static Analysis job on PR #3 even though there were
no errors.

- src/deriver/queue_manager.py: drop `item.created_at is not None`
  guards. created_at is `Mapped[datetime.datetime]` (non-nullable), so
  the checks were always True and basedpyright flagged them as
  reportUnnecessaryComparison.
- tests/sdk/test_session.py: factor out the shared mock-response body
  into a single helper and give the per-branch closures distinct names.
  This clears reportRedeclaration on `calls` / `fake_post` and lets the
  `# pyright: ignore` comments target the actual warning
  (reportPrivateUsage on `_http` / `_async_http_client`) instead of the
  irrelevant reportAttributeAccessIssue that was flagged as an
  unnecessary ignore.
@offendingcommit merged commit 2accbbe into upstream-sync on Apr 25, 2026
1 check passed