
feat(otel): instrument runtime with GenAI semantic conventions #2620

Draft
tdabasinskas wants to merge 14 commits into docker:main from cogvel:feat/otel-genai-semconv

Conversation


@tdabasinskas tdabasinskas commented May 4, 2026

Contributor

Adds end-to-end OpenTelemetry instrumentation following the GenAI semantic conventions:

  • Provider-layer chat/embeddings/rerank CLIENT spans with gen_ai.* attributes and the gen_ai.client.token.usage / operation.duration histograms.
  • Runtime spans (runtime.session, runtime.stream, runtime.fallback, runtime.tool.call, runtime.run_skill, runtime.task_transfer, runtime.handoff, background_agent.run).
  • MCP client + server spans with params._meta propagation, plus OAuth flow spans.
  • A2A endpoints wrapped with otelhttp and marked as invoke_agent.
  • Hook executor span with verdict/decision/reason annotation; subprocess trace context propagation for hooks, LSP servers, and sandbox docker exec.
  • Memory, RAG, sessiontitle, evaluation, and Anthropic-specific spans.
  • Built-in tool internals (shell, filesystem, fetch, lsp, codemode, ...) surface their work as span attributes.
  • W3C trace context + baggage propagation across all HTTP servers and clients.
  • Standard OTel resource attributes (service.*, host.*, process.*, os.type)
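
As a rough illustration of the subprocess trace context propagation mentioned above, a parent process can render its current span as a W3C `traceparent` value and hand it to the child through the environment. This is only a sketch: the `TRACEPARENT` variable name and both helper functions are assumptions made for the example, not the mechanism this PR necessarily uses.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// traceparent renders a W3C Trace Context value:
// version "00", 16-byte trace ID, 8-byte parent span ID, 1-byte flags.
func traceparent(traceID [16]byte, spanID [8]byte, sampled bool) string {
	flags := byte(0)
	if sampled {
		flags = 1
	}
	return fmt.Sprintf("00-%x-%x-%02x", traceID, spanID, flags)
}

// withTraceContext returns a copy of env with TRACEPARENT appended.
// The variable name is an assumption for this sketch.
func withTraceContext(env []string, tp string) []string {
	out := append([]string{}, env...)
	return append(out, "TRACEPARENT="+tp)
}

func main() {
	traceID := [16]byte{0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36}
	spanID := [8]byte{0x00, 0xf0, 0x67, 0xaa, 0x0b, 0xa9, 0x02, 0xb7}

	// The command is only configured here, never started.
	cmd := exec.Command("env")
	cmd.Env = withTraceContext(os.Environ(), traceparent(traceID, spanID, true))
	fmt.Println(cmd.Env[len(cmd.Env)-1])
}
```

A child that understands the same convention can parse the value and continue the trace instead of starting a new root span.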

This PR wires two opt-in env vars beyond the default OTel SDK ones:

  • OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT — capture prompts, responses, tool arguments and tool results as span attributes. Off by default (PII surface).
  • OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental — emit only the spec-defined gen_ai.* keys. Default is dual-emit (both gen_ai.* and the legacy tool.name / agent / session.id keys), so existing dashboards keep working alongside spec-aware tooling.
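
The dual-emit default can be pictured with a small sketch. It assumes the opt-in variable is parsed as a comma-separated list (the usual OTel convention) and uses hypothetical helpers `specOnly` and `toolAttrs`; only the `tool.name` legacy key and the opt-in value come from the description above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const latestExperimental = "gen_ai_latest_experimental"

// specOnly reports whether the comma-separated opt-in value
// selects spec-only emission.
func specOnly(optIn string) bool {
	for _, v := range strings.Split(optIn, ",") {
		if strings.TrimSpace(v) == latestExperimental {
			return true
		}
	}
	return false
}

// toolAttrs returns the attributes for a tool call. By default it
// dual-emits the spec key and the legacy key; with the opt-in set,
// only the spec key is produced.
func toolAttrs(optIn, toolName string) map[string]string {
	attrs := map[string]string{"gen_ai.tool.name": toolName}
	if !specOnly(optIn) {
		attrs["tool.name"] = toolName // legacy key kept for existing dashboards
	}
	return attrs
}

func main() {
	fmt.Println(toolAttrs(os.Getenv("OTEL_SEMCONV_STABILITY_OPT_IN"), "shell"))
}
```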

The diff is large — ~50 files, ~5k lines. It's split into 10 topical commits (telemetry primitives → SDK init → providers → runtime → hooks → MCP → A2A → servers/cold-start → memory/RAG → tool internals) so each commit is independently reviewable. Most of the volume is in the new pkg/telemetry/genai/ and pkg/telemetry/mcp/ packages, which are pure helpers; the surface-area changes elsewhere are 1-3 lines per call site.


@tdabasinskas tdabasinskas requested a review from a team as a code owner May 4, 2026 07:49
@tdabasinskas tdabasinskas mentioned this pull request May 4, 2026
@tdabasinskas tdabasinskas marked this pull request as draft May 4, 2026 07:58
@tdabasinskas tdabasinskas marked this pull request as ready for review May 4, 2026 08:52
@tdabasinskas tdabasinskas force-pushed the feat/otel-genai-semconv branch from fa4a01d to 2a69313 Compare May 4, 2026 11:16

dgageot commented May 4, 2026

Member

@tdabasinskas not sure why, GitHub doesn't want to merge this one, because of hypothetical merge conflicts. Could you rebase?

@tdabasinskas tdabasinskas force-pushed the feat/otel-genai-semconv branch from 2a69313 to 9b08feb Compare May 4, 2026 18:40
@tdabasinskas

Contributor Author

> @tdabasinskas not sure why, GitHub doesn't want to merge this one, because of hypothetical merge conflicts. Could you rebase?

Done!

@tdabasinskas tdabasinskas force-pushed the feat/otel-genai-semconv branch 2 times, most recently from e7194da to b6a181b Compare May 5, 2026 08:02
@tdabasinskas tdabasinskas marked this pull request as draft May 5, 2026 12:26
@tdabasinskas tdabasinskas marked this pull request as ready for review May 5, 2026 13:31
@aheritier

Contributor

/review


tdabasinskas commented May 6, 2026

Contributor Author

> /review

I don't think that worked 😅

@aheritier

Contributor

/review

aheritier previously approved these changes May 6, 2026

@aheritier aheritier left a comment

Contributor

LGTM. Clean design, solid thread safety, good spec adherence. The inline comments are all non-blocking suggestions for follow-up.

Comment thread pkg/telemetry/genai/errors.go
Comment thread pkg/telemetry/genai/span.go
Comment thread pkg/telemetry/genai/metrics.go

docker-agent Bot commented May 6, 2026

PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

@aheritier aheritier added kind/feat, area/agent, priority:medium labels May 6, 2026
@tdabasinskas tdabasinskas requested a review from aheritier May 7, 2026 07:57

dgageot commented May 7, 2026

Member

@tdabasinskas can you rebase one more time and I'll review it?

@tdabasinskas

Contributor Author

> @tdabasinskas can you rebase one more time and I'll review it?

Done!

aheritier previously approved these changes May 7, 2026

@aheritier aheritier left a comment

Contributor

Re-approving — my prior approval was dismissed by the merge of upstream/main into the branch, but there are zero new author code changes since a4ce95e8. All three of my previous comments were addressed and the threads are resolved. CI is green on the merge commit.

Original assessment stands: clean design, solid thread safety, good GenAI semconv adherence. LGTM.

@aheritier aheritier added effort:large, go labels May 7, 2026
Every toolset goes through tools.WithName in the team-loader
registry, which sandwiches a *tools.namedToolSet between the
StartableToolSet and the actual implementation. %T on the
embedded ToolSet therefore always reported *tools.namedToolSet
regardless of whether the inner toolset was MCP, A2A, a builtin,
or anything else - so the attribute could never answer the
question it exists to answer ("which kind of toolset is starting
right now?").

Unwrap once before formatting, mirroring what DescribeToolSet
already does for the same reason. Now the attribute reads
*mcp.Toolset, *builtin.ShellTool, etc., so a toolset.start
without HTTP children is immediately distinguishable from a
remote MCP whose POSTs are missing for some other reason.
Record tool counts at two key points in the execution flow:

- Session span: total tools available after exclusion filters
- MCP list span: tools successfully yielded by each server

These attributes enable quick analysis of tool availability without inspecting nested spans or JSON-RPC payloads. The MCP count preserves partial results when iteration terminates early.
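
One way to keep a partial count when iteration stops early is to record it in a deferred function. The sketch below uses a plain callback instead of the real MCP listing code, and the `mcp.tools.count` key is a made-up name for illustration. It counts every tool handed to the consumer, including the one on which the consumer stops.

```go
package main

import "fmt"

// listTools hands each tool to yield and stops when yield returns false.
// The deferred write records how many tools were handed over, even when
// the consumer terminates the iteration early.
func listTools(tools []string, attrs map[string]int, yield func(string) bool) {
	count := 0
	defer func() { attrs["mcp.tools.count"] = count }() // hypothetical attribute key
	for _, t := range tools {
		count++
		if !yield(t) {
			return
		}
	}
}

func main() {
	attrs := map[string]int{}
	seen := 0
	// The consumer stops after receiving its second tool.
	listTools([]string{"read", "write", "search"}, attrs, func(string) bool {
		seen++
		return seen < 2
	})
	fmt.Println(attrs["mcp.tools.count"])
}
```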
…errors

Introduce a `classifyByStatusCode` helper that probes for an HTTP status code via a `StatusCode() int` method before falling back to substring matching. This prevents false positives when error messages incidentally contain strings like "401", "403", or "429" in request IDs, byte counts, or status-line fragments.

Providers that expose HTTP status codes through a structured interface now get classified from the structural signal, while text-only errors continue to use the existing heuristic.

Also add documentation clarifying that `getInstruments` binds to the global MeterProvider on first call via `sync.Once`, which affects test setup requirements.
@tdabasinskas tdabasinskas force-pushed the feat/otel-genai-semconv branch from b43ca96 to 79bc9eb Compare May 26, 2026 11:11
@tdabasinskas tdabasinskas marked this pull request as ready for review May 26, 2026 11:11
@aheritier aheritier added status/needs-rebase and removed status/needs-rebase labels May 26, 2026
@aheritier aheritier marked this pull request as draft May 27, 2026 06:31
@aheritier aheritier added area/api, area/mcp labels May 27, 2026
@aheritier aheritier added area/providers, area/sessions, area/skills, area/tools, area/config, area/cli, area/rag, area/mcp and removed area/config, area/tools, area/api, area/rag, area/sessions, area/cli, area/mcp, area/skills labels Jun 3, 2026
Labels

area/agent: For work that has to do with the general agent loop/agentic features of the app
area/cli: CLI commands, flags, output formatting
area/mcp: MCP protocol, MCP tool servers, integration
area/providers: For features/issues/fixes related to LLM providers (Bedrock, LiteLLM, Qwen, custom, etc.)
area/rag: For work/issues that have to do with the RAG features
area/sessions: For features/issues/fixes related to session lifecycle (resume, persistence, export)
area/tools: For features/issues/fixes related to the usage of built-in and MCP tools
kind/feat: PR adds a new feature (maps to feat: commit prefix)
status/needs-design: Requires architectural discussion or design review
status/needs-rebase: PR has merge conflicts or is out of date with main

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OTEL, again

5 participants