Add OTLP job log export#3979

Open
catkins wants to merge 3 commits into main from catkins/codex/otlp-job-logs
@catkins catkins commented Jun 5, 2026

Contributor

Description

Adds an opt-in agent-side job log OTLP sink. When --job-logs-otlp / BUILDKITE_JOB_LOGS_OTLP is enabled, the agent emits job output as OpenTelemetry log records using the existing OTLP exporter environment configuration instead of encoding timestamps into ANSI/OSC job-log output.

The OTLP endpoint and transport are intentionally inherited from the OpenTelemetry exporter configuration rather than adding Buildkite-specific endpoint flags. For logs, set OTEL_EXPORTER_OTLP_LOGS_ENDPOINT for a log-specific endpoint or OTEL_EXPORTER_OTLP_ENDPOINT for the generic endpoint; protocol selection follows OTEL_EXPORTER_OTLP_LOGS_PROTOCOL / OTEL_EXPORTER_OTLP_PROTOCOL.
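
As a rough illustration of the precedence described above, the following Go sketch resolves a logs setting by preferring the logs-specific variable and falling back to the generic one. The helper name `resolveLogsSetting` and its injected lookup function are assumptions made for this example; the OpenTelemetry SDK performs its own resolution internally, and this is not the code in this PR.

```go
package main

import (
	"fmt"
	"os"
)

// resolveLogsSetting returns the logs-specific value when it is set to a
// non-empty string, otherwise the generic value. The lookup function is
// injected so the precedence can be exercised without touching the real
// process environment. Hypothetical helper, for illustration only.
func resolveLogsSetting(lookup func(string) string, specific, generic string) string {
	if v := lookup(specific); v != "" {
		return v
	}
	return lookup(generic)
}

func main() {
	endpoint := resolveLogsSetting(os.Getenv,
		"OTEL_EXPORTER_OTLP_LOGS_ENDPOINT", "OTEL_EXPORTER_OTLP_ENDPOINT")
	protocol := resolveLogsSetting(os.Getenv,
		"OTEL_EXPORTER_OTLP_LOGS_PROTOCOL", "OTEL_EXPORTER_OTLP_PROTOCOL")
	fmt.Printf("logs endpoint=%q protocol=%q\n", endpoint, protocol)
}
```

The sketch only shows which variable wins; it does not model how an exporter interprets the chosen endpoint.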

The log records carry native OTLP timestamps plus a small set of Buildkite attributes for correlation: organization, pipeline, branch, queue, agent, build, job, current phase, and current hook scope/plugin where known. Trace correlation uses the native OTLP LogRecord trace context fields rather than duplicating trace_id / span_id as log attributes.

When OpenTelemetry tracing is enabled, OTLP job logs are emitted from the bootstrap process around the actual hook/command execution contexts. That means hook output records carry the same native OTLP TraceId and SpanId as the corresponding exported hook span. When tracing is disabled, the parent agent process still emits OTLP logs, but they remain intentionally uncorrelated. --tracing-propagate-traceparent only controls accepting the Buildkite control-plane traceparent; it is not required for local agent trace/log correlation.

This spike emits visible child process stdout/stderr as OTLP log records. Bootstrap-generated control lines such as prompts, section headers, and warnings remain in the normal Buildkite job log stream and are not separately emitted as correlated OTLP records in this PR.

Relevant OpenTelemetry references:

Context

Slack context: https://buildkite-corp.slack.com/archives/C05R3MTRK38/p1780282420246339

This is a spike for moving high-rate job-log consumption toward agent-side OTLP log export, using the existing OpenTelemetry SDK plumbing in the agent.

Changes

  • Adds --job-logs-otlp and BUILDKITE_JOB_LOGS_OTLP to buildkite-agent start.
  • Adds an OTLP job log writer that line-buffers process output and emits OpenTelemetry LogRecords with native timestamps.
  • Uses the existing OpenTelemetry exporter environment configuration for endpoint/protocol selection instead of adding Buildkite-specific endpoint flags.
  • Wires untraced/logs-only OTLP output in the parent job runner, before ANSI/timestamp wrapping, so OTLP output keeps structured timestamps rather than timestamp-encoded log bodies.
  • Wires traced OTLP output inside bootstrap command execution so hook and command output records carry the active hook/command span context.
  • Adds Buildkite log attributes using the buildkite.* naming requested for agent/build/job fields.
  • Emits trace/span correlation through native OTLP LogRecord fields, not duplicated log attributes.
  • Keeps Buildkite control-plane traceparent acceptance behind the existing --tracing-propagate-traceparent opt-in.
  • Detaches stored OTLP emit contexts from command cancellation while preserving trace context, so cleanup flushes do not inherit a canceled command context.

CLI help excerpt:

--job-log-path value                            Location to store job logs created by configuring `enable-job-log-tmpfile`, by default job log will be stored in TempDir [$BUILDKITE_JOB_LOG_PATH]
--write-job-logs-to-stdout                      Writes job logs to the agent process' stdout. This simplifies log collection if running agents in Docker (default: false) [$BUILDKITE_WRITE_JOB_LOGS_TO_STDOUT]
--job-logs-otlp                                 Export job logs directly as OpenTelemetry log records using the OTEL_EXPORTER_OTLP_LOGS_* / OTEL_EXPORTER_OTLP_* environment configuration (default: false) [$BUILDKITE_JOB_LOGS_OTLP]
--shell value                                   The shell command used to interpret build commands, e.g /bin/bash -e -c (default: "/bin/bash -e -c") [$BUILDKITE_SHELL]

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Additional verification:

  • mise exec -- go test ./... -count=1
  • mise exec -- go test ./agent ./clicommand ./internal/job ./internal/shell -count=1
  • mise exec -- go build -o /tmp/buildkite-agent-otlp-e2e .
  • git diff --check
  • Local Buildkite + ClickStack matrix verified all eight combinations of tracing/logging enabled/disabled with --tracing-propagate-traceparent on/off after rebasing on origin/main in run run20260607174652.
  • ClickStack span-correlation checks verified OTLP_E2E_PRE_COMMAND_HOOK log records have the same TraceId and SpanId as the exported agent pre-command hook span for both propagation off and propagation on:
    • propagation off: trace_matches=1, span_matches=1, TraceId 01663347ec037b04513db740258198d9, SpanId b09240dd1888a198
    • propagation on: trace_matches=1, span_matches=1, TraceId 26addc8b00cb1bee25c3c11bf4983936, SpanId 9b67d731105962a6

Note: earlier local test runs hit known-looking flakes (agent TestAgentWorker_UpdateEndpointDuringPing_FailAndRevert and internal/shell TestContextCancelInterrupts/SIGINT); both passed on rerun, and the final full mise exec -- go test ./... -count=1 succeeded.

Affiliation (optional, external contributors)

Buildkite.

Disclosures / Credits

Codex assisted with implementation, adversarial review, local verification, and drafting this PR description.

socket-security Bot commented Jun 5, 2026

@catkins catkins added the feature New user-facing feature! label Jun 5, 2026
@catkins catkins force-pushed the catkins/codex/otlp-job-logs branch 2 times, most recently from 5833518 to 60aa8fe June 6, 2026 00:56
@catkins catkins marked this pull request as ready for review June 6, 2026 00:59
@catkins catkins requested review from a team as code owners June 6, 2026 00:59

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60aa8fe29d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agent/otlp_job_logger.go Outdated
@catkins catkins force-pushed the catkins/codex/otlp-job-logs branch from 60aa8fe to 2566837 June 6, 2026 02:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2566837108

Comment thread agent/otlp_job_logger.go
	semconv.DeploymentEnvironmentKey.String("ci"),
)
provider := sdklog.NewLoggerProvider(
	sdklog.WithProcessor(sdklog.NewBatchProcessor(exporter)),

P2: Avoid dropping OTLP records under bursty output

When a job emits more log lines than sdklog.NewBatchProcessor can queue before the collector accepts them, such as a verbose test suite or a slow/unreachable OTLP endpoint, the batch processor drops excess records instead of back-pressuring the writer. Since this is the only OTLP export path here and Emit cannot report the loss back to Write, the exported job log can be silently incomplete for normal CI bursts; use a blocking/synchronous processor or explicitly configure the queue/export behavior so job-log lines are not lost by default.

@catkins catkins force-pushed the catkins/codex/otlp-job-logs branch from 25d9724 to a4d79b0 June 7, 2026 06:50

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4d79b0c8d

line := append(w.buf, data[:i]...)
line = bytes.TrimSuffix(line, []byte{'\r'})
w.emit(string(line))

P1: Redact before emitting OTLP logs

When --job-logs-otlp is used with tracing-backend=opentelemetry, this writer is installed inside Executor.Run after setupRedactors, so w.out.Write(data) sends the bytes through the normal redactor but w.emit(string(line)) exports the original pre-redaction bytes to OTLP. Any secret masked from the visible Buildkite log via RedactedVars, secrets, or Job API redaction will still be sent in clear text to the OTLP collector; emit from the redacted stream or put the OTLP writer behind the same redactor.

Comment thread internal/job/executor.go Outdated
Comment on lines +1323 to +1325
"buildkite.phase": "hook",
"buildkite.hook.name": "command",
"buildkite.hook.scope": "default",

P2: Mark command output as command phase

For OpenTelemetry-backed job-log export, normal command output goes through this attributes map but is tagged buildkite.phase=hook with buildkite.hook.name=command. This hard-coded exporter path is separate from the one raised in the earlier scope-parser comment, and it makes ordinary command logs look like hook logs whenever BUILDKITE_JOB_LOGS_OTLP and the OpenTelemetry tracing backend are both enabled; use the command phase attributes here instead.

@catkins catkins force-pushed the catkins/codex/otlp-job-logs branch from a4d79b0 to 852196b June 7, 2026 07:26
@catkins catkins force-pushed the catkins/codex/otlp-job-logs branch from 852196b to 8851f51 June 7, 2026 07:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8851f514f1

Comment thread internal/shell/shell.go
Comment on lines +497 to +500
var flushers []interface{ Flush() }
if c.shell.outputInterceptor != nil {
	if cfg.captureStdout == nil && stdout != nil && stdout != io.Discard {
		stdout = c.shell.outputInterceptor(ctx, stdout, cfg.outputAttrs)

P2: Include shell-generated log lines in OTLP

When BUILDKITE_JOB_LOGS_OTLP runs with the OpenTelemetry tracing backend, the parent exporter is disabled in agent/job_runner.go, so this OutputInterceptor becomes the only OTLP log path; however it is installed only around Command.Run stdout/stderr here. Lines emitted directly by the bootstrap shell logger (Headerf, Printf, Warningf, prompts, checkout comments, etc.) still write to the normal shell writer and never pass through the interceptor, so visible job log lines generated by the agent are missing from OTLP in that configuration. Wrap the shell logger/stdout as well, or keep a full job-log exporter without duplicating command output.

Labels

feature New user-facing feature!

1 participant