Skip to content

wrapAISDK streaming: time_to_first_token measures stream framing, not first token (two distinct bugs) #2110

@cobrien704

Description

Summary

When using wrapAISDK with AI SDK v6 streaming (streamText / ToolLoopAgent.stream), time_to_first_token is wrong on both span levels that record it, for two different reasons:

  1. Parent wrapper span (streamText / ToolLoopAgent.stream): the tap timestamps the first chunk of any type through result.baseStream. AI SDK v6 pre-buffers a synthetic {type: "start"} framing part into that stream at construction time — before the provider HTTP request fires — so the recorded value is "time until the consumer starts reading" (single-digit milliseconds) and contains zero model latency.
  2. doStream child span (type llm): the clock starts after await originalDoStream(...) resolves — i.e. after the provider's HTTP response headers have arrived, when the real TTFT (request transit, queueing, prefill) has already elapsed — and the first chunk timestamped is the provider's {type: "stream-start"} framing part, available on the very first read. The two timestamps land microseconds apart, so the metric clamps to its 1e-6 floor.

Both values are content-free: they do not change whether the model takes 200ms or 20s to produce its first token.

Both bugs are present on main today (js/src/instrumentation/plugins/ai-sdk-plugin.ts), so the latest release (3.17.0) is affected as well as the version we run (3.11.0).

Environment

  • braintrust 3.11.0 (verified identical logic on main / 3.17.0)
  • ai 6.x (AI SDK v6, LanguageModelV3)
  • @ai-sdk/openai (responses API, streaming); also applies to other providers
  • Pattern: const {ToolLoopAgent} = wrapAISDK(ai)agent.stream(...)toUIMessageStream()

Bug 1 — parent span: framing chunk counted as first token

patchAISDKStreamingResult taps result.baseStream and timestamps the first chunk of any type (ai-sdk-plugin.ts:1383-1401):

if (isReadableStreamLike(resultRecord.baseStream)) {
  let firstChunkTime: number | undefined;
  const wrappedBaseStream = resultRecord.baseStream.pipeThrough(
    new TransformStream({
      transform(chunk, controller) {
        if (firstChunkTime === undefined) {
          firstChunkTime = getCurrentUnixTimestamp();   // ← any chunk type
        }
        controller.enqueue(chunk);
      },
      async flush() {
        const metrics = extractTopLevelAISDKMetrics(result, endEvent);
        if (metrics.time_to_first_token === undefined && firstChunkTime !== undefined) {
          metrics.time_to_first_token = firstChunkTime - startTime;
        }

But in AI SDK v6, the stream that feeds baseStream enqueues a synthetic start part in the ReadableStream's start() callback, which runs synchronously at stream construction — the moment streamText() / agent.stream() is invoked, before any network activity:

// ai (v6), streamText result construction
let stream = new ReadableStream({
  async start(controller) {
    controller.enqueue({ type: "start" });   // ← pre-buffered at construction
  },
  ...
this.baseStream = stream.pipeThrough(createOutputTransformStream(...)).pipeThrough(eventProcessor);

All consumer surfaces (textStream, fullStream, toUIMessageStream) read via teeStream(), which reads the instance property this.baseStream — so the Object.defineProperty(resultRecord, "baseStream", ...) patch does interpose on real consumption, and the first thing the tap observes is the pre-buffered start part:

// ai (v6)
teeStream() {
  const [stream1, stream2] = this.baseStream.tee();  // reads patched property
  this.baseStream = stream2;
  return stream1;
}

Result: time_to_first_token = (consumption start − agent.stream() call) ≈ a few ms, regardless of model latency.

The fallback branch (findAsyncIterableField over ["partialObjectStream", "textStream", "fullStream", "stream"]) has the same any-chunk-type issue for the fullStream/stream fields.

Bug 2 — doStream child span: clock starts after headers AND counts stream-start

The model patch has two compounding problems (ai-sdk-plugin.ts:1106-1120):

const result = await withCurrent(span, () =>
  Reflect.apply(originalDoStream, resolvedModel, [options]),  // ← HTTP request happens here;
);                                                            //   resolves at response HEADERS
const streamStartTime = getCurrentUnixTimestamp();            // ← clock starts AFTER that
let firstChunkTime: number | undefined;
const transformStream = new TransformStream({
  transform(chunk: AISDKModelStreamChunk, controller) {
    if (firstChunkTime === undefined) {
      firstChunkTime = getCurrentUnixTimestamp();             // ← any chunk type, incl. stream-start
    }
    ...
    case "finish":
      ...
      metrics.time_to_first_token = Math.max(
        firstChunkTime - streamStartTime,
        1e-6                                                  // ← the floor everyone actually sees
      );
  • Clock starts too late. originalDoStream() resolves when the provider's HTTP response begins. Everything that constitutes real TTFT has already happened before streamStartTime is captured.
  • First chunk is framing. Providers enqueue {type: "stream-start", warnings} in their transformer's start() callback (e.g. @ai-sdk/openai), so it's available on the first read, microseconds after streamStartTime.

Net: firstChunkTime - streamStartTime ≈ 0, clamped to 1e-6. This exact value is visible (unremarked) in the repro output of #1868, which shows "time_to_first_token": 0.000001 on a doStream span.

Production evidence

Same service, same week, same traffic (OpenAI responses API, ~70–90k token prompts, reasoning enabled):

Source p50 p95 min (7d)
AI SDK's own OTel attribute ai.response.msToFirstChunk 990 ms 2,621 ms 669 ms
Braintrust time_to_first_token (Monitor preset) ~ms < 100 ms 1e-6 s

The minimum real TTFT observed over 7 days is ~7,000× larger than the Braintrust p95. A sub-100ms p95 TTFT for a reasoning model on 70k-token prompts is not physically possible.

Why the AI SDK core gets it right

streamText computes ai.response.msToFirstChunk correctly — it explicitly skips the framing chunk before timestamping:

// ai (v6)
async transform(chunk, controller) {
  ...
  if (chunk.type === "stream-start") {
    warnings = chunk.warnings;
    return;                              // ← skipped, no timestamp
  }
  if (stepFirstChunk) {
    const msToFirstChunk = now() - startTimestampMs;
    ...

Suggested fix

Only set firstChunkTime on content-bearing part types, in both taps:

  • baseStream tap (Bug 1): ignore start, start-step, stream-start, response-metadata; timestamp on first text-delta, reasoning-delta, tool-input-start / tool-call, object, or raw.
  • doStream patch (Bug 2): additionally capture streamStartTime before await originalDoStream(...), so the request/queue/prefill phase is included (matching the AI SDK's own startTimestampMs semantics).

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions