fix(langchain): stop double-counting anthropic cache tokens in prompt totals by bhaveshklaviyo · Pull Request #504 · braintrustdata/braintrust-sdk-python

bhaveshklaviyo · 2026-06-09T22:16:46Z

Summary

BraintrustCallbackHandler double-counts Anthropic prompt-cache tokens: prompt_tokens on every cached ChatAnthropic span is inflated by exactly cache_read + cache_creation, roughly 2× the real prompt size once a cache is warm. We noticed this because Braintrust token/cost numbers diverged from the same trace exported via OpenInference/OTel (e.g. a request with 37,694 input tokens — 37,324 of them cache reads, 369 cache writes — was reported as 75,387 prompt tokens).
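The arithmetic behind the example trace can be checked directly. The snippet below is only a worked calculation using the figures quoted above; the variable names are illustrative and are not taken from the SDK.

```python
# Figures quoted in the summary; names are illustrative only.
reported_input_tokens = 37_694   # what LangChain reported (cache tokens already included)
cache_read_tokens = 37_324
cache_creation_tokens = 369

# What the handler reported: the cache tokens were added a second time.
inflated_prompt_tokens = (
    reported_input_tokens + cache_read_tokens + cache_creation_tokens
)

# Tokens in the request that were neither read from nor written to the cache.
uncached_tokens = reported_input_tokens - cache_read_tokens - cache_creation_tokens

# How far the reported figure overshoots the real prompt size.
inflation_ratio = inflated_prompt_tokens / reported_input_tokens

print(inflated_prompt_tokens, uncached_tokens, round(inflation_ratio, 3))
```

Because almost the entire prompt is cached here, the inflated figure comes out at very nearly twice the real one.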

Root cause

The cache normalization added in #411 (and scoped in #445 via _cache_tokens_are_separate_from_input_tokens) assumes Anthropic-style integrations report cache tokens separately from input_tokens, detecting that convention by the presence of cache_creation / ephemeral_*m_input_tokens keys in input_token_details.

That premise doesn't hold for any version of langchain-anthropic that can reach this code path:

langchain-anthropic ≥ 0.2.3 explicitly folds cache tokens into input_tokens (_create_usage_metadata: "Anthropic's input_tokens excludes cached tokens, so we manually add cache_read and cache_creation tokens to get the true total"), while still emitting the detail keys that trip the heuristic.
langchain-anthropic ≤ 0.2.0 reported uncached input only, but didn't emit input_token_details at all, so the folding block never runs for those versions either.
langchain-aws (ChatBedrockConverse) and langchain-openai also fold cache tokens into input_tokens, per the langchain-core UsageMetadata contract (input_token_details is a breakdown of input_tokens, not an addition to it).
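To make the misfire concrete, here is a minimal sketch of a key-presence heuristic applied to a usage payload that already folds cache tokens into input_tokens. It is not the SDK's code: `separate_by_key_presence` and `normalize_by_key_presence` are hypothetical stand-ins, and the payload is a simplified dict that reuses the figures from the summary.

```python
def separate_by_key_presence(input_token_details: dict) -> bool:
    """Hypothetical stand-in: treat cache tokens as 'separate' whenever
    a cache_creation or ephemeral_* detail key is present."""
    return any(
        key == "cache_creation" or key.startswith("ephemeral_")
        for key in input_token_details
    )


def normalize_by_key_presence(usage: dict) -> int:
    """Return prompt tokens after the (flawed) key-based normalization."""
    details = usage.get("input_token_details", {})
    prompt_tokens = usage["input_tokens"]
    if separate_by_key_presence(details):
        # Adds cache tokens on top of input_tokens, even if they are
        # already included there.
        prompt_tokens += details.get("cache_read", 0) + details.get("cache_creation", 0)
    return prompt_tokens


# Simplified payload in the "folded" convention: input_tokens is the true
# total and input_token_details is only a breakdown of it.
folded_usage = {
    "input_tokens": 37_694,
    "input_token_details": {"cache_read": 37_324, "cache_creation": 369},
}

# A payload without input_token_details is left alone.
plain_usage = {"input_tokens": 37_694}

doubled = normalize_by_key_presence(folded_usage)
untouched = normalize_by_key_presence(plain_usage)
print(doubled, untouched)
```

The detail keys describe how input_tokens breaks down, so their presence alone says nothing about whether the cache tokens still need to be added.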

So the heuristic fires on every cached Anthropic response and adds the cache tokens a second time. The existing VCR test didn't catch it because its assertion (prompt_tokens >= cache_creation_tokens) holds for the doubled value too.
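The gap in the old assertion is easy to see with the figures from the summary. The helpers below are hypothetical and only illustrate the two styles of check; they are not the test code in the repository.

```python
def weak_check(prompt_tokens: int, cache_creation_tokens: int) -> bool:
    """Lower-bound check in the style of the old assertion."""
    return prompt_tokens >= cache_creation_tokens


def strict_check(prompt_tokens: int, reported_input_tokens: int) -> bool:
    """Equality check against what LangChain's usage_metadata reported."""
    return prompt_tokens == reported_input_tokens


reported_input_tokens = 37_694
cache_creation_tokens = 369
correct_prompt_tokens = 37_694
doubled_prompt_tokens = 75_387

# The lower bound is satisfied by both values, so it cannot detect the bug.
weak_results = (
    weak_check(correct_prompt_tokens, cache_creation_tokens),
    weak_check(doubled_prompt_tokens, cache_creation_tokens),
)

# The equality check accepts only the value that matches usage_metadata.
strict_results = (
    strict_check(correct_prompt_tokens, reported_input_tokens),
    strict_check(doubled_prompt_tokens, reported_input_tokens),
)
print(weak_results, strict_results)
```

Adding tokens can only push the value further above a lower bound, so a check of that shape never fails on over-counting.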

Fix

Detect separate cache-token accounting arithmetically instead of by key presence: fold cache tokens into prompt_tokens/total_tokens only when cache tokens exceed the reported prompt total — impossible under the UsageMetadata contract, but exactly the inconsistency ("cache creation tokens exceeded total tokens", BT-5150) that the original normalization was added to repair. This keeps #411's protection for any integration that genuinely reports uncached input only, keeps #445's OpenAI behavior, and keeps #455's TTL-split cache-creation metrics untouched.
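A minimal sketch of the arithmetic rule described above, assuming a simplified signature. `normalize_prompt_tokens` is a hypothetical name rather than the function changed in this PR, and the subset example uses made-up numbers; per the description, the same adjustment would apply to total_tokens.

```python
def normalize_prompt_tokens(
    input_tokens: int, cache_read: int = 0, cache_creation: int = 0
) -> int:
    """Fold cache tokens into the prompt total only when the reported
    numbers are inconsistent with cache tokens being a subset of it."""
    cache_tokens = cache_read + cache_creation
    if cache_tokens > input_tokens:
        # Cache tokens cannot be a breakdown of a smaller total, so the
        # integration must be reporting uncached input only.
        return input_tokens + cache_tokens
    return input_tokens


# Folded convention: cache tokens already included, left unchanged.
folded = normalize_prompt_tokens(37_694, cache_read=37_324, cache_creation=369)

# Subset convention (cache_read only, made-up numbers): left unchanged.
subset = normalize_prompt_tokens(1_000, cache_read=800)

# Separate convention: uncached input only, so cache tokens are added.
separate = normalize_prompt_tokens(1, cache_read=37_324, cache_creation=369)

print(folded, subset, separate)
```

Under this sketch, a separately accounted response whose uncached input is at least as large as its cache tokens would pass through unchanged. The rule targets only the "cache tokens exceed the total" inconsistency that the description says the original normalization was meant to repair.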

Tests

Strengthened the cassette-backed test_prompt_caching_tokens to assert span prompt_tokens/total_tokens equal the usage_metadata LangChain reported (red on main: assert 2170 == 1095 with the existing cassettes; green with the fix).
Added unit coverage for the three conventions: folded (Anthropic ≥ 0.2.3, with TTL split), subset (OpenAI cache_read-only, restoring the #445 regression test removed in #455), and separate (legacy, still normalized).
nox -s "test_langchain(latest)" and nox -s "test_langchain(0.3.28)" pass; pylint and pre-commit hooks pass.

🤖 Generated with Claude Code

… totals

langchain-anthropic has folded cache read/creation tokens into usage_metadata input_tokens since 0.2.3 (versions before that don't emit input_token_details at all), and langchain-aws does the same. Per the langchain-core UsageMetadata contract, input_token_details is a breakdown of input_tokens, not an addition to it.

The cache normalization from braintrustdata#411/braintrustdata#445 detected "separate cache token accounting" by the presence of cache_creation/ephemeral_* detail keys, which langchain-anthropic always emits, so every cached Anthropic call had cache tokens added to prompt_tokens a second time. With a warm cache this roughly doubles reported prompt tokens (e.g. a real trace reported 75,387 prompt tokens for a 37,694-token request with 37,324 cache reads and 369 cache writes).

Detect separate accounting arithmetically instead: only fold cache tokens into prompt/total when they exceed the reported prompt total, which is impossible under the UsageMetadata contract but is exactly the inconsistency the original normalization (BT-5150) was added to repair.

Strengthen the VCR prompt-caching test to assert span prompt/total tokens equal the usage_metadata the model reported, and add unit coverage for the folded (Anthropic), subset (OpenAI), and separate (legacy) conventions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

bhaveshklaviyo wants to merge 1 commit into braintrustdata:main from bhaveshklaviyo:fix/langchain-anthropic-cache-double-count
