fix(langchain): stop double-counting anthropic cache tokens in prompt totals#504

Draft
bhaveshklaviyo wants to merge 1 commit into
braintrustdata:mainfrom
bhaveshklaviyo:fix/langchain-anthropic-cache-double-count

@bhaveshklaviyo

Summary

BraintrustCallbackHandler double-counts Anthropic prompt-cache tokens: prompt_tokens on every cached ChatAnthropic span is inflated by exactly cache_read + cache_creation, roughly 2× the real prompt size once a cache is warm. We noticed this because Braintrust token/cost numbers diverged from the same trace exported via OpenInference/OTel (e.g. a request with 37,694 input tokens — 37,324 of them cache reads, 369 cache writes — was reported as 75,387 prompt tokens).
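The figures in that example can be checked directly. This is only a worked calculation; the variable names are illustrative and are not Braintrust or LangChain identifiers.

```python
# Token counts from the trace described above.
input_tokens = 37_694    # real prompt size, cache tokens already included
cache_read = 37_324      # cache reads, a subset of input_tokens
cache_creation = 369     # cache writes, a subset of input_tokens

# Adding the cache tokens a second time reproduces the inflated figure.
double_counted = input_tokens + cache_read + cache_creation
uncached = input_tokens - cache_read - cache_creation

print(double_counted)                            # 75387
print(uncached)                                  # 1
print(round(double_counted / input_tokens, 2))   # 2.0
```

With only one uncached token in the request, nearly the whole prompt is counted twice, which is where the roughly 2× inflation comes from.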

Root cause

The cache normalization added in #411 (and scoped in #445 via _cache_tokens_are_separate_from_input_tokens) assumes Anthropic-style integrations report cache tokens separately from input_tokens, detecting that convention by the presence of cache_creation / ephemeral_*m_input_tokens keys in input_token_details.

That premise doesn't hold for any version of langchain-anthropic that can reach this code path:

  • langchain-anthropic ≥ 0.2.3 explicitly folds cache tokens into input_tokens (_create_usage_metadata: "Anthropic's input_tokens excludes cached tokens, so we manually add cache_read and cache_creation tokens to get the true total"), while still emitting the detail keys that trip the heuristic.
  • langchain-anthropic ≤ 0.2.0 reported uncached input only, but didn't emit input_token_details at all, so the folding block never runs for those versions either.
  • langchain-aws (ChatBedrockConverse) and langchain-openai also fold cache tokens into input_tokens, per the langchain-core UsageMetadata contract (input_token_details is a breakdown of input_tokens, not an addition to it).

So the heuristic fires on every cached Anthropic response and adds the cache tokens a second time. The existing VCR test didn't catch it because its assertion (prompt_tokens >= cache_creation_tokens) holds for the doubled value too.
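A minimal sketch of the failure mode, assuming the key-presence check behaves as described above. Both functions are hypothetical stand-ins written for illustration, not the actual handler code, and the usage payload reuses the numbers from the Summary.

```python
def keys_suggest_separate_accounting(details: dict) -> bool:
    """Hypothetical stand-in for the key-presence heuristic."""
    return "cache_creation" in details or any(
        key.startswith("ephemeral_") for key in details
    )


def prompt_tokens_by_key_presence(usage: dict) -> int:
    """Add cache tokens to input_tokens whenever the detail keys are present."""
    details = usage.get("input_token_details", {})
    prompt = usage["input_tokens"]
    if keys_suggest_separate_accounting(details):
        prompt += details.get("cache_read", 0) + details.get("cache_creation", 0)
    return prompt


# input_tokens already includes the cache tokens (folded convention).
usage = {
    "input_tokens": 37_694,
    "input_token_details": {"cache_read": 37_324, "cache_creation": 369},
}
span_prompt_tokens = prompt_tokens_by_key_presence(usage)

# The weak assertion holds even for the doubled value ...
weak_check = span_prompt_tokens >= usage["input_token_details"]["cache_creation"]
# ... while an equality check against the reported total exposes the bug.
strict_check = span_prompt_tokens == usage["input_tokens"]

print(span_prompt_tokens, weak_check, strict_check)  # 75387 True False
```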

Fix

Detect separate cache-token accounting arithmetically instead of by key presence: fold cache tokens into prompt_tokens/total_tokens only when cache tokens exceed the reported prompt total — impossible under the UsageMetadata contract, but exactly the inconsistency ("cache creation tokens exceeded total tokens", BT-5150) that the original normalization was added to repair. This keeps #411's protection for any integration that genuinely reports uncached input only, keeps #445's OpenAI behavior, and keeps #455's TTL-split cache-creation metrics untouched.
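The arithmetic check could look roughly like the sketch below. It assumes "cache tokens" means the sum of cache_read and cache_creation; the function name and return shape are hypothetical and do not mirror the handler's real code. The second payload is the Summary trace rewritten as if the integration had reported uncached input only (1 token).

```python
def fold_cache_tokens(usage: dict) -> dict:
    """Fold cache tokens in only when they exceed the reported prompt total."""
    details = usage.get("input_token_details") or {}
    cache_tokens = details.get("cache_read", 0) + details.get("cache_creation", 0)
    prompt = usage.get("input_tokens", 0)
    total = usage.get("total_tokens", prompt + usage.get("output_tokens", 0))
    if cache_tokens > prompt:
        # Inconsistent under the UsageMetadata contract, so treat the cache
        # tokens as having been reported separately and add them once.
        prompt += cache_tokens
        total += cache_tokens
    return {"prompt_tokens": prompt, "total_tokens": total}


details = {"cache_read": 37_324, "cache_creation": 369}

# Folded convention: input_tokens already includes cache tokens.
folded = fold_cache_tokens({"input_tokens": 37_694, "input_token_details": details})

# Separate convention: input_tokens is the uncached input only.
separate = fold_cache_tokens({"input_tokens": 1, "input_token_details": details})

print(folded["prompt_tokens"], separate["prompt_tokens"])  # 37694 37694
```

Both conventions land on the same prompt total, and a cache_read-only payload whose cache tokens are a subset of input_tokens takes the same untouched path as the folded case.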

Tests

  • Strengthened the cassette-backed test_prompt_caching_tokens to assert span prompt_tokens/total_tokens equal the usage_metadata LangChain reported (red on main: assert 2170 == 1095 with the existing cassettes; green with the fix).
  • Added unit coverage for the three conventions: folded (Anthropic ≥ 0.2.3, with TTL split), subset (OpenAI cache_read-only, restoring the #445 regression test removed in #455), and separate (legacy, still normalized).
  • nox -s "test_langchain(latest)" and nox -s "test_langchain(0.3.28)" pass; pylint and pre-commit hooks pass.

🤖 Generated with Claude Code

… totals

langchain-anthropic has folded cache read/creation tokens into
usage_metadata input_tokens since 0.2.3 (versions before that don't emit
input_token_details at all), and langchain-aws does the same — per the
langchain-core UsageMetadata contract, input_token_details is a breakdown
of input_tokens, not an addition to it.

The cache normalization from braintrustdata#411/braintrustdata#445 detected "separate cache token
accounting" by the presence of cache_creation/ephemeral_* detail keys,
which langchain-anthropic always emits, so every cached Anthropic call
had cache tokens added to prompt_tokens a second time. With a warm cache
this roughly doubles reported prompt tokens (e.g. a real trace reported
75,387 prompt tokens for a 37,694-token request with 37,324 cache reads
and 369 cache writes).

Detect separate accounting arithmetically instead: only fold cache
tokens into prompt/total when they exceed the reported prompt total,
which is impossible under the UsageMetadata contract but is exactly the
inconsistency the original normalization (BT-5150) was added to repair.

Strengthen the VCR prompt-caching test to assert span prompt/total
tokens equal the usage_metadata the model reported, and add unit
coverage for the folded (Anthropic), subset (OpenAI), and separate
(legacy) conventions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>