fix(langchain): stop double-counting anthropic cache tokens in prompt totals #504
Draft
bhaveshklaviyo wants to merge 1 commit into
… totals

langchain-anthropic has folded cache read/creation tokens into usage_metadata input_tokens since 0.2.3 (versions before that don't emit input_token_details at all), and langchain-aws does the same: per the langchain-core UsageMetadata contract, input_token_details is a breakdown of input_tokens, not an addition to it.

The cache normalization from braintrustdata#411/braintrustdata#445 detected "separate cache token accounting" by the presence of cache_creation/ephemeral_* detail keys, which langchain-anthropic always emits, so every cached Anthropic call had cache tokens added to prompt_tokens a second time. With a warm cache this roughly doubles reported prompt tokens (e.g. a real trace reported 75,387 prompt tokens for a 37,694-token request with 37,324 cache reads and 369 cache writes).

Detect separate accounting arithmetically instead: only fold cache tokens into prompt/total when they exceed the reported prompt total, which is impossible under the UsageMetadata contract but is exactly the inconsistency the original normalization (BT-5150) was added to repair.

Strengthen the VCR prompt-caching test to assert span prompt/total tokens equal the usage_metadata the model reported, and add unit coverage for the folded (Anthropic), subset (OpenAI), and separate (legacy) conventions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
BraintrustCallbackHandler double-counts Anthropic prompt-cache tokens: prompt_tokens on every cached ChatAnthropic span is inflated by exactly cache_read + cache_creation, roughly 2× the real prompt size once a cache is warm. We noticed this because Braintrust token/cost numbers diverged from the same trace exported via OpenInference/OTel (e.g. a request with 37,694 input tokens, of which 37,324 were cache reads and 369 were cache writes, was reported as 75,387 prompt tokens).
Root cause
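To make the symptom concrete, here is a minimal sketch that reproduces the inflated number from the summary. It is not the SDK's actual code: the key-presence check is a simplified stand-in for the heuristic, and the dictionary only mimics the shape of a langchain-core usage_metadata payload.

```python
# Hypothetical reproduction of the double count; NOT the handler's real code.
# Shape follows the langchain-core UsageMetadata contract, where
# input_token_details is a breakdown of input_tokens, not an addition to it.
usage_metadata = {
    "input_tokens": 37_694,  # already includes cache reads and cache writes
    "input_token_details": {
        "cache_read": 37_324,
        "cache_creation": 369,
    },
}

details = usage_metadata["input_token_details"]
cache_tokens = details["cache_read"] + details["cache_creation"]

# Simplified key-presence heuristic (an assumption for illustration): seeing
# "cache_creation" is taken to mean cache tokens are not yet part of
# input_tokens, so they are added again.
if "cache_creation" in details:
    inflated_prompt_tokens = usage_metadata["input_tokens"] + cache_tokens
else:
    inflated_prompt_tokens = usage_metadata["input_tokens"]

print(inflated_prompt_tokens)  # 75387, versus the 37694 the model reported

# A lower-bound assertion cannot tell the two values apart: it passes for both.
assert usage_metadata["input_tokens"] >= details["cache_creation"]
assert inflated_prompt_tokens >= details["cache_creation"]
```

The inflation is exactly cache_read + cache_creation (37,693 here), which is why the reported total lands at roughly twice the real prompt size when almost all input is served from cache.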
The cache normalization added in #411 (and scoped in #445 via _cache_tokens_are_separate_from_input_tokens) assumes Anthropic-style integrations report cache tokens separately from input_tokens, detecting that convention by the presence of cache_creation / ephemeral_*m_input_tokens keys in input_token_details.
That premise doesn't hold for any version of langchain-anthropic that can reach this code path:
- langchain-anthropic 0.2.3 and later fold cache read/creation tokens into input_tokens (_create_usage_metadata: "Anthropic's input_tokens excludes cached tokens, so we manually add cache_read and cache_creation tokens to get the true total"), while still emitting the detail keys that trip the heuristic.
- Versions before 0.2.3 don't emit input_token_details at all, so the folding block never runs for those versions either.
- langchain-aws (ChatBedrockConverse) and langchain-openai also fold cache tokens into input_tokens, per the langchain-core UsageMetadata contract (input_token_details is a breakdown of input_tokens, not an addition to it).
So the heuristic fires on every cached Anthropic response and adds the cache tokens a second time. The existing VCR test didn't catch it because its assertion (prompt_tokens >= cache_creation_tokens) holds for the doubled value too.
Fix
Detect separate cache-token accounting arithmetically instead of by key presence: fold cache tokens into prompt_tokens/total_tokens only when cache tokens exceed the reported prompt total. That is impossible under the UsageMetadata contract, but it is exactly the inconsistency ("cache creation tokens exceeded total tokens", BT-5150) that the original normalization was added to repair. This keeps #411's protection for any integration that genuinely reports uncached input only, keeps #445's OpenAI behavior, and keeps #455's TTL-split cache-creation metrics untouched.
Tests
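As a rough illustration of the arithmetic check and the three conventions it has to handle, the sketch below uses a hypothetical helper, fold_cache_tokens, which is not the function changed in this PR. The Anthropic numbers come from the trace in the summary; the OpenAI and legacy numbers are invented for the example.

```python
def fold_cache_tokens(prompt_tokens: int, cache_read: int = 0, cache_creation: int = 0) -> int:
    """Hypothetical helper: return the prompt total to report on the span.

    Cache tokens are folded in only when they exceed the reported prompt
    total, which cannot happen if they are already counted inside it.
    """
    cache_tokens = cache_read + cache_creation
    if cache_tokens > prompt_tokens:
        # Separate accounting: prompt_tokens covers uncached input only.
        return prompt_tokens + cache_tokens
    # Folded or subset accounting: cache tokens are already included.
    return prompt_tokens


# Folded (Anthropic): numbers from the trace in the summary.
folded = fold_cache_tokens(37_694, cache_read=37_324, cache_creation=369)

# Subset (OpenAI, cache_read only): illustrative numbers.
subset = fold_cache_tokens(2_000, cache_read=1_500)

# Separate (legacy): illustrative numbers, uncached input reported alone.
separate = fold_cache_tokens(10, cache_creation=1_200)

print(folded, subset, separate)  # 37694 2000 1210
```

Only the separate case is adjusted; the folded and subset cases pass through with the total the model reported, which is what an equality assertion against usage_metadata would check.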
- Strengthen test_prompt_caching_tokens to assert span prompt_tokens/total_tokens equal the usage_metadata LangChain reported (red on main: assert 2170 == 1095 with the existing cassettes; green with the fix).
- Add unit coverage for the folded (Anthropic), subset (OpenAI cache_read-only, restoring the #445 regression test removed in #455), and separate (legacy, still normalized) conventions.
- nox -s "test_langchain(latest)" and nox -s "test_langchain(0.3.28)" pass; pylint and pre-commit hooks pass.
🤖 Generated with Claude Code