Summary
I've been building a financial RAG evaluation framework (FinRAG-Eval) and ran into friction when trying to use braintrust-sdk-python for domain-specific eval pipelines over financial documents (10-Ks, earnings transcripts, SEC filings).
Problem
When running evals on financial RAG outputs, the default scorer setup doesn't map well to domain-specific correctness signals. Specifically:
- No built-in support for numerical/unit-aware comparison: financial answers often express the same figure in different forms, such as $3.2B vs 3.2 billion. Current string-match scorers treat these as mismatches.
- No dataset schema for context-grounded financial Q&A — when loading eval datasets from Braintrust's dataset store, there's no documented convention for attaching retrieved context chunks (needed for faithfulness scoring).
- Custom scorer registration is verbose — adding a domain scorer (e.g., a ROUGE-F1 scorer or a financial entity extractor) requires wrapping functions manually with no type hints or schema validation.
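To make the first point concrete, here is a minimal sketch of the kind of normalization I have in mind. It is plain Python and not part of the Braintrust SDK; the names `parse_amount` and `numerically_equal`, the suffix table, and the choice to treat a percent sign as division by 100 are all assumptions for illustration.

```python
import math
import re
from typing import Optional

# Magnitude suffixes and their multipliers (illustrative; extend as needed).
_MULTIPLIERS = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}

# Optional "$", a number, an optional magnitude suffix, an optional "%".
_AMOUNT_RE = re.compile(
    r"^\$?\s*(-?\d+(?:\.\d+)?)\s*(thousand|million|billion|k|m|b)?\s*(%)?$",
    re.IGNORECASE,
)


def parse_amount(text: str) -> Optional[float]:
    """Return the numeric value of a short amount string, or None if unparseable."""
    cleaned = text.strip().replace(",", "")  # drop thousands separators
    match = _AMOUNT_RE.match(cleaned)
    if match is None:
        return None
    value = float(match.group(1))
    suffix = match.group(2)
    if suffix:
        value *= _MULTIPLIERS[suffix.lower()]
    if match.group(3):  # assumption: "12.5%" is compared as the fraction 0.125
        value /= 100.0
    return value


def numerically_equal(a: str, b: str, rel_tol: float = 1e-9) -> bool:
    """Compare two answers by value when both parse; otherwise by exact string."""
    left, right = parse_amount(a), parse_amount(b)
    if left is None or right is None:
        return a.strip() == b.strip()
    return math.isclose(left, right, rel_tol=rel_tol)


print(numerically_equal("$3.2B", "3.2 billion"))
print(numerically_equal("$3.2B", "3.2 million"))
```

This only handles answers that consist of a single amount. Extracting figures from full sentences, or handling currencies other than USD, would need more work.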
Proposed Solution
- Add a context field to the standard EvalCase schema (alongside input, expected, metadata) so retrieved chunks can be passed through to scorers natively.
- Provide a NumericEquivalenceScorer that normalizes units (B/M/K, $, %) before comparing.
- Allow scorer registration via a decorator pattern (@braintrust.scorer) similar to how pytest fixtures work; this would reduce boilerplate significantly.
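A rough sketch of how the decorator and the context field could fit together is below. It is standalone Python and does not use or describe the current Braintrust API: `scorer`, `ScoreResult`, `SCORER_REGISTRY`, and `ALLOWED_FIELDS` are hypothetical names. The idea borrowed from pytest is that a scorer declares the fields it needs as parameters and the decorator passes only those.

```python
import functools
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Fields a scorer may request by parameter name (hypothetical schema,
# including the proposed "context" field).
ALLOWED_FIELDS = {"input", "output", "expected", "context", "metadata"}


@dataclass
class ScoreResult:
    """Hypothetical stand-in for the SDK's score object."""
    name: str
    score: float


SCORER_REGISTRY: Dict[str, Callable[..., ScoreResult]] = {}


def scorer(fn: Callable[..., Any]) -> Callable[..., ScoreResult]:
    """Register fn as a scorer and validate its signature at decoration time."""
    params = list(inspect.signature(fn).parameters)
    unknown = set(params) - ALLOWED_FIELDS
    if unknown:
        raise TypeError(f"{fn.__name__} requests unknown fields: {sorted(unknown)}")

    @functools.wraps(fn)
    def wrapper(**case: Any) -> ScoreResult:
        # Pass only the fields the scorer asked for; missing ones become None.
        result = fn(**{name: case.get(name) for name in params})
        if isinstance(result, (int, float)):
            result = ScoreResult(name=fn.__name__, score=float(result))
        if not 0.0 <= result.score <= 1.0:
            raise ValueError(f"{fn.__name__} returned score outside [0, 1]")
        return result

    SCORER_REGISTRY[fn.__name__] = wrapper
    return wrapper


@scorer
def context_overlap(output: str, context: list[str]) -> float:
    """Toy scorer: fraction of output words that appear somewhere in the context."""
    joined = " ".join(context).lower()
    words = output.lower().split()
    if not words:
        return 0.0
    return sum(word in joined for word in words) / len(words)


result = context_overlap(
    input="How did revenue change?",
    output="revenue grew",
    context=["Revenue grew 5% year over year."],
)
print(result)
```

Validating the signature when the function is decorated means a misspelled field name fails at import time instead of partway through an eval run.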
Example Use Case
# Proposed API: @braintrust.scorer and NumericEquivalenceScorer do not exist yet.
@braintrust.scorer
def financial_faithfulness(output: str, context: list[str]) -> Score:
    # Check if numerical claims in output are grounded in context
    ...
    return Score(name="financial_faithfulness", score=0.87)

await Eval(
    "FinRAG-Eval",
    data=financial_qa_dataset,
    task=rag_pipeline,
    scores=[financial_faithfulness, NumericEquivalenceScorer()],
)
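For illustration only, one very simple way a grounding check like the one in the comment above could work is to pull numeric figures out of the output and count how many also appear in the retrieved chunks. This is an assumption about one possible approach, not the actual body of financial_faithfulness, and the helper names `extract_figures` and `grounded_fraction` are hypothetical.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
_FIGURE_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_figures(text: str) -> set[str]:
    """Return the set of numeric tokens in text, with thousands separators removed."""
    return {token.replace(",", "") for token in _FIGURE_RE.findall(text)}


def grounded_fraction(output: str, context: list[str]) -> float:
    """Fraction of figures in the output that also appear in any context chunk."""
    claimed = extract_figures(output)
    if not claimed:
        return 1.0  # no numeric claims, so nothing can be ungrounded
    supported: set[str] = set()
    for chunk in context:
        supported |= extract_figures(chunk)
    return len(claimed & supported) / len(claimed)


print(grounded_fraction(
    "Revenue was $3.2B, up 12%",
    ["Revenue of $3.2 billion, an increase of 12% year over year"],
))
```

Note that this compares bare digits and ignores units, so 3.2B and 3.2M would count as a match. A real scorer would need unit-aware normalization on both sides.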
Context
I'm building this as part of finrag-eval, a framework for evaluating LLM outputs over financial documents. Happy to contribute a PR for the scorer decorator pattern or the context field addition if the team is open to it.
References
- Braintrust Eval Docs
- Related: DeepEval's LLMTestCase has a retrieval_context field that works similarly