
Feature Request: Support for Financial RAG Dataset Eval Pipelines with Domain-Specific Scorers #492

@Ruthwik-Data

Description

Summary

I've been building a financial RAG evaluation framework (FinRAG-Eval) and ran into friction when trying to use braintrust-sdk-python for domain-specific eval pipelines over financial documents (10-Ks, earnings transcripts, SEC filings).

Problem

When running evals on financial RAG outputs, the default scorer setup doesn't map well to domain-specific correctness signals. Specifically:

  1. No built-in support for numerical, unit-aware comparison. Financial answers often express the same figure in different forms, such as $3.2B and 3.2 billion, and the current string-match scorers score these as mismatches.
  2. No dataset schema for context-grounded financial Q&A. When loading eval datasets from Braintrust's dataset store, there's no documented convention for attaching the retrieved context chunks that faithfulness scoring needs.
  3. Custom scorer registration is verbose. Adding a domain scorer (e.g., a ROUGE-F1 scorer or a financial entity extractor) requires wrapping functions manually, with no type hints or schema validation.
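
To make the first point concrete, here is a minimal sketch of unit-aware comparison in plain Python. The helpers `parse_amount` and `numeric_equivalent` are hypothetical names used only for illustration and are not part of braintrust-sdk-python; the suffix table, the choice to read `%` as a fraction, and the tolerance are all assumptions.

```python
import math
import re
from typing import Optional

# Multipliers for common magnitude suffixes (an assumed set; extend as needed).
_SCALE = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "mm": 1e6, "million": 1e6,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
    "t": 1e12, "trillion": 1e12,
}

# Optional sign, optional "$", a number with optional thousands separators,
# then an optional unit word or "%".
_AMOUNT_RE = re.compile(
    r"^\s*(?P<neg>-)?\s*\$?\s*"
    r"(?P<num>\d[\d,]*(?:\.\d+)?|\.\d+)\s*"
    r"(?P<unit>[a-zA-Z]+|%)?\s*$"
)


def parse_amount(text: str) -> Optional[float]:
    """Return the numeric value of a short amount string, or None if unparseable."""
    m = _AMOUNT_RE.match(text)
    if m is None:
        return None
    value = float(m.group("num").replace(",", ""))
    unit = (m.group("unit") or "").lower()
    if unit == "%":
        value /= 100.0  # assumption: percentages compare as fractions
    elif unit:
        if unit not in _SCALE:
            return None  # unknown suffix: refuse to guess
        value *= _SCALE[unit]
    return -value if m.group("neg") else value


def numeric_equivalent(output: str, expected: str, rel_tol: float = 1e-6) -> float:
    """Score 1.0 when both strings parse to the same amount, else 0.0."""
    a, b = parse_amount(output), parse_amount(expected)
    if a is None or b is None:
        return 0.0
    return 1.0 if math.isclose(a, b, rel_tol=rel_tol) else 0.0
```

With this sketch, `numeric_equivalent("$3.2B", "3.2 billion")` scores as a match where an exact string comparison would not. It only handles strings that consist of a single amount; extracting figures from full sentences would need an additional step.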

Proposed Solution

  • Add a context field to the standard EvalCase schema (alongside input, expected, metadata) so retrieved chunks can be passed through to scorers natively.
  • Provide a NumericEquivalenceScorer that normalizes units (B/M/K, $, %) before comparing.
  • Allow scorer registration via a decorator pattern (@braintrust.scorer) similar to how pytest fixtures work — this would reduce boilerplate significantly.
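
As a rough sketch of how the decorator idea could behave, the following standalone Python registers a function, validates its argument names, and normalizes its return value. Everything here is hypothetical: `scorer`, `SCORER_REGISTRY`, `ALLOWED_ARGS`, and the local `Score` dataclass are stand-ins written for this issue, not existing SDK names, and the 0 to 1 score range check is an assumption.

```python
import functools
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Score:
    """Stand-in result type for this sketch; not the SDK's own Score class."""
    name: str
    score: float


# Hypothetical registry keyed by scorer function name.
SCORER_REGISTRY: Dict[str, Callable[..., Score]] = {}

# Argument names a scorer may declare; `context` is the field proposed above.
ALLOWED_ARGS = {"input", "output", "expected", "metadata", "context"}


def scorer(fn: Callable[..., Any]) -> Callable[..., Score]:
    """Register `fn` as a scorer and validate its signature and result."""
    params = inspect.signature(fn).parameters
    unknown = set(params) - ALLOWED_ARGS
    if unknown:
        raise TypeError(
            f"{fn.__name__}: unsupported scorer arguments {sorted(unknown)}"
        )

    @functools.wraps(fn)
    def wrapper(**case: Any) -> Score:
        # Pass through only the case fields the function declared.
        result = fn(**{k: case[k] for k in params if k in case})
        if isinstance(result, (int, float)):
            result = Score(name=fn.__name__, score=float(result))
        if not isinstance(result, Score):
            raise TypeError(f"{fn.__name__} must return a Score or a number")
        if not 0.0 <= result.score <= 1.0:
            raise ValueError(f"{fn.__name__} returned a score outside [0, 1]")
        return result

    SCORER_REGISTRY[fn.__name__] = wrapper
    return wrapper


@scorer
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0
```

Validating argument names at decoration time means a typo such as `contxt` fails when the module is imported rather than midway through an eval run, which is the main boilerplate saving this proposal is after.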

Example Use Case

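# Proposed API: @braintrust.scorer and NumericEquivalenceScorer are the
# additions requested in this issue.
@braintrust.scorer
def financial_faithfulness(output: str, context: list[str]) -> Score:
    # Check whether numerical claims in output are grounded in context
    ...
    return Score(name="financial_faithfulness", score=0.87)  # placeholder value

await Eval(
    "FinRAG-Eval",
    data=financial_qa_dataset,  # each case would carry a `context` list
    task=rag_pipeline,
    scores=[financial_faithfulness, NumericEquivalenceScorer()],
)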
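
For illustration, the grounding check inside a scorer like `financial_faithfulness` could be as simple as the standalone sketch below. The `EvalCase` dataclass here is a hypothetical layout showing the proposed `context` field next to the existing ones, and `extract_numbers` and `numeric_grounding` are made-up helper names; none of this is existing SDK code.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class EvalCase:
    """Hypothetical case layout: the standard fields plus `context`."""
    input: str
    expected: str
    context: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


# Digits with optional thousands separators and an optional decimal part.
_NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> Set[str]:
    """Collect numeric tokens from text, with commas stripped."""
    return {m.replace(",", "") for m in _NUMBER_RE.findall(text)}


def numeric_grounding(output: str, context: List[str]) -> float:
    """Fraction of numbers in `output` that also appear in the context chunks."""
    claimed = extract_numbers(output)
    if not claimed:
        return 1.0  # no numeric claims, so nothing can be ungrounded
    supported = extract_numbers(" ".join(context))
    return len(claimed & supported) / len(claimed)


case = EvalCase(
    input="What was total revenue in fiscal 2023?",
    expected="$3.2 billion",
    context=[
        "Total revenue was $3.2 billion for fiscal 2023.",
        "Operating margin was 12%.",
    ],
)
answer = "Revenue reached $3.2 billion in 2023, with a 15% margin."
grounding = numeric_grounding(answer, case.context)
```

In this example two of the three numbers in the answer (3.2 and 2023) appear in the context and the 15% margin does not, so `grounding` comes out to two thirds. The comparison is on literal tokens, so `3.20` and `3.2` would not match; a real scorer would want the unit normalization described above.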

Context

I'm building this as part of finrag-eval, a framework for evaluating LLM outputs over financial documents. Happy to contribute a PR for the scorer decorator pattern or the context field addition if the team is open to it.

References

  • Braintrust Eval Docs
  • Related: DeepEval's LLMTestCase has a retrieval_context field that works similarly
