Feature Request: Support for Financial RAG Dataset Eval Pipelines with Domain-Specific Scorers

## Summary

I've been building a financial RAG evaluation framework ([FinRAG-Eval](https://github.com/ruthwikchikoti/finrag-eval)) and ran into friction when trying to use `braintrust-sdk-python` for domain-specific eval pipelines over financial documents (10-Ks, earnings transcripts, SEC filings).

## Problem

When running evals on financial RAG outputs, the default scorer setup doesn't map well to domain-specific correctness signals. Specifically:

1. **No built-in support for numerical/unit-aware comparison.** Financial answers often express the same figure in different notations, e.g. `$3.2B` and `3.2 billion`. String-match scorers treat these as mismatches.
2. **No dataset schema for context-grounded financial Q&A.** When loading eval datasets from Braintrust's dataset store, there is no documented convention for attaching the retrieved context chunks that faithfulness scoring needs.
3. **Custom scorer registration is verbose.** Adding a domain scorer (e.g., a ROUGE-F1 scorer or a financial entity extractor) requires wrapping functions manually, with no type hints or schema validation.
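To make the first point concrete, here is a minimal sketch of the kind of normalization a unit-aware comparison would need. The names `parse_amount` and `numeric_equivalent` are hypothetical helpers written for this issue, not part of any Braintrust API, and the suffix table only covers the K/M/B, `$`, and `%` cases mentioned above.

```python
import math
import re
from typing import Optional

# Hypothetical suffix table for illustration; a real scorer would need more cases.
_MULTIPLIERS = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}

# Optional "$", a decimal number, an optional unit word, an optional "%".
_AMOUNT_RE = re.compile(
    r"^\$?\s*(?P<num>\d+(?:\.\d+)?)\s*(?P<unit>[a-z]+)?\s*(?P<pct>%)?$"
)


def parse_amount(text: str) -> Optional[float]:
    """Parse strings such as '$3.2B' or '3.2 billion' into a plain float."""
    cleaned = text.strip().lower().replace(",", "")
    match = _AMOUNT_RE.match(cleaned)
    if match is None:
        return None
    unit = match.group("unit")
    if unit is not None and unit not in _MULTIPLIERS:
        return None  # unknown unit word: refuse to guess
    value = float(match.group("num"))
    if unit is not None:
        value *= _MULTIPLIERS[unit]
    if match.group("pct"):
        value /= 100  # percentages are compared as fractions
    return value


def numeric_equivalent(a: str, b: str, rel_tol: float = 1e-6) -> bool:
    """Return True when both strings parse to (nearly) the same number."""
    x, y = parse_amount(a), parse_amount(b)
    if x is None or y is None:
        return False
    return math.isclose(x, y, rel_tol=rel_tol)


# Exact string comparison fails where the normalized comparison succeeds.
print("$3.2B" == "3.2 billion")
print(numeric_equivalent("$3.2B", "3.2 billion"))
```

Note that this sketch drops the currency symbol and treats percentages as fractions, so `12%` and `0.12` would compare equal. Whether that is desirable is a design question for the real scorer.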

## Proposed Solution

- Add a `context` field to the standard `EvalCase` schema (alongside `input`, `expected`, `metadata`) so retrieved chunks can be passed through to scorers natively.
- Provide a `NumericEquivalenceScorer` that normalizes units (B/M/K, $, %) before comparing.
- Allow scorer registration via a decorator (`@braintrust.scorer`), similar to how pytest registers fixtures with `@pytest.fixture`. This would cut most of the wrapping boilerplate.
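As a rough illustration of the decorator idea, the sketch below shows one way a scorer decorator could validate a function's signature and pass through only the fields it asks for, including the proposed `context` field. Everything here is an assumption made for discussion: `scorer`, `ALLOWED_ARGS`, the local `Score` dataclass, and `context_overlap` are stand-ins, not the SDK's actual types or behavior.

```python
import functools
import inspect
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Score:
    """Stand-in result type; a real integration would use the SDK's own class."""
    name: str
    score: float


# Fields a scorer may request. `context` is the addition proposed in this issue.
ALLOWED_ARGS = {"input", "output", "expected", "metadata", "context"}


def scorer(fn: Callable[..., Score]) -> Callable[..., Score]:
    """Hypothetical decorator: check the signature once, then forward only requested fields."""
    params = set(inspect.signature(fn).parameters)
    unknown = params - ALLOWED_ARGS
    if unknown:
        raise TypeError(f"{fn.__name__} requests unsupported fields: {sorted(unknown)}")

    @functools.wraps(fn)
    def wrapper(**case: Any) -> Score:
        # Drop any eval-case fields the scorer did not declare.
        result = fn(**{k: v for k, v in case.items() if k in params})
        if not 0.0 <= result.score <= 1.0:
            raise ValueError(f"{fn.__name__} returned a score outside [0, 1]")
        return result

    return wrapper


@scorer
def context_overlap(output: str, context: list[str]) -> Score:
    """Toy faithfulness check: fraction of output tokens found in the context chunks."""
    tokens = output.lower().split()
    vocab = set(" ".join(context).lower().split())
    hits = sum(token in vocab for token in tokens)
    return Score(name="context_overlap", score=hits / len(tokens) if tokens else 0.0)


result = context_overlap(
    input="What was total revenue?",
    output="revenue was $3.2B",
    expected="$3.2B",
    context=["Total revenue was $3.2B in 2023"],
)
print(result)
```

Validating the signature at decoration time means a misspelled argument name fails when the module is imported rather than partway through an eval run.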

## Example Use Case

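```python
import braintrust
from braintrust import Eval
from autoevals import Score  # assumed source of the Score type

# `@braintrust.scorer`, `NumericEquivalenceScorer`, and the `context`
# argument are the additions proposed in this issue.
@braintrust.scorer
def financial_faithfulness(output: str, context: list[str]) -> Score:
    # Check if numerical claims in output are grounded in context
    ...
    return Score(name="financial_faithfulness", score=0.87)  # placeholder value

# `financial_qa_dataset` and `rag_pipeline` are defined elsewhere.
# Top-level `await` assumes an async context such as a notebook.
await Eval(
    "FinRAG-Eval",
    data=financial_qa_dataset,
    task=rag_pipeline,
    scores=[financial_faithfulness, NumericEquivalenceScorer()],
)
```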

## Context

This request comes out of my work on [finrag-eval](https://github.com/ruthwikchikoti/finrag-eval), a framework for evaluating LLM outputs over financial documents. I'm happy to contribute a PR for the scorer decorator or the `context` field if the team is open to it.

## References
- [Braintrust Eval Docs](https://www.braintrust.dev/docs/guides/evals)
- Related: DeepEval's `LLMTestCase` has a `retrieval_context` field that works similarly
