
Add local trace replay regression harness#494

Draft
Ritwij Aryan Parmar (RitwijParmar) wants to merge 1 commit into braintrustdata:main from RitwijParmar:codex/braintrust-trace-replay-regression

@RitwijParmar

Summary

This adds a local trace replay path for the Python SDK. The goal is to make saved Braintrust trace exports useful as regression cases when iterating on an agent/task or scorer, without creating a new experiment just to sanity-check behavior.

What changed:

  • added braintrust replay for JSON/JSONL span exports
  • added ReplayTrace so replayed scorers can inspect spans with get_spans() and get_thread()-style access
  • added reporting for current scores, baseline root-span scores, score deltas, derived trace metrics, and metric deltas
  • added CI-oriented gates: --min-score, --min-score-delta, and --fail-on-error
  • documented the workflow in the Python README
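
To make the gate semantics easier to review, here is a minimal sketch of how the three CI gates could be evaluated once current and baseline scores are available. This is an illustration only, not the code in this PR: `check_gates`, `GateResult`, and the plain `dict` score shape are hypothetical names chosen for the example, and it assumes a delta is defined as current score minus baseline score.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GateResult:
    """Outcome of the gate check (hypothetical type, not part of the SDK)."""

    passed: bool
    reasons: list[str] = field(default_factory=list)


def check_gates(
    current: dict[str, float],
    baseline: dict[str, float],
    min_score: Optional[float] = None,
    min_score_delta: Optional[float] = None,
    had_error: bool = False,
    fail_on_error: bool = False,
) -> GateResult:
    """Apply --min-score, --min-score-delta, and --fail-on-error style gates.

    Assumption: delta = current score - baseline score, and a score with no
    baseline counterpart is skipped by the delta gate.
    """
    reasons: list[str] = []

    # Errors only fail the run when the caller opted in.
    if fail_on_error and had_error:
        reasons.append("replay raised an error")

    for name, score in sorted(current.items()):
        # Absolute floor on the replayed score.
        if min_score is not None and score < min_score:
            reasons.append(f"{name}: score {score:.3f} < min {min_score:.3f}")

        # Regression check against the saved baseline score.
        if min_score_delta is not None and name in baseline:
            delta = score - baseline[name]
            if delta < min_score_delta:
                reasons.append(
                    f"{name}: delta {delta:+.3f} < min {min_score_delta:+.3f}"
                )

    return GateResult(passed=not reasons, reasons=reasons)
```

In a CI job, a failed result would typically map to a non-zero exit code, with `reasons` printed so the regressing score is visible in the log.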

Why

For agent/eval workflows, production traces often capture the hard cases: tool-call paths, bad intermediate states, or regressions that unit fixtures miss. This gives users a lightweight way to replay those traces locally and fail a check when a task or scorer change regresses against the saved baseline.

Tests

  • PYTHONPATH=py/src .venv/bin/python -m pytest py/src/braintrust/test_trace_replay.py -q
  • PYTHONPATH=py/src .venv/bin/python -m pytest py/src/braintrust/cli/test_push.py py/src/braintrust/test_trace_replay.py -q
  • PYTHONPATH=py/src .venv/bin/python -m compileall -q py/src/braintrust/trace_replay.py py/src/braintrust/test_trace_replay.py py/src/braintrust/cli/__main__.py py/src/braintrust/__init__.py
  • git diff --check
