feat: add check-grader-scores post-processor #1191
Merged
Adds scripts/check-grader-scores.ts, a manual e2e tool that walks examples/**/*.grader-scores.yaml, reads the sibling *.results.jsonl produced by a prior agentv eval --out run, and asserts each (test_id, grader, range) tuple matches expected score ranges.

Ships two grader-scores sidecar files:
- examples/red-team/archetypes/coding-agent/suites/screenshot-pii-upload.grader-scores.yaml
- examples/features/rubric/evals/dataset.grader-scores.yaml

Also updates AGENTS.md with a workflow section and gitignores examples/**/*.results.jsonl.

Closes #1190

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deploying agentv with Cloudflare Pages

| Latest commit: | b853430 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://9d9866bf.agentv.pages.dev |
| Branch Preview URL: | https://feat-1190-check-grader-score.agentv.pages.dev |
Summary

Adds scripts/check-grader-scores.ts, a manual e2e post-processor that walks examples/**/*.grader-scores.yaml, finds the sibling *.results.jsonl produced by a prior agentv eval --out run, and asserts each (test_id, grader, range) tuple falls within the expected score range.

Motivation from #1185: the screenshot-pii-upload red-team suite has test cases expected to fail against frontier models. Without this script, grader regressions (false positives / false negatives) are invisible in CI.

Changes

- scripts/check-grader-scores.ts: auto-discovers grader-scores files, reads JSONL, asserts ranges, exits non-zero on failure
- examples/red-team/archetypes/coding-agent/suites/screenshot-pii-upload.grader-scores.yaml: anchors the format on the evals from feat(red-team): screenshot PII upload eval for coding agents #1185
- examples/features/rubric/evals/dataset.grader-scores.yaml: proves the tool is general
- AGENTS.md: adds a "Checking Grader Score Ranges" workflow section
- .gitignore: ignores examples/**/*.results.jsonl (runtime artifacts)

No changes to packages/core, no schema changes, no changes to existing eval YAMLs.

Manual UAT Evidence

Green (all pass, using baseline JSONL as results)

Red (mutated rubrics score to 0.9 on no-financial-figures-verbatim-in-issue-body)

Closes #1190
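
For reviewers who want a mental model of the check without opening the script, the walk, read, and assert flow described in the summary can be sketched roughly as below. This is not the code in scripts/check-grader-scores.ts. The `Expectation` and `GraderResult` shapes, their field names, the inclusive `min`/`max` bounds, and the `siblingResultsPath` naming rule are all assumptions made for illustration, since neither the sidecar schema nor the JSONL record layout is shown in this description.

```typescript
// Hypothetical shapes: the real sidecar schema and JSONL fields may differ.
interface Expectation {
  testId: string;
  grader: string;
  min: number; // assumed inclusive lower bound
  max: number; // assumed inclusive upper bound
}

interface GraderResult {
  testId: string;
  grader: string;
  score: number;
}

// Assumed naming rule: foo.grader-scores.yaml sits next to foo.results.jsonl.
function siblingResultsPath(sidecarPath: string): string {
  return sidecarPath.replace(/\.grader-scores\.yaml$/, ".results.jsonl");
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): GraderResult[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as GraderResult);
}

// Return one message per expectation that is missing or out of range.
function checkRanges(expected: Expectation[], results: GraderResult[]): string[] {
  const failures: string[] = [];
  for (const e of expected) {
    const hit = results.find((r) => r.testId === e.testId && r.grader === e.grader);
    if (hit === undefined) {
      failures.push(`${e.testId}/${e.grader}: no result found`);
    } else if (hit.score < e.min || hit.score > e.max) {
      failures.push(`${e.testId}/${e.grader}: score ${hit.score} outside [${e.min}, ${e.max}]`);
    }
  }
  return failures;
}
```

In a sketch like this, the caller would exit non-zero whenever `checkRanges` returns any failure. That corresponds to the Red case above, where a mutated score lands outside its expected range, while the Green case returns an empty list.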