Site tier-1 follow-up: per-model deep-dive page #23

Closed
MaxGhenis wants to merge 1 commit into main from model-deepdive-page

@MaxGhenis

Contributor

Summary

Follow-up to #8 and #9. Adds a statically-generated per-model deep-dive page at /model/[id] — one page per model present in data.json.

Rendered sections

1. Headline strip (inside SiteHeader expandedContent, alwaysExpanded)

  • Provider mark (ProviderMark) + model name + provider label
  • Score pills: Global, US, UK (from globalStat.countryScores), Parse rate (nParsed / n)

2. Hardest outputs — top 5 lowest-scoring output groups for this model

  • Aggregated at (country, outputGroup) level using buildAllRows + scorePrediction from lib/sensitivity.ts / lib/scoring.ts
  • The aggregation mirrors the 3-level mean in scoresPerCountryModel: per-row scores → output-group mean → displayed score
  • Shows variable label (getVariableLabel), country tag, and a Badge (same color thresholds as ModelLeaderboard)

3. Sample wrong predictions — up to 10 distinct (country, scenario, variable) cells where relErr > 10% and score < 0.75

  • Sorted by largest relative error first
  • Each card: country tag, variable label, score badge, Prediction / Ground truth / Error columns (currency-formatted for amount outputs, integer for binary)
  • Collapsible <details> block with the model's explanation text
  • Link to /#scenarios for the scenario explorer, plus the scenario ID

4. Back to leaderboard link at page bottom
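
The selection rule for section 3 (sample wrong predictions) can be sketched as below. This is a minimal illustration rather than the code in this PR: the `PredictionRow` shape, the `relativeError` helper, and the handling of a zero ground truth are all assumptions, and scores are assumed to be on a 0..1 scale.

```typescript
// Hypothetical row shape; the real ScoreRow in lib/sensitivity.ts may differ.
interface PredictionRow {
  country: string;
  scenarioId: string;
  variable: string;
  prediction: number;
  truth: number;
  score: number; // assumed to be on a 0..1 scale
}

// Relative error against the ground truth. Treating a zero reference as
// "infinitely wrong unless the prediction is also zero" is an assumption.
function relativeError(prediction: number, truth: number): number {
  const denom = Math.abs(truth);
  if (denom === 0) return prediction === 0 ? 0 : Infinity;
  return Math.abs(prediction - truth) / denom;
}

// Keep cells with relErr > 10% and score < 0.75, largest error first,
// de-duplicated by (country, scenario, variable), capped at `limit`.
function sampleWrongPredictions(rows: PredictionRow[], limit = 10): PredictionRow[] {
  const seen = new Set<string>();
  return rows
    .filter((r) => relativeError(r.prediction, r.truth) > 0.1 && r.score < 0.75)
    .sort((a, b) => {
      const ea = relativeError(a.prediction, a.truth);
      const eb = relativeError(b.prediction, b.truth);
      if (ea === eb) return 0; // also avoids Infinity - Infinity = NaN
      return eb > ea ? 1 : -1;
    })
    .filter((r) => {
      const key = `${r.country}|${r.scenarioId}|${r.variable}`;
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .slice(0, limit);
}
```

Sorting before de-duplicating means that when a cell appears more than once, the copy with the largest error is the one that survives.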

Static routes generation

generateStaticParams collects all model IDs from dashboard.global.modelStats and the union of country-level modelStats, returning one { id } entry per model. The current data produces 12 static routes:

/model/gpt-5.5
/model/claude-sonnet-4.6
/model/claude-opus-4.7
/model/grok-4.20
/model/gemini-3.1-pro-preview
/model/gemini-3-flash-preview
/model/grok-4.3
/model/gemini-3.1-flash-lite-preview
/model/gpt-5.4-mini
/model/grok-4.1-fast
/model/claude-haiku-4.5
/model/gpt-5.4-nano
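
A rough sketch of the ID collection described above, for readers who want to see the shape of it. The `Dashboard` and `ModelStat` types and the `countries` field are assumptions made for illustration; only `dashboard.global.modelStats` is named in this PR. The real `generateStaticParams` would be an exported function in the route file that reads the bundled data itself, whereas this version takes the data as an argument so it can be run in isolation.

```typescript
// Hypothetical, trimmed-down data shape; only `global.modelStats` is named in the PR.
interface ModelStat {
  model: string;
}

interface Dashboard {
  global: { modelStats: ModelStat[] };
  countries: Record<string, { modelStats: ModelStat[] }>;
}

// Union of model IDs across the global stats and every country's stats,
// in first-seen order.
function collectModelIds(dashboard: Dashboard): string[] {
  const ids = new Set<string>();
  for (const stat of dashboard.global.modelStats) ids.add(stat.model);
  for (const code of Object.keys(dashboard.countries)) {
    for (const stat of dashboard.countries[code].modelStats) ids.add(stat.model);
  }
  return Array.from(ids);
}

// One `{ id }` entry per model, matching the /model/[id] dynamic segment.
function generateStaticParams(dashboard: Dashboard): { id: string }[] {
  return collectModelIds(dashboard).map((id) => ({ id }));
}
```

Taking the union rather than only the global list means a model that appears in a single country's stats would still get a page.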

Library reuse

| Library | Used for |
| --- | --- |
| lib/scoring.ts (scorePrediction, metricTypeForVariable) | Per-row score computation, metric type for display formatting |
| lib/sensitivity.ts (buildAllRows) | Builds the full ScoreRow[] for all countries, filtered to the model |
| lib/bootstrap.ts | Not needed for static server render; omitted |

Scoring math

For each (country, outputGroup) pair, the displayed score is the mean of per-row scores (each scorePrediction result × 100) across all scenarios and person-expanded variables that map to that output group. This is equivalent to the inner two levels of the 3-level mean in scoresPerCountryModel.
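
As an illustration of the aggregation just described, the sketch below groups per-row scores by (country, outputGroup), averages them on a 0..100 scale, and returns the lowest-scoring groups. It does not call the real `buildAllRows` or `scorePrediction`; the `ScoredRow` and `GroupScore` types and the `hardestOutputGroups` name are hypothetical, and the outermost level of the 3-level mean (averaging across output groups) is deliberately left out.

```typescript
// Hypothetical input: one entry per scored (scenario, variable) row.
interface ScoredRow {
  country: string;
  outputGroup: string;
  score: number; // per-row score, assumed to be on a 0..1 scale
}

interface GroupScore {
  country: string;
  outputGroup: string;
  score: number; // mean per-row score x 100
  n: number; // number of rows in the group
}

// Mean of per-row scores within each (country, outputGroup) pair,
// then the `topN` lowest-scoring groups.
function hardestOutputGroups(rows: ScoredRow[], topN = 5): GroupScore[] {
  const groups = new Map<string, { country: string; outputGroup: string; sum: number; n: number }>();
  for (const row of rows) {
    const key = `${row.country}|${row.outputGroup}`;
    const g = groups.get(key) ?? { country: row.country, outputGroup: row.outputGroup, sum: 0, n: 0 };
    g.sum += row.score * 100;
    g.n += 1;
    groups.set(key, g);
  }
  return Array.from(groups.values())
    .map((g) => ({ country: g.country, outputGroup: g.outputGroup, score: g.sum / g.n, n: g.n }))
    .sort((a, b) => a.score - b.score)
    .slice(0, topN);
}
```

Keeping `n` alongside the mean makes it easy to spot groups whose low score rests on very few rows.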

Smoke test

bun run lint   # clean (0 errors, 0 warnings)
bun run build  # clean — /model/[id] SSG route with 12 paths in build output

Build output excerpt:

● /model/[id]
│ ├ /model/gpt-5.5
│ ├ /model/claude-sonnet-4.6
│ ├ /model/claude-opus-4.7
│ └ [+9 more paths]

Test plan

  • CI passes
  • Visit /model/gpt-5.5 — headline shows Global / US / UK scores, parse rate
  • Verify "Top 5 lowest-scoring outputs" renders 5 rows with country tags and score badges
  • Verify "Sample errors" section renders cards with prediction / ground truth / error
  • Expand a model explanation <details> block
  • Click "View in scenario explorer →" — lands on /#scenarios
  • Click "← Back to leaderboard" — returns to /
  • Visit /model/nonexistent-model — returns 404
  • Mobile width — score pills wrap cleanly under provider mark

🤖 Generated with Claude Code

Scheduled follow-up agent — opened after confirming both #8 and #9 are merged.


Generated by Claude Code

Statically generates a dedicated page for each of the 12 models in
data.json, using generateStaticParams so the entire site stays a
pure static export.

Each page renders:
- Headline strip: provider mark, model name, global/US/UK scores,
  parse-rate pill — all sourced from globalStat.countryScores.
- Hardest outputs: top-5 lowest-scoring output groups (country × outputGroup)
  computed by reusing buildAllRows/scorePrediction from lib/sensitivity.ts
  and lib/scoring.ts, aggregated the same way as the headline scorer.
- Sample wrong predictions: up to 10 (scenario, variable) cells where
  relErr > 10% and score < 0.75, sorted by largest relative error,
  with prediction / ground-truth / error columns plus a collapsible
  model explanation and a link back to /#scenarios.
- Back to leaderboard link.

Reuses SiteHeader (alwaysExpanded + actionLink back to /),
the Badge color scheme from ModelLeaderboard, and Tailwind v4
design-token classes throughout.

Build smoke-test: `bun run build` produces the /model/[id] SSG route
with all 12 model paths; `bun run lint` is clean.

https://claude.ai/code/session_01DS3KJmEye7o7ff18RdthTC
@vercel

vercel Bot commented May 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policybench-site | Ready | Preview, Comment | May 16, 2026 1:11pm |

@MaxGhenis

Contributor Author

Superseded by #70, which rebuilds the per-model deep-dive on the current stack (post-#58 metrics, post-#61 split-data import) and adds explorer deep links from each hardest case. Closing.

@MaxGhenis MaxGhenis closed this Jun 10, 2026
MaxGhenis added a commit that referenced this pull request Jun 10, 2026
Every model gets a statically generated page at /model/[id]: country
ranks and headline pills, per-program scores sorted hardest-first,
binary eligibility-flag accuracy computed from prediction rows, and the
model's worst misses on positive references, each linking into the
scenario explorer. Pages are server components over the bundled
summary, so they ship no additional client JS, and each carries its own
metadata for social previews. Leaderboard rows and the explorer's
detail dialog link to them; the sitemap lists them.

The explorer now mirrors its state into the URL (?scenario=...,
?cell=variable~model) via replaceState, applies deep links on mount
(turning off the frontier-only filter when the linked model needs it),
and clears both params on country switch since ids are
country-specific.
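
A minimal sketch of how the `?scenario=...` and `?cell=variable~model` parameters could be serialized and parsed. The function and type names are hypothetical, not the ones in the commit, and splitting on the first `~` assumes variable names never contain that character.

```typescript
// Hypothetical explorer state mirrored into the URL.
interface ExplorerState {
  scenario?: string;
  cell?: { variable: string; model: string };
}

// Returns a new query string with the explorer params replaced by `state`,
// leaving unrelated params untouched. Passing {} clears both explorer params,
// as on a country switch.
function writeExplorerParams(search: string, state: ExplorerState): string {
  const params = new URLSearchParams(search);
  params.delete("scenario");
  params.delete("cell");
  if (state.scenario) params.set("scenario", state.scenario);
  if (state.cell) params.set("cell", `${state.cell.variable}~${state.cell.model}`);
  const out = params.toString();
  return out ? `?${out}` : "";
}

// Parses a deep link back into state; malformed `cell` values are ignored.
function readExplorerParams(search: string): ExplorerState {
  const params = new URLSearchParams(search);
  const state: ExplorerState = {};
  const scenario = params.get("scenario");
  if (scenario) state.scenario = scenario;
  const cell = params.get("cell");
  if (cell) {
    const sep = cell.indexOf("~");
    if (sep > 0 && sep < cell.length - 1) {
      state.cell = { variable: cell.slice(0, sep), model: cell.slice(sep + 1) };
    }
  }
  return state;
}

// In the browser, the result would be applied without adding a history entry:
//   window.history.replaceState(null, "", window.location.pathname + query);
```

Using `replaceState` rather than `pushState` keeps the back button from stepping through every cell the user clicked.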

Deep-linking exposed a latent UX bug: the explanation sidecar fetch was
viewport-gated, but an open dialog makes the background inert, so a
deep-linked dialog could never trigger the fetch and showed "Loading
explanation text" forever. Opening any detail dialog now starts the
fetch directly.

Supersedes the stale draft #23, rebuilt on the split-data stack.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
MaxGhenis added a commit that referenced this pull request Jun 10, 2026
* Add per-model pages and explorer deep links

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Trigger CI after retarget to main

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>