Benchmarking spec-kit presets — how are you measuring quality? #2355
kennedy-whytech started this conversation in General
1 comment · 3 replies
> Do you think it would extend beyond presets to the extensions as well as the core commands?
Curious how people are thinking about preset quality and spec-kit in general.
Right now there's no standard way to answer the question "did this preset do a good job?" You can eyeball the output, but that doesn't scale across preset versions, contributors, or different AI models.
The question: how are people measuring this today? Token spend, output correctness, coverage of the spec, something else?
What if spec-kit shipped a standard benchmark harness — a fixed set of input specs and deterministic pass/fail criteria — so any preset/spec-kit core can be scored consistently?
The shape I'm imagining:
- `spec-input.md` fixtures covering common cases (CRUD API, CLI tool, data pipeline, etc.)

Important: this isn't a gate. The benchmark wouldn't block PRs or enforce a minimum score — it's purely a reference. The goal is to give preset authors and users a shared understanding of the potential impact of a change: does this new preset version cost significantly more tokens? Does it cover more of the spec? That kind of visibility is useful even when the answer is "it regressed slightly but the tradeoff is worth it."
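To make the idea concrete, here is a minimal sketch of what deterministic scoring could look like. Everything here is hypothetical — `Fixture`, `Score`, `score_output`, and `required_sections` are illustrative names, not part of spec-kit — and "tokens" is approximated with a crude whitespace split rather than a real tokenizer:

```python
from dataclasses import dataclass

# Hypothetical shapes for benchmark fixtures and scores;
# spec-kit defines no such API today.

@dataclass
class Fixture:
    name: str                      # e.g. "crud-api"
    spec: str                      # contents of a spec-input.md fixture
    required_sections: list[str]   # deterministic pass criteria

@dataclass
class Score:
    fixture: str
    covered: int   # required sections found in the preset's output
    total: int
    tokens: int    # rough token spend (whitespace-split proxy)

    @property
    def coverage(self) -> float:
        return self.covered / self.total if self.total else 1.0

def score_output(fixture: Fixture, output: str) -> Score:
    """Score one preset run: spec coverage plus a crude token count."""
    covered = sum(1 for s in fixture.required_sections if s in output)
    return Score(fixture.name, covered, len(fixture.required_sections),
                 tokens=len(output.split()))

# Usage: score a fake "preset output" against a CRUD fixture.
fx = Fixture("crud-api", "# Spec: a CRUD API",
             ["## Endpoints", "## Data model", "## Errors"])
out = "## Endpoints\nGET /items\n## Data model\nItem(id, name)"
s = score_output(fx, out)
print(f"{s.fixture}: coverage={s.coverage:.2f}, tokens={s.tokens}")
```

The point of keeping the criteria this dumb (literal section presence, raw counts) is that two contributors running the same preset on the same fixture always get the same score, so version-to-version diffs are meaningful even without blocking anything.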
Is this something the community wants? And if so, what fixtures or criteria would actually be meaningful to you?