Benchmarking spec-kit presets — how are you measuring quality? #2355
kennedy-whytech started this conversation in General
1 comment · 3 replies
> Do you think it would extend beyond presets to the extensions as well as the core commands?
Curious how people are thinking about preset quality and spec-kit in general.
Right now there's no standard way to answer the question "did this preset do a good job?" You can eyeball the output, but that doesn't scale across preset versions, contributors, or different AI models.
The question: how are people measuring this today? Token spend, output correctness, coverage of the spec, something else?
What if spec-kit shipped a standard benchmark harness — a fixed set of input specs and deterministic pass/fail criteria — so any preset/spec-kit core can be scored consistently?
The shape I'm imagining:
- `spec-input.md` fixtures covering common cases (CRUD API, CLI tool, data pipeline, etc.)

Important: this isn't a gate. The benchmark wouldn't block PRs or enforce a minimum score — it's purely a reference. The goal is to give preset authors and users a shared understanding of the potential impact of a change: does this new preset version cost significantly more tokens? Does it cover more of the spec? That kind of visibility is useful even when the answer is "it regressed slightly but the tradeoff is worth it."
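To make the idea concrete, here is a minimal sketch of what deterministic scoring could look like. Everything here is hypothetical — `Fixture`, `Score`, `score_output`, and `required_sections` are illustrative names, not part of spec-kit — and "tokens" is approximated with a crude whitespace split rather than a real tokenizer:

```python
from dataclasses import dataclass

# Hypothetical shapes for benchmark fixtures and scores;
# spec-kit defines no such API today.

@dataclass
class Fixture:
    name: str                      # e.g. "crud-api"
    spec: str                      # contents of a spec-input.md fixture
    required_sections: list[str]   # deterministic pass criteria

@dataclass
class Score:
    fixture: str
    covered: int   # required sections found in the preset's output
    total: int
    tokens: int    # rough token spend (whitespace-split proxy)

    @property
    def coverage(self) -> float:
        return self.covered / self.total if self.total else 1.0

def score_output(fixture: Fixture, output: str) -> Score:
    """Score one preset run: spec coverage plus a crude token count."""
    covered = sum(1 for s in fixture.required_sections if s in output)
    return Score(fixture.name, covered, len(fixture.required_sections),
                 tokens=len(output.split()))

# Usage: score a fake "preset output" against a CRUD fixture.
fx = Fixture("crud-api", "# Spec: a CRUD API",
             ["## Endpoints", "## Data model", "## Errors"])
out = "## Endpoints\nGET /items\n## Data model\nItem(id, name)"
s = score_output(fx, out)
print(f"{s.fixture}: coverage={s.coverage:.2f}, tokens={s.tokens}")
```

The point of keeping the criteria this dumb (literal section presence, raw counts) is that two contributors running the same preset on the same fixture always get the same score, so version-to-version diffs are meaningful even without blocking anything.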
Is this something the community wants? And if so, what fixtures or criteria would actually be meaningful to you?