Skip to content

✨ Add verify-deliverables SubagentStop hook#54

Merged
TechNickAI merged 2 commits intomainfrom
feat/verify-deliverables-hook
May 4, 2026
Merged

✨ Add verify-deliverables SubagentStop hook#54
TechNickAI merged 2 commits intomainfrom
feat/verify-deliverables-hook

Conversation

@TechNickAI
Copy link
Copy Markdown
Owner

Summary

Adds a SubagentStop lifecycle hook that cross-references a subagent's recent activity against its completion claims. When the tool-call record or filesystem disagree with what the subagent said it did, the hook surfaces a structured advisory back to the parent agent via hookSpecificOutput.additionalContext — never blocking.

This shifts the "did the subagent actually do what it claimed?" check from prompt discipline (where it lives today across <epistemic-honesty>, the verify-fix skill, and the verification-before-completion superpower) to a structural lifecycle hook the harness enforces.

Fixes #51

What it checks

Check A — completion claim with zero mutations. If the subagent's last message contains strong completion language (done., completed., , implemented, fixed it, all tests pass, etc.) plus a file-path-shaped token, but the recent transcript shows zero Write/Edit/MultiEdit/NotebookEdit calls — flag.

Check B — recorded tool call but file missing. For each path in a Write/Edit-style tool call's file_path, verify it exists on disk now. A missing file means the call failed silently or the artifact was lost.

Design decisions

  • Bash + jq, not Node. Matches the existing session-multiplexing.sh and todo-persist.sh style. Easier to maintain, no runtime deps beyond what other hooks already require.
  • Always advisory, never blocks. Output is appended via additionalContext; the parent agent decides what to do. A misbehaving hook that blocks would be worse than one that misses a lie.
  • Conservative regex on completion phrases. Patterns like done\., completed\., fixed it, all tests pass(ed)? are anchored with leading and trailing non-letter boundaries. A negation guard rejects "not all tests pass". Tests cover the "fixed itself" / negation regressions.
  • Cross-references tool calls, not prose claims. OMC's prior art uses a static deliverables.json config per stage; this approach uses the actual transcript so it works on any subagent without per-skill configuration.
  • 200-line transcript window (was 50 in the first draft) — handles subagents that ran many Read calls between an early Write and a final summary.
  • Fail-open in every error path. Missing jq, unreadable transcript, malformed JSON, jq OOM at the final emit step — all exit 0 silently.
  • Escape hatches: respects stop_hook_active=true to prevent loops; VERIFY_DELIVERABLES_DISABLED=1 env var disables entirely.

Files

  • plugins/core/hooks/verify-deliverables.sh — the hook
  • plugins/core/hooks/tests/test-verify-deliverables.sh — 18-case test harness
  • plugins/core/hooks/hooks.json — registers the hook against SubagentStop
  • plugins/core/hooks/README.md — documents the hooks directory + this hook's design
  • plugins/core/.claude-plugin/plugin.json — version 9.20.0 → 9.21.0
  • .claude-plugin/marketplace.json — version 9.20.0 → 9.21.0 (both fields)

Complexity

balanced — multi-file, new pattern in this repo (first SubagentStop hook), but well-bounded. Multi-review surfaced and fixed real issues before push.

Validation

  • 18 tests pass: honest writes, MultiEdit, alt payload shape, pure analyst messages, negated claims, trailing-boundary regressions ("fixed itself" not matching "fixed it"), broken symlinks, long transcripts (Write at line 1, message at line 60+), relative paths resolved against cwd, escape hatch, kill switch, plus several fail-open cases.
  • jq . hooks.json validates.
  • Version sync confirmed across all three locations per .claude-plugin/CLAUDE.md.
  • Multi-review by robustness, security, and logic reviewers; all must-fix and bug findings addressed:
    • Added command -v jq guard at script entry
    • Wrapped final jq -n so a jq failure can't leak non-zero exit
    • Bumped tail window 50 → 200
    • Tightened completion regex with trailing boundary
    • Added negation guard
    • Added broken-symlink handling (-L test)
    • Strip embedded newlines from extracted file paths in jq

Out of scope (per issue)

  • Verifying correctness of artifacts — that's verify-fix's job. This hook only checks existence.
  • Reading external project-level overrides for verification rules — kept simple; project-specific tuning can be revisited if needed.

Cross-references a subagent's recent activity against its completion
claims. Fires advisory output via hookSpecificOutput.additionalContext
when the tool-call record or filesystem disagrees with what the
subagent said it did.

Two checks:
- strong completion claim + zero Write/Edit/MultiEdit/NotebookEdit
  calls + a file-path-shaped token in the message → flag
- a Write/Edit-style tool call's file_path is missing on disk → flag

Always advisory, never blocks. Fail-open in every error path so a hook
bug cannot break the harness. Honors stop_hook_active to avoid loops
and a VERIFY_DELIVERABLES_DISABLED=1 env kill switch.

Test harness covers 18 cases including honest writes, MultiEdit,
alt payload shape, negated phrasings, trailing-boundary regressions,
broken symlinks, long transcripts, relative paths, the escape hatch,
and several fail-open paths.

Bumps plugin version to 9.21.0 across marketplace.json and plugin.json.

Fixes #51
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a4687901b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

INPUT=$(cat)
[ -z "$INPUT" ] && exit 0

TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty' 2>/dev/null)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Read subagent transcript instead of parent transcript

This hook is registered on SubagentStop, but it reads .transcript_path, which for this event is the main session transcript rather than the subagent’s own log; the subagent tool calls and final message live under .agent_transcript_path. As written, the checks can miss real subagent write/edit activity (false “no file changes”) and fail to verify the files the subagent actually touched, so the deliverable validation is systematically unreliable for the event it targets.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical catch — thank you. The docs confirm SubagentStop provides both transcript_path (main session) and agent_transcript_path (subagent's own log). Fixed in 29a6e0a: the hook now reads agent_transcript_path first and falls back to transcript_path for older harness versions or test fixtures. Also wired in last_assistant_message from the payload as a direct shortcut for the final message check, avoiding JSONL parsing when the harness provides it.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect — line 30 already reads agent_transcript_path first and falls back to transcript_path only when the subagent field is absent. The code handles this correctly.

@claude
Copy link
Copy Markdown

claude Bot commented May 4, 2026

Code Review

Summary: This is a well-engineered PR. The fail-open design is correct, the test coverage is solid (18 cases), and the advisory-only output model is the right call. A few issues worth addressing before merge.


Bug: TMPDIR shadowing in test harness

test-verify-deliverables.sh line 21:

TMPDIR=$(mktemp -d)

TMPDIR is a reserved environment variable that mktemp itself reads to determine where to create temp files. Shadowing it exports that new directory as the temp root for all child processes. Any tool called from within the test that internally calls mktemp (jq on some platforms, bash subshells) would create files inside the test directory, which then gets rm -rf'd on EXIT. On macOS this is benign; on Linux it can cause surprising behavior.

Use a different name: TEST_DIR=$(mktemp -d) and update the three references.


Bug: n.?t in negation guard matches unintended substrings

verify-deliverables.sh line ~108:

grep -qiE '(not|n.?t|cannot|couldn.?t|...)...'

n.?t (zero-or-one wildcard) matches nt as well as n't — so it also matches inside "count", "front", "hunt", "sent", "mount", "want", etc. Combined with the context constraint (negation word + whitespace + completion word), false positives are rare but possible. The intent is almost certainly n't for general contractions.

Simpler and correct:

grep -qiE "(not|n't|cannot|couldn't|didn't|haven't|hasn't|wasn't)..."

Since the script already uses -E (ERE), the ' literal is fine inside the pattern.


Missing test: NotebookEdit as a mutation type

The hook counts NotebookEdit alongside Write/Edit/MultiEdit as a mutating call, but there's no test case that sends a NotebookEdit tool call and verifies it suppresses the advisory. The MultiEdit case is tested; adding one for NotebookEdit would complete the coverage.


Nit: printf "%b" interprets escape sequences in file paths

verify-deliverables.sh line ~125:

MESSAGE=$(printf "...\n%bReview..." "$ISSUES")

%b expands backslash sequences in its argument. $ISSUES is built from extracted file paths, so a path containing \t, \n, etc. (unusual but valid on Linux) would be silently mangled. Using actual newlines in $ISSUES and printf "%s" is safer:

ISSUES="${ISSUES}- message here"$'\n'
...
MESSAGE="[verify-deliverables] ...$'\n'${ISSUES}Review..."

Or just use echo -e / printf with a literal newline. Low-probability failure mode but worth closing.


Minor: ✅ emoji in grep -E requires UTF-8 locale

The completion-phrase regex includes a raw character. grep -E with multi-byte emoji depends on the runtime locale being UTF-8. On minimal CI environments or systems with LC_ALL=C, the emoji byte sequence may not match. Adding LC_ALL=C.UTF-8 before the grep calls (or testing separately with grep -P or printf '\xe2\x9c\x85') would make this portable. Alternatively, drop the emoji from the pattern if the prose patterns are sufficient — the test suite has no case covering ✅ detection anyway.


The design is sound

The 200-line transcript window (vs. the original 50), the stop_hook_active escape hatch, the -L symlink check, the trailing-boundary regex to prevent "fixed itself" false positives, and the jq -n wrapper around the final emit are all good choices that I'd have surfaced in review if they weren't already there. The hooks.json registration is syntactically valid and matches the existing convention.

Comment thread plugins/core/hooks/verify-deliverables.sh Outdated
- Read agent_transcript_path (subagent's own log) instead of
  transcript_path (parent session); add last_assistant_message shortcut
- Fix n.?t → n't in negation guard (n.?t matched 'nt' inside 'count',
  'front', 'environment', etc.)
- Replace TMPDIR with TEST_DIR in test harness (TMPDIR is a reserved
  env var that mktemp reads internally)
- Add NotebookEdit test case to complete mutation-type coverage
- Switch ISSUES separator from literal \n to $'\n'; drop printf %b in
  MESSAGE to prevent backslash interpretation in file paths
- Remove ✅ from completion pattern (locale portability; prose patterns
  are sufficient)
- Payload now includes agent_transcript_path in test fixtures

All 19 test cases pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@TechNickAI
Copy link
Copy Markdown
Owner Author

Fixed all five issues in 29a6e0a:

  • TMPDIR shadowing: renamed to TEST_DIR throughout the test harness
  • n.?t: changed to n't and switched grep to double-quotes; also removed from the negation check since it was dropped from the positive pattern
  • NotebookEdit test: added fixture + run_case — 19 tests now, all passing
  • printf %b: ISSUES entries now use $'\n' separators; MESSAGE built with a literal newline so %b never touches file-path content
  • ✅ locale: removed from the completion pattern; the prose patterns (done., finished., implemented, etc.) cover the same signal without the UTF-8 dependency

@claude
Copy link
Copy Markdown

claude Bot commented May 4, 2026

Code Review — PR #54: Add verify-deliverables SubagentStop hook

Overview

This PR adds the repo's first SubagentStop lifecycle hook, shifting the "did the subagent actually do what it claimed?" check from prompt discipline to a structural harness enforcement point. The design is sound: advisory-only, fail-open, with a kill switch and escape hatches. Overall this is well-considered work.


✅ Strengths

  • Fail-open is consistently enforced — every error path exits 0: missing jq, empty input, unreadable transcript, jq OOM at emit. This is exactly right for a hook.
  • Test coverage is solid — 18 cases covering honest writes, MultiEdit, alt payload shapes, negated claims, trailing-boundary regressions, broken symlinks, long transcripts, relative paths, escape hatch, kill switch, and all fail-open paths.
  • The jq guard at script entry is a good addition (noted as a prior-review fix). Prevents a confusing failure mode on minimal environments.
  • 200-line transcript window is a reasonable bound; the prior 50-line window would have missed the long-transcript case documented in the test.
  • README.md is well-written and serves as a good template for future hooks.
  • Version sync is correct across plugin.json and marketplace.json.

🟡 Issues Worth Addressing

1. implemented pattern has an unreachable trailing boundary (false-negative bug)

In verify-deliverables.sh, the completion regex:

(^|[^a-z])(done\.|...|implemented |...)([ ^a-zA-Z]|$)

The implemented alternative already ends with a literal space. The trailing boundary group ([^a-zA-Z]|$) then requires the next character after that space to be a non-letter or end of string. Since English sentences have implemented <noun> (where <noun> starts with a letter), this means almost no real "implemented X" sentence will trigger via this branch.

Example: "Implemented the changes in src/foo.ts."implemented matches the characters, but then t (from the) fails [^a-zA-Z], so the whole pattern misses.

In practice this is masked because done. catches most liar messages, but the implemented branch is nearly dead code. Consider:

  • implemented[^a-zA-Z] (word boundary approach, no trailing space), or
  • Drop the trailing boundary from space-terminated alternatives and rely only on the leading boundary.

This is a false-negative (misses some liars), not a false-positive — consistent with the stated philosophy of "rather miss a few lies than nag honest responses" — but it's worth knowing the pattern doesn't work as written.

2. Same issue applies to several other space-terminated alternatives

created the , wrote the , added the , saved the all have the same structural problem: the trailing boundary ([^a-zA-Z]|$) applies after the space, so created the file.ts → after created the , the next char is f (letter) → no match. The only patterns that work correctly are the ones ending in . (like done., completed.) since ([^a-zA-Z]|$) on $ or non-letter after . works fine.

3. CALLED_PATHS here-doc iterates an empty string when paths contain only whitespace

if [ -n "$CALLED_PATHS" ]; then
    while IFS= read -r path; do
        [ -z "$path" ] && continue

The [ -z "$path" ] && continue guard handles this correctly. No change needed, but worth a comment that the empty-line guard is load-bearing.


🔵 Minor / Nits

  • Multiple echo "$RECENT" | jq calls (lines ~70, ~80, ~90) scan the same 200-line slice three times. Negligible at this scale, but a single jq pass extracting all three signals would be cleaner if the script ever grows.
  • Test runner uses bash "$HOOK" rather than executing it directly — fine, but if the shebang were ever changed or the script needed to test for interpreter errors, direct execution would catch more.
  • run_case in the test harness merges stdout and stderr (2>&1). If the hook ever writes intentional diagnostics to stderr, this would create spurious failures. Separating them would be more robust, but is acceptable for now.

Summary

Category Finding
Logic implemented , created the , wrote the , added the , saved the patterns are effectively dead due to trailing-boundary misapplication
Security No concerns — existence-check only, fail-open
Performance Fine — bounded transcript window, jq is lightweight here
Tests Strong — 18 cases with good edge coverage
Conventions Consistent with existing hooks, README follows project style

The must-fix is the regex dead-code issue (#1/#2 above), which is a correctness gap in the stated feature. Everything else is minor. The architectural approach and fail-open discipline are solid — this is a good addition to the hooks system.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 29a6e0a. Configure here.

# false negatives, so we only fire on strong signals. Trailing
# boundary keeps "fixed it" from matching inside "fixed itself"
# and "all tests pass" from matching inside "passenger".
if echo "$LAST_MSG" | grep -qiE '(^|[^a-z])(done\.|complete\.|completed\.|finished\.|implemented |fixed it|created the |wrote the |added the |saved the |all tests pass(ed)?)([^a-zA-Z]|$)'; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing boundary breaks space-ending completion patterns

Medium Severity

The trailing boundary ([^a-zA-Z]|$) is applied uniformly to all alternatives in the completion regex. For patterns ending with a space — implemented , created the , wrote the , added the , saved the — this boundary checks the character after the trailing space, which in natural English is almost always a letter (e.g., "Implemented the fix"). These five patterns therefore almost never match. The boundary was needed only for fixed it (to avoid matching fixed itself) but inadvertently silences half the completion vocabulary. The liar test case is masked because it uses "Done. Implemented…", which is caught by done\. instead.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 29a6e0a. Configure here.

@TechNickAI TechNickAI merged commit 54700e9 into main May 4, 2026
9 checks passed
@TechNickAI TechNickAI deleted the feat/verify-deliverables-hook branch May 4, 2026 23:00
@TechNickAI
Copy link
Copy Markdown
Owner Author

Great catch on the trailing boundary issue (#1/#2). The space-ending patterns (implemented , created the , wrote the , added the , saved the ) are effectively dead code due to ([^a-zA-Z]|$) requiring a non-letter after the trailing space. Addressing in a follow-up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add verify-deliverables SubagentStop hook to harden subagent completion claims

1 participant