From 5a4687901b963387d30308e1027b1ac7366bb71e Mon Sep 17 00:00:00 2001 From: Nick Sullivan Date: Mon, 4 May 2026 15:30:03 -0500 Subject: [PATCH 1/2] =?UTF-8?q?=E2=9C=A8=20Add=20verify-deliverables=20Sub?= =?UTF-8?q?agentStop=20hook?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cross-references a subagent's recent activity against its completion claims. Fires advisory output via hookSpecificOutput.additionalContext when the tool-call record or filesystem disagrees with what the subagent said it did. Two checks: - strong completion claim + zero Write/Edit/MultiEdit/NotebookEdit calls + a file-path-shaped token in the message → flag - a Write/Edit-style tool call's file_path is missing on disk → flag Always advisory, never blocks. Fail-open in every error path so a hook bug cannot break the harness. Honors stop_hook_active to avoid loops and a VERIFY_DELIVERABLES_DISABLED=1 env kill switch. Test harness covers 18 cases including honest writes, MultiEdit, alt payload shape, negated phrasings, trailing-boundary regressions, broken symlinks, long transcripts, relative paths, the escape hatch, and several fail-open paths. Bumps plugin version to 9.21.0 across marketplace.json and plugin.json. Fixes #51 --- .claude-plugin/marketplace.json | 4 +- plugins/core/.claude-plugin/plugin.json | 2 +- plugins/core/hooks/README.md | 85 +++++++ plugins/core/hooks/hooks.json | 11 + .../hooks/tests/test-verify-deliverables.sh | 215 ++++++++++++++++++ plugins/core/hooks/verify-deliverables.sh | 132 +++++++++++ 6 files changed, 446 insertions(+), 3 deletions(-) create mode 100644 plugins/core/hooks/README.md create mode 100755 plugins/core/hooks/tests/test-verify-deliverables.sh create mode 100755 plugins/core/hooks/verify-deliverables.sh diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index 7ef97f4..4d7e2ba 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -6,7 +6,7 @@ }, "metadata": { "description": "Professional AI coding configurations, agents, skills, and context for Claude Code and Cursor", - "version": "9.20.0", + "version": "9.21.0", "license": "MIT", "repository": "https://github.com/TechNickAI/ai-coding-config" }, @@ -15,7 +15,7 @@ "name": "ai-coding-config", "source": "./plugins/core", "description": "Commands, agents, skills, and context for AI-assisted development workflows", - "version": "9.20.0", + "version": "9.21.0", "tags": ["commands", "agents", "skills", "workflows", "essential"] } ] diff --git a/plugins/core/.claude-plugin/plugin.json b/plugins/core/.claude-plugin/plugin.json index c1e4583..6d4fcaf 100644 --- a/plugins/core/.claude-plugin/plugin.json +++ b/plugins/core/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "ai-coding-config", - "version": "9.20.0", + "version": "9.21.0", "description": "Commands, agents, skills, and context for AI-assisted development workflows", "author": { "name": "TechNickAI", diff --git a/plugins/core/hooks/README.md b/plugins/core/hooks/README.md new file mode 100644 index 0000000..e21530e --- /dev/null +++ b/plugins/core/hooks/README.md @@ -0,0 +1,85 @@ +# Hooks + +Lifecycle hooks that fire during Claude Code sessions. Each hook is a small bash +script registered in `hooks.json` against a Claude Code event. + +## Conventions + +- Bash scripts (`.sh`), executable, with a `#!/bin/bash` shebang. +- Read JSON payload from stdin via `cat`, parse fields with `jq -r '.field // empty'`. +- Write any guidance for the model to stdout — Claude Code surfaces stdout as + context. Scripts that have nothing to say print nothing. +- Always `exit 0` unless the hook contract specifically supports blocking. A + failing hook that returns non-zero can disrupt the harness. +- Defensive parsing: handle missing fields, empty input, malformed JSON gracefully. + Fail open. +- Match registration in `hooks.json`. Use `${CLAUDE_PLUGIN_ROOT}/hooks/` for + the command path. + +## Installed Hooks + +### `session-multiplexing.sh` — `SessionStart` + +Counts recently-touched session marker files. When 3+ Claude Code sessions are +active, prints a reminder that the user is juggling windows so the assistant +includes explicit context (repo, branch, current task) in every status update. + +### `todo-persist.sh` — `PostToolUse` (TodoWrite) + +Mirrors the model's todo list to a human-readable `todos.md` in the project +directory each time `TodoWrite` runs. Survives compaction and lets the user see +in-flight work without the harness UI. + +### `verify-deliverables.sh` — `SubagentStop` + +Cross-references a subagent's recent activity against its completion claims. +Surfaces a structured advisory back to the parent agent when the tool-call record +or filesystem disagrees with what the subagent said it did. Always advisory, +never blocking. + +**What it checks:** + +- **Completion claim with zero mutations.** If the subagent's final message + contains strong completion language ("done.", "implemented X", "all tests pass", + ✅) and a file-path-shaped token, but the recent transcript has zero + `Write`/`Edit`/`MultiEdit`/`NotebookEdit` calls — flagged as a likely lie. +- **Tool call recorded but artifact missing.** For each path the subagent's + tool calls claim to have written, verifies the file exists on disk now. A + missing file means the tool failed silently, was deleted, or the call was + fabricated. + +**What it doesn't do:** + +- Doesn't verify *correctness* — only existence. Use `verify-fix` for behavioral + validation. +- Doesn't block the subagent from stopping. Output is added to + `hookSpecificOutput.additionalContext` so the parent agent sees the + discrepancy and decides what to do. +- Doesn't parse natural-language claims aggressively. Patterns are conservative + to keep false-positive rate low; we'd rather miss a few lies than nag on + honest analytical responses. + +**Escape hatches:** + +- Set `VERIFY_DELIVERABLES_DISABLED=1` in the environment to bypass entirely. +- The hook respects `stop_hook_active=true` and exits silently when the harness + signals it has already fired once this stop cycle, so it cannot loop. +- Any internal error or unreadable transcript causes a silent pass — a broken + hook never breaks the harness. + +**Why this exists:** + +The `` block in `CLAUDE.md`, the `verify-fix` skill, and the +`verification-before-completion` superpower all try to prevent subagents from +claiming work they haven't done. Prompts are advisory; this hook is structural. +See [issue #51](https://github.com/TechNickAI/ai-coding-config/issues/51) for +the design discussion. + +## Running Tests + +```bash +bash plugins/core/hooks/tests/test-verify-deliverables.sh +``` + +Test fixtures generate synthetic transcript JSONL and pipe synthetic stdin +payloads to the hook. No live Claude Code session required. diff --git a/plugins/core/hooks/hooks.json b/plugins/core/hooks/hooks.json index d900d8a..bbedab7 100644 --- a/plugins/core/hooks/hooks.json +++ b/plugins/core/hooks/hooks.json @@ -21,6 +21,17 @@ } ] } + ], + "SubagentStop": [ + { + "matcher": "*", + "hooks": [ + { + "type": "command", + "command": "${CLAUDE_PLUGIN_ROOT}/hooks/verify-deliverables.sh" + } + ] + } ] } } diff --git a/plugins/core/hooks/tests/test-verify-deliverables.sh b/plugins/core/hooks/tests/test-verify-deliverables.sh new file mode 100755 index 0000000..f0129aa --- /dev/null +++ b/plugins/core/hooks/tests/test-verify-deliverables.sh @@ -0,0 +1,215 @@ +#!/bin/bash +# Test harness for verify-deliverables.sh +# +# Generates synthetic SubagentStop payloads and transcript JSONL fixtures, +# pipes them to the hook, and asserts on its stdout. Doesn't require a +# live Claude Code session. +# +# Run from any cwd: +# bash plugins/core/hooks/tests/test-verify-deliverables.sh + +set -u + +HOOK="$(cd "$(dirname "$0")/.." && pwd)/verify-deliverables.sh" +if [ ! -x "$HOOK" ]; then + echo "ERROR: hook not executable at $HOOK" + exit 2 +fi + +TMPDIR=$(mktemp -d) +trap 'rm -rf "$TMPDIR"' EXIT + +PASS=0 +FAIL=0 + +# --- Helpers --- + +# Run a case. $expect is one of: +# silent — assert empty stdout +# advisory — assert stdout contains "additionalContext" +# — assert stdout contains the literal substring +run_case() { + local name="$1" + local payload="$2" + local expect="$3" + local output + output=$(printf '%s' "$payload" | bash "$HOOK" 2>&1) + + case "$expect" in + silent) + if [ -z "$output" ]; then + echo "PASS: $name" + PASS=$((PASS + 1)) + else + echo "FAIL: $name" + echo " expected: silent" + echo " got: $output" + FAIL=$((FAIL + 1)) + fi + ;; + advisory) + if echo "$output" | grep -q 'additionalContext'; then + echo "PASS: $name" + PASS=$((PASS + 1)) + else + echo "FAIL: $name" + echo " expected: advisory output" + echo " got: $output" + FAIL=$((FAIL + 1)) + fi + ;; + *) + if echo "$output" | grep -qF "$expect"; then + echo "PASS: $name" + PASS=$((PASS + 1)) + else + echo "FAIL: $name" + echo " expected substring: $expect" + echo " got: $output" + FAIL=$((FAIL + 1)) + fi + ;; + esac +} + +write_transcript() { + local file="$TMPDIR/$1.jsonl" + shift + : > "$file" + for line in "$@"; do + printf '%s\n' "$line" >> "$file" + done + printf '%s' "$file" +} + +make_payload() { + local transcript="$1" + local stop_hook_active="${2:-false}" + cat < "$TMPDIR/foo.ts" +TX_HONEST=$(write_transcript "honest" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/foo.ts\",\"content\":\"content\"}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Created `foo.ts` as requested."}]}}' +) + +# A liar: completion claim, zero mutating tool calls (only Read). +TX_LIAR_NO_ACTION=$(write_transcript "liar-no-action" \ + '{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Read","input":{"file_path":"src/whatever.ts"}}]}}' \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Implemented the change in src/whatever.ts."}]}}' +) + +# A ghost: tool call recorded but file is not on disk. +TX_GHOST=$(write_transcript "ghost" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/ghost.md\",\"content\":\"x\"}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"All set."}]}}' +) + +# Pure analyst: no completion claim, no mutations. +TX_ANALYST=$(write_transcript "analyst" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"I read the code and found 3 patterns worth examining further."}]}}' +) + +# Trivial chatter: completion-ish word but no file path tokens. +TX_TRIVIAL=$(write_transcript "trivial" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"All done with the analysis."}]}}' +) + +# Edit (not Write) to existing file. Should pass — Edit counts as mutating. +echo "before" > "$TMPDIR/edited.ts" +TX_EDIT=$(write_transcript "edit-honest" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Edit\",\"input\":{\"file_path\":\"$TMPDIR/edited.ts\",\"old_string\":\"before\",\"new_string\":\"after\"}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Updated `edited.ts`."}]}}' +) + +# tool_input shape (alternative payload format) for Write. +echo "alt-content" > "$TMPDIR/alt.ts" +TX_ALT_SHAPE=$(write_transcript "alt-shape" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"tool_input\":{\"file_path\":\"$TMPDIR/alt.ts\"}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Wrote alt.ts."}]}}' +) + +# MultiEdit should count as mutating. +echo "before" > "$TMPDIR/multi.ts" +TX_MULTIEDIT=$(write_transcript "multi-edit" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"MultiEdit\",\"input\":{\"file_path\":\"$TMPDIR/multi.ts\",\"edits\":[]}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Updated multi.ts in several places."}]}}' +) + +# Negation: "not all tests pass" should not flag. +TX_NEGATION=$(write_transcript "negated" \ + '{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Read","input":{"file_path":"src/foo.ts"}}]}}' \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Investigated src/foo.ts — not all tests pass yet."}]}}' +) + +# Trailing boundary: "fixed itself" should not match "fixed it". +TX_FIXED_ITSELF=$(write_transcript "fixed-itself" \ + '{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Read","input":{"file_path":"src/x.ts"}}]}}' \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"After reading src/x.ts the issue fixed itself when I retried."}]}}' +) + +# Broken symlink: tool created the symlink, target is gone. -L catches it. +ln -s "/nonexistent/target" "$TMPDIR/broken-link" +TX_BROKEN_SYMLINK=$(write_transcript "broken-symlink" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/broken-link\"}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done."}]}}' +) + +# Long transcript: Write at start, final message after many Read calls. +# Confirms the 200-line slice catches Writes that the old 50-line window missed. +echo "long-content" > "$TMPDIR/long.ts" +{ + printf '%s\n' "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/long.ts\",\"content\":\"x\"}}]}}" + for i in $(seq 1 60); do + printf '%s\n' "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Read\",\"input\":{\"file_path\":\"src/file-$i.ts\"}}]}}" + done + printf '%s\n' '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Implemented in long.ts."}]}}' +} > "$TMPDIR/long-transcript.jsonl" +TX_LONG="$TMPDIR/long-transcript.jsonl" + +# Relative file_path with cwd — must resolve to $CWD/relative.ts. +echo "relative-content" > "$TMPDIR/relative.ts" +TX_RELATIVE=$(write_transcript "relative" \ + '{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Write","input":{"file_path":"relative.ts"}}]}}' \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done."}]}}' +) + +# --- Cases --- + +run_case "honest completion (write + file exists)" "$(make_payload "$TX_HONEST")" silent +run_case "edit + file exists" "$(make_payload "$TX_EDIT")" silent +run_case "MultiEdit counts as mutating" "$(make_payload "$TX_MULTIEDIT")" silent +run_case "alt payload shape (.tool_input.file_path)" "$(make_payload "$TX_ALT_SHAPE")" silent +run_case "pure analyst (no claim, no mutations)" "$(make_payload "$TX_ANALYST")" silent +run_case "trivial chatter (claim word but no file token)" "$(make_payload "$TX_TRIVIAL")" silent +run_case "negated 'not all tests pass' (no flag)" "$(make_payload "$TX_NEGATION")" silent +run_case "'fixed itself' does not match 'fixed it'" "$(make_payload "$TX_FIXED_ITSELF")" silent +run_case "broken symlink counts as present" "$(make_payload "$TX_BROKEN_SYMLINK")" silent +run_case "long transcript (Write at line 1, msg after 60 Reads)" "$(make_payload "$TX_LONG")" silent +run_case "relative file_path resolves against cwd" "$(make_payload "$TX_RELATIVE")" silent +run_case "liar — completion + 0 mutations" "$(make_payload "$TX_LIAR_NO_ACTION")" "made no file changes" +run_case "ghost — tool call but file missing on disk" "$(make_payload "$TX_GHOST")" "not present on disk" +run_case "escape hatch (stop_hook_active=true)" "$(make_payload "$TX_LIAR_NO_ACTION" true)" silent +run_case "missing transcript path (fail-open)" '{"stop_hook_active":false}' silent +run_case "non-existent transcript file (fail-open)" "$(make_payload "/nonexistent/transcript.jsonl")" silent +run_case "empty stdin (fail-open)" '' silent + +# Kill switch — set env var, expect silent regardless of payload. +KILL_OUT=$(printf '%s' "$(make_payload "$TX_LIAR_NO_ACTION")" | VERIFY_DELIVERABLES_DISABLED=1 bash "$HOOK" 2>&1) +if [ -z "$KILL_OUT" ]; then + echo "PASS: kill switch (VERIFY_DELIVERABLES_DISABLED=1)" + PASS=$((PASS + 1)) +else + echo "FAIL: kill switch — got: $KILL_OUT" + FAIL=$((FAIL + 1)) +fi + +echo "" +echo "Results: $PASS passed, $FAIL failed" +[ "$FAIL" -eq 0 ] diff --git a/plugins/core/hooks/verify-deliverables.sh b/plugins/core/hooks/verify-deliverables.sh new file mode 100755 index 0000000..3af6ac7 --- /dev/null +++ b/plugins/core/hooks/verify-deliverables.sh @@ -0,0 +1,132 @@ +#!/bin/bash +# SubagentStop hook: verify-deliverables +# +# Cross-references the subagent's recent activity against its completion claims. +# When the subagent says it finished work but the tool-call record or filesystem +# disagree, this hook surfaces the discrepancy to the parent agent via +# hookSpecificOutput.additionalContext. +# +# Always advisory, never blocks. Exits 0 in every path so a hook bug cannot +# break the harness. See plugins/core/hooks/README.md for design notes. +# +# Issue: https://github.com/TechNickAI/ai-coding-config/issues/51 + +set -u + +# Kill switch for opt-out without uninstalling. +[ "${VERIFY_DELIVERABLES_DISABLED:-}" = "1" ] && exit 0 + +# Hook depends on jq for JSON parsing. If jq is missing, fail open silently +# rather than producing a non-zero exit that disrupts the harness. +command -v jq >/dev/null 2>&1 || exit 0 + +# --- Read input --- + +INPUT=$(cat) +[ -z "$INPUT" ] && exit 0 + +TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty' 2>/dev/null) +STOP_HOOK_ACTIVE=$(echo "$INPUT" | jq -r '.stop_hook_active // false' 2>/dev/null) +CWD=$(echo "$INPUT" | jq -r '.cwd // empty' 2>/dev/null) +[ -z "$CWD" ] && CWD="$(pwd)" + +# Escape hatch: hook already fired this stop cycle. Prevents loops. +[ "$STOP_HOOK_ACTIVE" = "true" ] && exit 0 + +# Fail-open: no readable transcript -> silent pass. +[ -z "$TRANSCRIPT_PATH" ] && exit 0 +[ ! -r "$TRANSCRIPT_PATH" ] && exit 0 + +# --- Slice recent activity --- + +# 200 entries handles subagents that ran many Read calls between an early +# Write and a final summary, while still bounding parse time. +RECENT=$(tail -n 200 "$TRANSCRIPT_PATH" 2>/dev/null) +[ -z "$RECENT" ] && exit 0 + +# --- Extract signals --- + +LAST_ASSISTANT=$(echo "$RECENT" | jq -c 'select(.type == "assistant")' 2>/dev/null | tail -n 1) +LAST_MSG=$(echo "$LAST_ASSISTANT" | jq -r '.message.content[]? | select(.type == "text") | .text' 2>/dev/null) + +# Mutating tool calls in the recent slice. +MUTATING_CALLS=$(echo "$RECENT" | jq -r ' + select(.type == "assistant") + | .message.content[]? + | select(.type == "tool_use") + | select(.name == "Write" or .name == "Edit" or .name == "MultiEdit" or .name == "NotebookEdit") + | .name +' 2>/dev/null | wc -l | tr -d ' ') + +# File paths from those tool calls. Hedge between .input and .tool_input +# since transcript shape can differ from hook-payload shape across versions. +# Strip embedded newlines so the read loop iterates one path per line. +CALLED_PATHS=$(echo "$RECENT" | jq -r ' + select(.type == "assistant") + | .message.content[]? + | select(.type == "tool_use") + | select(.name == "Write" or .name == "Edit" or .name == "MultiEdit" or .name == "NotebookEdit") + | (.input.file_path // .tool_input.file_path // empty) + | gsub("[\\n\\r]"; " ") +' 2>/dev/null | sort -u) + +# --- Check A: completion claim with zero mutations --- + +ISSUES="" + +if [ -n "$LAST_MSG" ] && [ "$MUTATING_CALLS" = "0" ]; then + # Conservative completion patterns. False positives are noisier than + # false negatives, so we only fire on strong signals. Trailing + # boundary keeps "fixed it" from matching inside "fixed itself" + # and "all tests pass" from matching inside "passenger". + if echo "$LAST_MSG" | grep -qiE '(^|[^a-z])(✅|done\.|complete\.|completed\.|finished\.|implemented |fixed it|created the |wrote the |added the |saved the |all tests pass(ed)?)([^a-zA-Z]|$)'; then + # Reject negated phrasing — "not all tests pass" should not trigger. + if ! echo "$LAST_MSG" | grep -qiE '(not|n.?t|cannot|couldn.?t|didn.?t|haven.?t|hasn.?t|wasn.?t)([[:space:]]+(yet|fully|quite|all|even))?[[:space:]]+(✅|done|complete|completed|finished|implemented|fixed|added|wrote|created|saved|all tests pass)'; then + # Require a file-path-shaped token. Pure analysis output + # ("found 3 patterns in the codebase") shouldn't trigger. + if echo "$LAST_MSG" | grep -qE '[a-zA-Z0-9_/-]+\.[a-zA-Z]{1,5}'; then + ISSUES="${ISSUES}- Subagent claims completion but made no file changes (Write/Edit/MultiEdit/NotebookEdit count: 0).\n" + fi + fi + fi +fi + +# --- Check B: tool-call paths that don't exist on disk --- + +if [ -n "$CALLED_PATHS" ]; then + while IFS= read -r path; do + [ -z "$path" ] && continue + case "$path" in + /*) full_path="$path" ;; + *) full_path="$CWD/$path" ;; + esac + # `-e` follows symlinks; a broken symlink would falsely flag. + # `-L` catches the symlink itself — tool created it successfully + # even if the target moved. + if [ ! -e "$full_path" ] && [ ! -L "$full_path" ]; then + ISSUES="${ISSUES}- Tool call recorded Write/Edit to '$path' but the file is not present on disk.\n" + fi + done </dev/null || true + +exit 0 From 29a6e0a64de84c9a75ad6282d923eb904781210d Mon Sep 17 00:00:00 2001 From: Nick Sullivan Date: Mon, 4 May 2026 17:50:27 -0500 Subject: [PATCH 2/2] =?UTF-8?q?=F0=9F=90=9B=20Fix=20bot-review=20issues=20?= =?UTF-8?q?in=20verify-deliverables=20hook?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Read agent_transcript_path (subagent's own log) instead of transcript_path (parent session); add last_assistant_message shortcut - Fix n.?t → n't in negation guard (n.?t matched 'nt' inside 'count', 'front', 'environment', etc.) - Replace TMPDIR with TEST_DIR in test harness (TMPDIR is a reserved env var that mktemp reads internally) - Add NotebookEdit test case to complete mutation-type coverage - Switch ISSUES separator from literal \n to $'\n'; drop printf %b in MESSAGE to prevent backslash interpretation in file paths - Remove ✅ from completion pattern (locale portability; prose patterns are sufficient) - Payload now includes agent_transcript_path in test fixtures All 19 test cases pass. Co-Authored-By: Claude Opus 4.7 --- .../hooks/tests/test-verify-deliverables.sh | 48 +++++++++++-------- plugins/core/hooks/verify-deliverables.sh | 25 ++++++---- 2 files changed, 45 insertions(+), 28 deletions(-) diff --git a/plugins/core/hooks/tests/test-verify-deliverables.sh b/plugins/core/hooks/tests/test-verify-deliverables.sh index f0129aa..01863d0 100755 --- a/plugins/core/hooks/tests/test-verify-deliverables.sh +++ b/plugins/core/hooks/tests/test-verify-deliverables.sh @@ -16,8 +16,8 @@ if [ ! -x "$HOOK" ]; then exit 2 fi -TMPDIR=$(mktemp -d) -trap 'rm -rf "$TMPDIR"' EXIT +TEST_DIR=$(mktemp -d) +trap 'rm -rf "$TEST_DIR"' EXIT PASS=0 FAIL=0 @@ -73,7 +73,7 @@ run_case() { } write_transcript() { - local file="$TMPDIR/$1.jsonl" + local file="$TEST_DIR/$1.jsonl" shift : > "$file" for line in "$@"; do @@ -86,16 +86,16 @@ make_payload() { local transcript="$1" local stop_hook_active="${2:-false}" cat < "$TMPDIR/foo.ts" +echo "content" > "$TEST_DIR/foo.ts" TX_HONEST=$(write_transcript "honest" \ - "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/foo.ts\",\"content\":\"content\"}}]}}" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TEST_DIR/foo.ts\",\"content\":\"content\"}}]}}" \ '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Created `foo.ts` as requested."}]}}' ) @@ -107,7 +107,7 @@ TX_LIAR_NO_ACTION=$(write_transcript "liar-no-action" \ # A ghost: tool call recorded but file is not on disk. TX_GHOST=$(write_transcript "ghost" \ - "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/ghost.md\",\"content\":\"x\"}}]}}" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TEST_DIR/ghost.md\",\"content\":\"x\"}}]}}" \ '{"type":"assistant","message":{"content":[{"type":"text","text":"All set."}]}}' ) @@ -122,26 +122,33 @@ TX_TRIVIAL=$(write_transcript "trivial" \ ) # Edit (not Write) to existing file. Should pass — Edit counts as mutating. -echo "before" > "$TMPDIR/edited.ts" +echo "before" > "$TEST_DIR/edited.ts" TX_EDIT=$(write_transcript "edit-honest" \ - "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Edit\",\"input\":{\"file_path\":\"$TMPDIR/edited.ts\",\"old_string\":\"before\",\"new_string\":\"after\"}}]}}" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Edit\",\"input\":{\"file_path\":\"$TEST_DIR/edited.ts\",\"old_string\":\"before\",\"new_string\":\"after\"}}]}}" \ '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Updated `edited.ts`."}]}}' ) # tool_input shape (alternative payload format) for Write. -echo "alt-content" > "$TMPDIR/alt.ts" +echo "alt-content" > "$TEST_DIR/alt.ts" TX_ALT_SHAPE=$(write_transcript "alt-shape" \ - "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"tool_input\":{\"file_path\":\"$TMPDIR/alt.ts\"}}]}}" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"tool_input\":{\"file_path\":\"$TEST_DIR/alt.ts\"}}]}}" \ '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Wrote alt.ts."}]}}' ) # MultiEdit should count as mutating. -echo "before" > "$TMPDIR/multi.ts" +echo "before" > "$TEST_DIR/multi.ts" TX_MULTIEDIT=$(write_transcript "multi-edit" \ - "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"MultiEdit\",\"input\":{\"file_path\":\"$TMPDIR/multi.ts\",\"edits\":[]}}]}}" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"MultiEdit\",\"input\":{\"file_path\":\"$TEST_DIR/multi.ts\",\"edits\":[]}}]}}" \ '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Updated multi.ts in several places."}]}}' ) +# NotebookEdit should count as mutating. +echo "before" > "$TEST_DIR/nb.ipynb" +TX_NOTEBOOK=$(write_transcript "notebook-edit" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"NotebookEdit\",\"input\":{\"file_path\":\"$TEST_DIR/nb.ipynb\",\"cell_number\":0,\"new_source\":\"x\"}}]}}" \ + '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Updated the notebook."}]}}' +) + # Negation: "not all tests pass" should not flag. TX_NEGATION=$(write_transcript "negated" \ '{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Read","input":{"file_path":"src/foo.ts"}}]}}' \ @@ -155,26 +162,26 @@ TX_FIXED_ITSELF=$(write_transcript "fixed-itself" \ ) # Broken symlink: tool created the symlink, target is gone. -L catches it. -ln -s "/nonexistent/target" "$TMPDIR/broken-link" +ln -s "/nonexistent/target" "$TEST_DIR/broken-link" TX_BROKEN_SYMLINK=$(write_transcript "broken-symlink" \ - "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/broken-link\"}}]}}" \ + "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TEST_DIR/broken-link\"}}]}}" \ '{"type":"assistant","message":{"content":[{"type":"text","text":"Done."}]}}' ) # Long transcript: Write at start, final message after many Read calls. # Confirms the 200-line slice catches Writes that the old 50-line window missed. -echo "long-content" > "$TMPDIR/long.ts" +echo "long-content" > "$TEST_DIR/long.ts" { - printf '%s\n' "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TMPDIR/long.ts\",\"content\":\"x\"}}]}}" + printf '%s\n' "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Write\",\"input\":{\"file_path\":\"$TEST_DIR/long.ts\",\"content\":\"x\"}}]}}" for i in $(seq 1 60); do printf '%s\n' "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"tool_use\",\"name\":\"Read\",\"input\":{\"file_path\":\"src/file-$i.ts\"}}]}}" done printf '%s\n' '{"type":"assistant","message":{"content":[{"type":"text","text":"Done. Implemented in long.ts."}]}}' -} > "$TMPDIR/long-transcript.jsonl" -TX_LONG="$TMPDIR/long-transcript.jsonl" +} > "$TEST_DIR/long-transcript.jsonl" +TX_LONG="$TEST_DIR/long-transcript.jsonl" # Relative file_path with cwd — must resolve to $CWD/relative.ts. -echo "relative-content" > "$TMPDIR/relative.ts" +echo "relative-content" > "$TEST_DIR/relative.ts" TX_RELATIVE=$(write_transcript "relative" \ '{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Write","input":{"file_path":"relative.ts"}}]}}' \ '{"type":"assistant","message":{"content":[{"type":"text","text":"Done."}]}}' @@ -185,6 +192,7 @@ TX_RELATIVE=$(write_transcript "relative" \ run_case "honest completion (write + file exists)" "$(make_payload "$TX_HONEST")" silent run_case "edit + file exists" "$(make_payload "$TX_EDIT")" silent run_case "MultiEdit counts as mutating" "$(make_payload "$TX_MULTIEDIT")" silent +run_case "NotebookEdit counts as mutating" "$(make_payload "$TX_NOTEBOOK")" silent run_case "alt payload shape (.tool_input.file_path)" "$(make_payload "$TX_ALT_SHAPE")" silent run_case "pure analyst (no claim, no mutations)" "$(make_payload "$TX_ANALYST")" silent run_case "trivial chatter (claim word but no file token)" "$(make_payload "$TX_TRIVIAL")" silent diff --git a/plugins/core/hooks/verify-deliverables.sh b/plugins/core/hooks/verify-deliverables.sh index 3af6ac7..0e31257 100755 --- a/plugins/core/hooks/verify-deliverables.sh +++ b/plugins/core/hooks/verify-deliverables.sh @@ -25,7 +25,10 @@ command -v jq >/dev/null 2>&1 || exit 0 INPUT=$(cat) [ -z "$INPUT" ] && exit 0 -TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty' 2>/dev/null) +# SubagentStop provides agent_transcript_path (subagent's own log); fall back +# to transcript_path for callers that only set the session-level field. +TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.agent_transcript_path // empty' 2>/dev/null) +[ -z "$TRANSCRIPT_PATH" ] && TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty' 2>/dev/null) STOP_HOOK_ACTIVE=$(echo "$INPUT" | jq -r '.stop_hook_active // false' 2>/dev/null) CWD=$(echo "$INPUT" | jq -r '.cwd // empty' 2>/dev/null) [ -z "$CWD" ] && CWD="$(pwd)" @@ -46,8 +49,13 @@ RECENT=$(tail -n 200 "$TRANSCRIPT_PATH" 2>/dev/null) # --- Extract signals --- -LAST_ASSISTANT=$(echo "$RECENT" | jq -c 'select(.type == "assistant")' 2>/dev/null | tail -n 1) -LAST_MSG=$(echo "$LAST_ASSISTANT" | jq -r '.message.content[]? | select(.type == "text") | .text' 2>/dev/null) +# last_assistant_message is provided directly in the SubagentStop payload; +# fall back to JSONL parsing when running against older harness versions. +LAST_MSG=$(echo "$INPUT" | jq -r '.last_assistant_message // empty' 2>/dev/null) +if [ -z "$LAST_MSG" ]; then + LAST_ASSISTANT=$(echo "$RECENT" | jq -c 'select(.type == "assistant")' 2>/dev/null | tail -n 1) + LAST_MSG=$(echo "$LAST_ASSISTANT" | jq -r '.message.content[]? | select(.type == "text") | .text' 2>/dev/null) +fi # Mutating tool calls in the recent slice. MUTATING_CALLS=$(echo "$RECENT" | jq -r ' @@ -79,13 +87,13 @@ if [ -n "$LAST_MSG" ] && [ "$MUTATING_CALLS" = "0" ]; then # false negatives, so we only fire on strong signals. Trailing # boundary keeps "fixed it" from matching inside "fixed itself" # and "all tests pass" from matching inside "passenger". - if echo "$LAST_MSG" | grep -qiE '(^|[^a-z])(✅|done\.|complete\.|completed\.|finished\.|implemented |fixed it|created the |wrote the |added the |saved the |all tests pass(ed)?)([^a-zA-Z]|$)'; then + if echo "$LAST_MSG" | grep -qiE '(^|[^a-z])(done\.|complete\.|completed\.|finished\.|implemented |fixed it|created the |wrote the |added the |saved the |all tests pass(ed)?)([^a-zA-Z]|$)'; then # Reject negated phrasing — "not all tests pass" should not trigger. - if ! echo "$LAST_MSG" | grep -qiE '(not|n.?t|cannot|couldn.?t|didn.?t|haven.?t|hasn.?t|wasn.?t)([[:space:]]+(yet|fully|quite|all|even))?[[:space:]]+(✅|done|complete|completed|finished|implemented|fixed|added|wrote|created|saved|all tests pass)'; then + if ! echo "$LAST_MSG" | grep -qiE "(not|n't|cannot|couldn't|didn't|haven't|hasn't|wasn't)([[:space:]]+(yet|fully|quite|all|even))?[[:space:]]+(done|complete|completed|finished|implemented|fixed|added|wrote|created|saved|all tests pass)"; then # Require a file-path-shaped token. Pure analysis output # ("found 3 patterns in the codebase") shouldn't trigger. if echo "$LAST_MSG" | grep -qE '[a-zA-Z0-9_/-]+\.[a-zA-Z]{1,5}'; then - ISSUES="${ISSUES}- Subagent claims completion but made no file changes (Write/Edit/MultiEdit/NotebookEdit count: 0).\n" + ISSUES="${ISSUES}- Subagent claims completion but made no file changes (Write/Edit/MultiEdit/NotebookEdit count: 0)."$'\n' fi fi fi @@ -104,7 +112,7 @@ if [ -n "$CALLED_PATHS" ]; then # `-L` catches the symlink itself — tool created it successfully # even if the target moved. if [ ! -e "$full_path" ] && [ ! -L "$full_path" ]; then - ISSUES="${ISSUES}- Tool call recorded Write/Edit to '$path' but the file is not present on disk.\n" + ISSUES="${ISSUES}- Tool call recorded Write/Edit to '$path' but the file is not present on disk."$'\n' fi done <