offplan · online
Workstream · idle-session-reflection

Idle session reflection — in-turn memory hook + LLM-driven reflection cron

Two-part automation so sessions self-record without operator effort: (1) lightweight Stop-event hook updates `.claude/memory/` per turn when something notable happens; (2) launchd timer detects idle sessions and runs an LLM-driven reflection pass that decides what was newly learned, what was important, what this session was about, then drafts a /handoff entry.

Status
Draft
Type
workstream
Owner
roman
Created
2026-05-11
Priority
P1
Tags
ops, claude-code, automation

Goal

Eliminate the "I forgot to /handoff before closing the tab" failure mode AND raise the floor on what gets preserved across sessions. Two distinct mechanisms because they have different latency tolerances:

  1. In-turn memory hook (Stop-event, ~50× per session, sub-second budget): mechanical — if the operator said something that fits the memory schema (user, feedback, project, reference per CLAUDE.md memory section), capture it; otherwise no-op.
  2. Idle reflection cron (launchd timer, every ~10 min, ~15-30s budget per fire): intelligent — detects sessions with no operator activity for >15 min that haven't /handoffed. For each, runs an LLM-driven analysis pass: what was newly learned, what was important, what this session was about, what decisions matter for the next session. Drafts a handoff entry, ratifies it, commits.

Why both? The Stop hook captures the cheap stuff incrementally; the cron does the expensive reflection only when needed (idle session = candidate for premature closure). Without the cron, the Stop hook eventually fires one last time, without ceremony, and the session's analysis layer never runs. Without the Stop hook, every reflection cycle has to re-read the whole transcript from scratch — expensive.
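A minimal sketch of the Stop-hook side. Assumptions not established by this doc: the hook receives a JSON payload on stdin carrying `transcript_path` and `session_id` (check the current Claude Code hooks docs for exact field names), the transcript is JSONL with `role`/`content` events, and queued candidates land in a hypothetical `.claude/memory/queue.jsonl`:

```python
import json
import re
import sys
from pathlib import Path

# Cheap mechanical cues for the four memory classes (user, feedback,
# project, reference per CLAUDE.md). Patterns are placeholders, not a
# tuned taxonomy — the sub-second budget rules out an LLM call here.
CUES = {
    "user":      re.compile(r"\b(i prefer|i always|i never|call me)\b", re.I),
    "feedback":  re.compile(r"\b(don't do|stop doing|next time|that's wrong)\b", re.I),
    "project":   re.compile(r"\b(we decided|the plan is|blocked by)\b", re.I),
    "reference": re.compile(r"https?://\S+"),
}

def last_user_message(transcript_path: str) -> str:
    """Return the most recent user-message text from a JSONL transcript."""
    text = ""
    for line in Path(transcript_path).read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict) and event.get("role") == "user":
            content = event.get("content", "")
            text = content if isinstance(content, str) else json.dumps(content)
    return text

def main() -> None:
    # Invoked once per Stop event; the hook runner pipes the payload via stdin.
    payload = json.load(sys.stdin)
    msg = last_user_message(payload["transcript_path"])
    hits = [cls for cls, pat in CUES.items() if pat.search(msg)]
    if not hits:
        return  # no-op: nothing notable this turn
    queue = Path(".claude/memory/queue.jsonl")
    queue.parent.mkdir(parents=True, exist_ok=True)
    with queue.open("a") as f:
        f.write(json.dumps({"session": payload.get("session_id"),
                            "classes": hits, "text": msg}) + "\n")
```

The queue entries are deliberately raw; the reflection pass is what upgrades them into ratified memory files, so the hook never has to be smart.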

Tasks

What the reflection should ask itself

The obvious four (what was learned / what was important / what was the session about / what's the Resume Prompt) are the floor, not the ceiling. The reflection's value comes from catching the things the operator wouldn't think to write down — Roman's framing: "could be other things that I missed."

Group A — content the operator obviously didn't capture

  1. Failed experiments worth recording. What was tried that didn't work? Right now sessions record what succeeded; the failures often have higher long-term value (so we don't re-try them in 3 months). Look for git reset, git revert, deleted files, abandoned branches, agent dispatches that got TaskStop'd. Each is a "we tried X, ruled it out because Y" candidate for a learning entry.
  2. Implicit decisions that drifted forward. When the operator said "OK fine" or didn't push back on a default the assistant proposed, that's an unrecorded decision. Surface them: "you implicitly accepted [choice]; future-you may want to know why."
  3. Unstated assumptions surfaced. Every concrete decision sits on top of assumptions ("directory-style URLs" assumes stakeholders share them; "homegrown matcher" assumes we won't scale past 200 entries). Extract the load-bearing assumptions and tag them so they can be re-examined when context shifts.
  4. External-event mentions. "Sergei flew in", "the developer pushed back", "customer escalated" — operator drops these in chat but they rarely land in the session record. The reflection should pick them up as project-memory candidates.
  5. Naming inconsistencies. Within a single session: did the same thing get called multiple names (e.g. "autohooks" vs "idle reflection")? Pick one, flag the other, suggest update.
  6. What WASN'T done. Scope at session start vs scope shipped. Delta = scope cut or scope creep. Worth knowing why. ("Started planning P11d, ended up only doing the migration — design was deferred.")
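Of Group A, item 1 is the most mechanizable. A sketch of the failed-experiments scan using only the reflog (abandoned branches and TaskStop'd dispatches would need separate scans); the function name and repo-path parameter are illustrative:

```python
import subprocess

def failed_experiment_candidates(repo: str = ".") -> list[str]:
    """Scan the git reflog for reset/revert entries — each is a candidate
    'we tried X, ruled it out because Y' learning entry."""
    out = subprocess.run(
        ["git", "-C", repo, "reflog", "--date=iso"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines()
            if "reset:" in line or "revert" in line.lower()]
```

The reflection prompt would then ask the LLM to attach the "because Y" — the reflog only gives the X.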

Group B — cross-session meta-patterns (needs access to recent sessions, not just current transcript)

  1. Recurring failure modes. "Bash CWD drifted to agent worktree 4 times this session, 11 times across the last 5 sessions — this is a tooling pattern worth documenting / fixing." Record a memory of class feedback, keyed on the failure shape.
  2. Velocity and cost signals. This session = X tokens, Y minutes, N tool calls. Comparable sessions averaged A tokens, B minutes, M tool calls. If we're 2× over, investigate. If we're 2× under, capture what worked.
  3. Convention adoption. "Pattern P was used 5 times this session, 23 times across the last 10 sessions." → Promote to docs/conventions/<slug>.md candidate. Inverse: "Pattern Q is documented as a convention but hasn't been used in 30 sessions" → demote or remove.
  4. Stale Resume Prompts. Last session's Resume Prompt said "next step is X." This session did Y instead. Why? Legitimate priority change worth recording? Or scope error in the previous handoff?
  5. Confidence calibration. Past Resume Prompts ended with "Confidence: H." For each, did the predicted next step actually work? Track the H/M/L → outcome mapping over time; recalibrate the confidence vocabulary.
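Everything in Group B presumes a cross-session store the reflection can tally against. A sketch under a purely illustrative assumption — that each session leaves a JSONL log in a shared directory with optional `failure_shape` events; the real log schema is undecided:

```python
import json
from collections import Counter
from pathlib import Path

def recurring_failures(log_dir: str, last_n: int = 5) -> Counter:
    """Tally failure-shape tags across the last N session logs, so the
    reflection can say '11 times across the last 5 sessions'."""
    counts: Counter = Counter()
    logs = sorted(Path(log_dir).glob("*.jsonl"))[-last_n:]
    for log in logs:
        for line in log.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            shape = event.get("failure_shape") if isinstance(event, dict) else None
            if shape:
                counts[shape] += 1
    return counts
```

The same walk-and-tally shape covers velocity signals, convention adoption, and confidence calibration — only the field being counted changes.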

Group C — graph + repo-state inferences

  1. Dependencies discovered. "Workstream A blocked-by B" was set up front. Did this session reveal A also depends on C? Update A's frontmatter.
  2. Cross-references not folded back. Did the peer Claude session (claude-peers MCP) say something coordination-relevant that didn't make it into a workstream file? Scan peer message log.
  3. Code health drift. Repo now has N scripts, K workstreams, M plans. Are any folders getting unwieldy? Surface the meta-trend so the operator can decide.
  4. Files touched but not noted. Git diff of session commits vs session frontmatter workstreams_touched. Mismatches → operator forgot to mention a workstream, OR scope crept silently.
  5. Memory hits without writes. Did the assistant retrieve a memory but the operator's behaviour suggested the memory was wrong/outdated? Soft signal to flag the memory entry for review.
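Item 4 reduces to a set comparison once both sides are extracted. A sketch assuming a `workstreams/<slug>/` path layout and a `workstreams_touched: [a, b]` frontmatter key — both are illustrative conventions, not established ones; the changed-file list would come from `git diff --name-only` over the session's commits:

```python
import re

def declared_workstreams(frontmatter: str) -> set[str]:
    """Parse the (assumed) `workstreams_touched: [a, b]` frontmatter key."""
    m = re.search(r"workstreams_touched:\s*\[([^\]]*)\]", frontmatter)
    return {s.strip() for s in m.group(1).split(",") if s.strip()} if m else set()

def touched_workstreams(changed_files: list[str]) -> set[str]:
    """Infer workstream slugs from changed paths under workstreams/<slug>/."""
    return {p.split("/")[1] for p in changed_files
            if p.startswith("workstreams/") and p.count("/") >= 2}

def mismatches(frontmatter: str, changed_files: list[str]) -> dict[str, set[str]]:
    declared = declared_workstreams(frontmatter)
    touched = touched_workstreams(changed_files)
    return {"undeclared": touched - declared,  # silent scope creep
            "untouched": declared - touched}   # mentioned but no commits
```

Either mismatch direction is a prompt for the reflection, not an auto-fix: only the operator knows which side is wrong.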

Group D — operator state cues (Roman's "things I missed")

  1. Tone-of-voice / momentum. Did the operator's messages get terser (signal: low patience or rushing) or longer (signal: confused)? Did "ugh" / "wait" / "no actually" appear? Possible feedback memory: "operator was visibly fatigued when X happened — be more conservative next time."
  2. Time-of-day flags. Sessions running 11pm–2am suggest urgency or burnout. Architectural decisions made in this window have higher revisit-rate. Tag the session with a "fatigue-window" marker so future Roman knows to second-guess.
  3. Operator-explicit "remember this" misses. Sometimes Roman says "remember that X" but it slips through the in-turn memory hook (A1). The reflection should catch these.
  4. Outcome vs intention. At session start the implied goal was X. Final outcome was Y. Worth recording either as "we evolved the goal mid-session because Z" (legitimate) or as "we drifted off-task" (less legitimate).
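The tone/momentum cue in item 1 can be roughed out without an LLM by watching message-length trends, leaving the keyword cues ("ugh", "wait", "no actually") to the reflection prompt itself. Window size and thresholds below are placeholders:

```python
def momentum_signal(user_msg_lengths: list[int]) -> str:
    """Crude momentum heuristic: compare the mean length of the last three
    operator messages against the session's earlier baseline."""
    if len(user_msg_lengths) < 6:
        return "insufficient-data"
    earlier = user_msg_lengths[:-3]
    baseline = sum(earlier) / len(earlier)
    recent = sum(user_msg_lengths[-3:]) / 3
    if recent < 0.5 * baseline:
        return "terser"   # possible low patience or rushing
    if recent > 2.0 * baseline:
        return "longer"   # possible confusion
    return "steady"
```

The signal is only a nudge for the reflection ("look closer at the last third of the transcript"), never a conclusion on its own.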

This list is the menu the reflection's LLM prompt draws from — not a checklist. Reflection picks the 2–4 most-relevant items per session, writes them out, suggests memory / workstream / ADR / learning targets. Items that consistently produce nothing across many sessions get pruned from the prompt; items that consistently produce value get promoted. The prompt itself evolves.

Reflection design open questions (the /plan author interviews on these)

  1. "Idle" threshold. 15 min is a guess. Some operators (Roman) leave tabs open all day with intermittent work; reflection should NOT fire on brief breaks. Maybe 30 min, maybe a sliding heuristic (idle increases as session length grows). Or: only reflect when the tab is closed/window is destroyed (Stop hook can detect "this is the last Stop" via a sentinel).
  2. Reflection output is a DRAFT, not authoritative. The operator should review next session. But what's the friction cost of ALWAYS having a draft to review when you /resume? Maybe ratify auto-handoffs after N days of no review.
  3. LLM choice. Use Claude API directly (programmatic, fastest) or shell out to claude headless (gets MCP tools + project context for free, slower). Cost difference matters at scale.
  4. Memory dedupe. If the reflection extracts a "feedback" memory that's already recorded, don't duplicate. Use content_hash check against existing .claude/memory/*.md.
  5. Privacy / scope. Reflection reads the full transcript. Should it skip sessions with sensitive content (financial, personal, legal)? Need a tag or filename hint.
  6. Failure mode. If reflection produces a bad draft (hallucination, wrong workstream classification), what's the recovery path? The operator can edit the draft manually; the auto-pathway should log everything so the bad output can be diagnosed.
  7. Composition with the existing /handoff skill. /handoff has 8+ steps (regen INDEX, render HTML, search index, frontmatter validation, etc.). Should the reflection cron call /handoff as a subprocess or duplicate those steps? Probably call /handoff to keep one source of truth.
  8. What about the in-turn memory hook firing on Stop AFTER a /handoff? /handoff Step 5 already writes the canonical memory updates. Stop-hook should detect "this Stop came right after a /handoff" and skip its queue write to avoid double-recording.
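Of the open questions, 4 (memory dedupe) is the most concrete. A sketch of the content_hash check, assuming each memory file carries a `content_hash: <hex>` frontmatter line — that convention is an assumption here, not something the memory schema already mandates:

```python
import hashlib
import re
from pathlib import Path

def content_hash(text: str) -> str:
    """Hash over whitespace/case-normalized text so trivial rephrasings of
    the same memory don't defeat the dedupe check."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

def already_recorded(candidate: str, memory_dir: str = ".claude/memory") -> bool:
    """True if an existing memory file declares the same content hash."""
    h = content_hash(candidate)
    return any(f"content_hash: {h}" in f.read_text()
               for f in Path(memory_dir).glob("*.md"))
```

Exact-hash matching only catches near-duplicates; semantically-equivalent-but-differently-worded memories would still need an LLM-side check, which is part of why this stays an open question.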

What's Next

Blocked-pending-/plan. The Stop-hook side could ship faster (mechanical) but the reflection side carries 70% of the value and 90% of the design risk — need the interview before any code lands.

When unblocked: start with A1 (Stop hook) — cheap, independently useful, dogfoodable in days. Then A2 + A3 + A5 (the reflection daemon) once /plan ratifies the design questions.

Key Context

Session Log