offplan · online
Workstream · idle-session-reflection

Idle session reflection — in-turn memory hook + LLM-driven reflection cron

Two-part automation so sessions self-record without operator effort: (1) lightweight Stop-event hook updates `.claude/memory/` per turn when something notable happens; (2) launchd timer detects idle sessions and runs an LLM-driven reflection pass that decides what was newly learned, what was important, what this session was about, then drafts a /handoff entry.

Status
Draft
Type
workstream
Owner
roman
Created
2026-05-11
Priority
P1
Tags
ops, claude-code, automation

Goal

Eliminate the "I forgot to /handoff before closing the tab" failure mode AND raise the floor on what gets preserved across sessions. Two distinct mechanisms because they have different latency tolerances:

  1. In-turn memory hook (Stop-event, ~50× per session, sub-second budget): mechanical — if the operator said something that fits the memory schema (user, feedback, project, reference per CLAUDE.md memory section), capture it; otherwise no-op.
  2. Idle reflection cron (launchd timer, every ~10 min, ~15-30s budget per fire): intelligent — detects sessions with no operator activity for >15 min that haven't /handoffed. For each, runs an LLM-driven analysis pass: what was newly learned, what was important, what this session was about, what decisions matter for the next session. Drafts a handoff entry, ratifies it, commits.

Why both? The Stop hook captures the cheap stuff incrementally; the cron does the expensive reflection only when needed (idle session = candidate for premature closure). Without the cron, the Stop hook eventually fires one last time, without ceremony, and the session's analysis layer never runs. Without the Stop hook, every reflection cycle has to re-read the whole transcript from scratch — expensive.
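A minimal sketch of the Stop-hook side. Assumptions not established by this doc: the hook receives a JSON payload on stdin carrying `transcript_path` and `session_id` (check the current Claude Code hooks docs for exact field names), the transcript is JSONL with `role`/`content` events, and queued candidates land in a hypothetical `.claude/memory/queue.jsonl`:

```python
import json
import re
import sys
from pathlib import Path

# Cheap mechanical cues for the four memory classes (user, feedback,
# project, reference per CLAUDE.md). Patterns are placeholders, not a
# tuned taxonomy — the sub-second budget rules out an LLM call here.
CUES = {
    "user":      re.compile(r"\b(i prefer|i always|i never|call me)\b", re.I),
    "feedback":  re.compile(r"\b(don't do|stop doing|next time|that's wrong)\b", re.I),
    "project":   re.compile(r"\b(we decided|the plan is|blocked by)\b", re.I),
    "reference": re.compile(r"https?://\S+"),
}

def last_user_message(transcript_path: str) -> str:
    """Return the most recent user-message text from a JSONL transcript."""
    text = ""
    for line in Path(transcript_path).read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict) and event.get("role") == "user":
            content = event.get("content", "")
            text = content if isinstance(content, str) else json.dumps(content)
    return text

def main() -> None:
    # Invoked once per Stop event; the hook runner pipes the payload via stdin.
    payload = json.load(sys.stdin)
    msg = last_user_message(payload["transcript_path"])
    hits = [cls for cls, pat in CUES.items() if pat.search(msg)]
    if not hits:
        return  # no-op: nothing notable this turn
    queue = Path(".claude/memory/queue.jsonl")
    queue.parent.mkdir(parents=True, exist_ok=True)
    with queue.open("a") as f:
        f.write(json.dumps({"session": payload.get("session_id"),
                            "classes": hits, "text": msg}) + "\n")
```

The queue entries are deliberately raw; the reflection pass is what upgrades them into ratified memory files, so the hook never has to be smart.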

Tasks

What the reflection should ask itself

The obvious four (what was learned / what was important / what was the session about / what's the Resume Prompt) are the floor, not the ceiling. The reflection's value comes from catching the things the operator wouldn't think to write down — Roman's framing: "could be other things that I missed."

Group A — content the operator obviously didn't capture

  1. Failed experiments worth recording. What was tried that didn't work? Right now sessions record what succeeded; the failures often have higher long-term value (so we don't re-try them in 3 months). Look for git reset, git revert, deleted files, abandoned branches, agent dispatches that got TaskStop'd. Each is a "we tried X, ruled it out because Y" candidate for a learning entry.
  2. Implicit decisions that drifted forward. When the operator said "OK fine" or didn't push back on a default the assistant proposed, that's an unrecorded decision. Surface them: "you implicitly accepted [choice]; future-you may want to know why."
  3. Unstated assumptions surfaced. Every concrete decision sits on top of assumptions ("directory-style URLs" assumes stakeholders share them; "homegrown matcher" assumes we won't scale past 200 entries). Extract the load-bearing assumptions and tag them so they can be re-examined when context shifts.
  4. External-event mentions. "Sergei flew in", "the developer pushed back", "customer escalated" — operator drops these in chat but they rarely land in the session record. The reflection should pick them up as project-memory candidates.
  5. Naming inconsistencies. Within a single session: did the same thing get called multiple names (e.g. "autohooks" vs "idle reflection")? Pick one, flag the other, suggest update.
  6. What WASN'T done. Scope at session start vs scope shipped. Delta = scope cut or scope creep. Worth knowing why. ("Started planning P11d, ended up only doing the migration — design was deferred.")
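Of Group A, item 1 is the most mechanizable. A sketch of the failed-experiments scan using only the reflog (abandoned branches and TaskStop'd dispatches would need separate scans); the function name and repo-path parameter are illustrative:

```python
import subprocess

def failed_experiment_candidates(repo: str = ".") -> list[str]:
    """Scan the git reflog for reset/revert entries — each is a candidate
    'we tried X, ruled it out because Y' learning entry."""
    out = subprocess.run(
        ["git", "-C", repo, "reflog", "--date=iso"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines()
            if "reset:" in line or "revert" in line.lower()]
```

The reflection prompt would then ask the LLM to attach the "because Y" — the reflog only gives the X.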

Group B — cross-session meta-patterns (needs access to recent sessions, not just current transcript)

  1. Recurring failure modes. "Bash CWD drifted to agent worktree 4 times this session, 11 times across the last 5 sessions — this is a tooling pattern worth documenting / fixing." Record a memory of class feedback, keyed on the failure shape.
  2. Velocity and cost signals. This session = X tokens, Y minutes, N tool calls. Comparable sessions averaged A tokens, B minutes, M tool calls. If we're 2× over, investigate. If we're 2× under, capture what worked.
  3. Convention adoption. "Pattern P was used 5 times this session, 23 times across the last 10 sessions." → Promote to docs/conventions/<slug>.md candidate. Inverse: "Pattern Q is documented as a convention but hasn't been used in 30 sessions" → demote or remove.
  4. Stale Resume Prompts. Last session's Resume Prompt said "next step is X." This session did Y instead. Why? Legitimate priority change worth recording? Or scope error in the previous handoff?
  5. Confidence calibration. Past Resume Prompts ended with "Confidence: H." For each, did the predicted next step actually work? Track the H/M/L → outcome mapping over time; recalibrate the confidence vocabulary.
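Everything in Group B presumes a cross-session store the reflection can tally against. A sketch under a purely illustrative assumption — that each session leaves a JSONL log in a shared directory with optional `failure_shape` events; the real log schema is undecided:

```python
import json
from collections import Counter
from pathlib import Path

def recurring_failures(log_dir: str, last_n: int = 5) -> Counter:
    """Tally failure-shape tags across the last N session logs, so the
    reflection can say '11 times across the last 5 sessions'."""
    counts: Counter = Counter()
    logs = sorted(Path(log_dir).glob("*.jsonl"))[-last_n:]
    for log in logs:
        for line in log.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            shape = event.get("failure_shape") if isinstance(event, dict) else None
            if shape:
                counts[shape] += 1
    return counts
```

The same walk-and-tally shape covers velocity signals, convention adoption, and confidence calibration — only the field being counted changes.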

Group C — graph + repo-state inferences

  1. Dependencies discovered. "Workstream A blocked-by B" was set up front. Did this session reveal A also depends on C? Update A's frontmatter.
  2. Cross-references not folded back. Did the peer Claude session (claude-peers MCP) say something coordination-relevant that didn't make it into a workstream file? Scan peer message log.
  3. Code health drift. Repo now has N scripts, K workstreams, M plans. Are any folders getting unwieldy? Surface the meta-trend so the operator can decide.
  4. Files touched but not noted. Git diff of session commits vs session frontmatter workstreams_touched. Mismatches → operator forgot to mention a workstream, OR scope crept silently.
  5. Memory hits without writes. Did the assistant retrieve a memory but the operator's behaviour suggested the memory was wrong/outdated? Soft signal to flag the memory entry for review.
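Item 4 reduces to a set comparison once both sides are extracted. A sketch assuming a `workstreams/<slug>/` path layout and a `workstreams_touched: [a, b]` frontmatter key — both are illustrative conventions, not established ones; the changed-file list would come from `git diff --name-only` over the session's commits:

```python
import re

def declared_workstreams(frontmatter: str) -> set[str]:
    """Parse the (assumed) `workstreams_touched: [a, b]` frontmatter key."""
    m = re.search(r"workstreams_touched:\s*\[([^\]]*)\]", frontmatter)
    return {s.strip() for s in m.group(1).split(",") if s.strip()} if m else set()

def touched_workstreams(changed_files: list[str]) -> set[str]:
    """Infer workstream slugs from changed paths under workstreams/<slug>/."""
    return {p.split("/")[1] for p in changed_files
            if p.startswith("workstreams/") and p.count("/") >= 2}

def mismatches(frontmatter: str, changed_files: list[str]) -> dict[str, set[str]]:
    declared = declared_workstreams(frontmatter)
    touched = touched_workstreams(changed_files)
    return {"undeclared": touched - declared,  # silent scope creep
            "untouched": declared - touched}   # mentioned but no commits
```

Either mismatch direction is a prompt for the reflection, not an auto-fix: only the operator knows which side is wrong.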

Group D — operator state cues (Roman's "things I missed")

  1. Tone-of-voice / momentum. Did the operator's messages get terser (signal: low patience or rushing) or longer (signal: confused)? Did "ugh" / "wait" / "no actually" appear? Possible feedback memory: "operator was visibly fatigued when X happened — be more conservative next time."
  2. Time-of-day flags. Sessions running 11pm–2am suggest urgency or burnout. Architectural decisions made in this window have higher revisit-rate. Tag the session with a "fatigue-window" marker so future Roman knows to second-guess.
  3. Operator-explicit "remember this" misses. Sometimes Roman says "remember that X" but it slips through the in-turn memory hook (A1). The reflection should catch these.
  4. Outcome vs intention. At session start the implied goal was X. Final outcome was Y. Worth recording either as "we evolved the goal mid-session because Z" (legitimate) or as "we drifted off-task" (less legitimate).
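The tone/momentum cue in item 1 can be roughed out without an LLM by watching message-length trends, leaving the keyword cues ("ugh", "wait", "no actually") to the reflection prompt itself. Window size and thresholds below are placeholders:

```python
def momentum_signal(user_msg_lengths: list[int]) -> str:
    """Crude momentum heuristic: compare the mean length of the last three
    operator messages against the session's earlier baseline."""
    if len(user_msg_lengths) < 6:
        return "insufficient-data"
    earlier = user_msg_lengths[:-3]
    baseline = sum(earlier) / len(earlier)
    recent = sum(user_msg_lengths[-3:]) / 3
    if recent < 0.5 * baseline:
        return "terser"   # possible low patience or rushing
    if recent > 2.0 * baseline:
        return "longer"   # possible confusion
    return "steady"
```

The signal is only a nudge for the reflection ("look closer at the last third of the transcript"), never a conclusion on its own.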

This list is the menu the reflection's LLM prompt draws from — not a checklist. Reflection picks the 2–4 most-relevant items per session, writes them out, suggests memory / workstream / ADR / learning targets. Items that consistently produce nothing across many sessions get pruned from the prompt; items that consistently produce value get promoted. The prompt itself evolves.

Reflection design open questions (the /plan author interviews on these)

  1. "Idle" threshold. 15 min is a guess. Some operators (Roman) leave tabs open all day with intermittent work; reflection should NOT fire on brief breaks. Maybe 30 min, maybe a sliding heuristic (idle increases as session length grows). Or: only reflect when the tab is closed/window is destroyed (Stop hook can detect "this is the last Stop" via a sentinel).
  2. Reflection output is a DRAFT, not authoritative. The operator should review next session. But what's the friction cost of ALWAYS having a draft to review when you /resume? Maybe ratify auto-handoffs after N days of no review.
  3. LLM choice. Use Claude API directly (programmatic, fastest) or shell out to claude headless (gets MCP tools + project context for free, slower). Cost difference matters at scale.
  4. Memory dedupe. If the reflection extracts a "feedback" memory that's already recorded, don't duplicate. Use content_hash check against existing .claude/memory/*.md.
  5. Privacy / scope. Reflection reads the full transcript. Should it skip sessions with sensitive content (financial, personal, legal)? Need a tag or filename hint.
  6. Failure mode. If reflection produces a bad draft (hallucination, wrong workstream classification), what's the recovery path? The operator can edit the draft manually; the auto-pathway should log everything so the bad output can be diagnosed.
  7. Composition with the existing /handoff skill. /handoff has 8+ steps (regen INDEX, render HTML, search index, frontmatter validation, etc.). Should the reflection cron call /handoff as a subprocess or duplicate those steps? Probably call /handoff to keep one source of truth.
  8. What about the in-turn memory hook firing on Stop AFTER a /handoff? /handoff Step 5 already writes the canonical memory updates. Stop-hook should detect "this Stop came right after a /handoff" and skip its queue write to avoid double-recording.
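Of the open questions, 4 (memory dedupe) is the most concrete. A sketch of the content_hash check, assuming each memory file carries a `content_hash: <hex>` frontmatter line — that convention is an assumption here, not something the memory schema already mandates:

```python
import hashlib
import re
from pathlib import Path

def content_hash(text: str) -> str:
    """Hash over whitespace/case-normalized text so trivial rephrasings of
    the same memory don't defeat the dedupe check."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

def already_recorded(candidate: str, memory_dir: str = ".claude/memory") -> bool:
    """True if an existing memory file declares the same content hash."""
    h = content_hash(candidate)
    return any(f"content_hash: {h}" in f.read_text()
               for f in Path(memory_dir).glob("*.md"))
```

Exact-hash matching only catches near-duplicates; semantically-equivalent-but-differently-worded memories would still need an LLM-side check, which is part of why this stays an open question.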

What's Next

Blocked-pending-/plan. The Stop-hook side could ship faster (mechanical) but the reflection side carries 70% of the value and 90% of the design risk — need the interview before any code lands.

When unblocked: start with A1 (Stop hook) — cheap, independently useful, dogfoodable in days. Then A2 + A3 + A5 (the reflection daemon) once /plan ratifies the design questions.

Key Context

Session Log