From d00860b12ae70b6f91389c3e5e0081f3b73f400c Mon Sep 17 00:00:00 2001
From: dotta <dotta@example.com>
Date: Wed, 8 Apr 2026 17:22:20 -0500
Subject: [PATCH] Add in-progress issue recovery plan

---
 .../2026-04-08-in-progress-issue-recovery.md  | 225 ++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 doc/plans/2026-04-08-in-progress-issue-recovery.md

diff --git a/doc/plans/2026-04-08-in-progress-issue-recovery.md b/doc/plans/2026-04-08-in-progress-issue-recovery.md
new file mode 100644
index 00000000..32cea208
--- /dev/null
+++ b/doc/plans/2026-04-08-in-progress-issue-recovery.md
@@ -0,0 +1,225 @@
+# 2026-04-08 In-Progress Issue Recovery Plan
+
+Status: Proposed
+Date: 2026-04-08
+Audience: Product and engineering
+Related:
+- `server/src/services/heartbeat.ts`
+- `server/src/services/issues.ts`
+- `server/src/services/issue-assignment-wakeup.ts`
+- `server/src/routes/issues.ts`
+- `server/src/__tests__/heartbeat-process-recovery.test.ts`
+- `server/src/__tests__/issues-checkout-wakeup.test.ts`
+- [PAP-1227](/PAP/issues/PAP-1227)
+
+## 1. Purpose
+
+This note defines how Paperclip should handle an issue that is:
+
+- still `in_progress`
+- still assigned
+- but no longer has anyone actively working on it
+
+The problem is not just stale UI. It is a control-plane gap: the issue still looks owned, but no future wake is guaranteed, so work can stop indefinitely.
+
+## 2. Current Behavior
+
+Paperclip already has several partial protections:
+
+- checkout adoption when a stale `checkoutRunId` points at a terminal or missing run in `server/src/services/issues.ts`
+- execution lock cleanup when `executionRunId` points at a non-active run in both `issues.ts` and `heartbeat.ts`
+- orphaned local process recovery in `heartbeat.reapOrphanedRuns()`
+- deferred wake promotion in `releaseIssueExecutionAndPromote()`
+- one follow-up retry when a run ends without posting an issue comment
+
+What is still missing is a continuity rule for the issue itself.
+
+When a heartbeat run finishes and the issue remains `in_progress`, Paperclip currently clears `executionRunId` and may promote an already-deferred wake. If there is no deferred wake, the issue is simply left assigned and `in_progress`.
+
+That means an issue can legitimately end up in this state:
+
+- `status = in_progress`
+- `assigneeAgentId != null`
+- `executionRunId = null`
+- `checkoutRunId` points at an old finished run, or is otherwise stale
+- no queued/running wake exists for the issue
+
+At that point, nothing automatically resumes the work.
+
+## 3. Root Cause
+
+The system enforces comment continuity, but not execution continuity.
+
+Today the lifecycle is effectively:
+
+1. wake the assignee
+2. run one heartbeat
+3. require a comment
+4. stop unless some other event happens
+
+That is fine for tasks that move themselves to `done`, `blocked`, or `in_review` in one heartbeat. It fails for work that legitimately spans multiple heartbeats but does not produce a new external trigger.
+
+This is why the issue can "just sit there": there is no invariant saying "`in_progress` must imply an active run, a queued continuation, or an explicit waiting state."
+
+## 4. Desired Invariant
+
+For an assigned issue, `in_progress` should mean one of these is true:
+
+1. there is an active execution run for the issue
+2. there is a queued/deferred wake that will resume the issue soon
+3. the system has exhausted bounded automatic recovery and has surfaced the issue for explicit human/agent intervention
+
+What must not be allowed as a steady state is:
+
+- assigned
+- `in_progress`
+- no active run
+- no queued continuation
+- no visible escalation
+
+## 5. Proposed Plan
+
+## 5.1 Add a first-class orphaned-issue detector
+
+Introduce a shared helper that identifies an "orphaned in-progress issue":
+
+- `status === "in_progress"`
+- `assigneeAgentId` is present
+- no queued/running run currently owns the issue
+- no deferred wake already exists for the issue
+- `checkoutRunId` is null, missing, or points at a terminal/missing run
+
+This should live close to the existing issue/run ownership logic so the rules do not diverge.
+
+## 5.2 Queue one automatic continuation wake
+
+When a run finishes, after execution-lock release and deferred-wake promotion, check whether the linked issue is now orphaned.
+
+If it is, queue exactly one automatic continuation wake for the same assignee.
+
+Important constraints:
+
+- do not reassign the issue; V1 explicitly avoids automatic reassignment
+- do not reset the issue back to `todo`; it is still owned work
+- do not create duplicate queued continuation wakes if one already exists
+- keep using the existing stale-checkout adoption path so the next run can legally reclaim the old checkout
+
+Suggested wake reason:
+
+- `issue_continuation_needed`
+
+Suggested payload/context fields:
+
+- `issueId`
+- `retryOfRunId`
+- `wakeReason = "issue_continuation_needed"`
+- `retryReason = "issue_continuation_needed"`
+
+## 5.3 Bound retries and escalate explicitly
+
+The continuation wake must be bounded.
+
+Recommended rule:
+
+- first orphaning event: queue one automatic continuation wake
+- if the continuation wake also ends and the issue is still orphaned: stop retrying automatically and surface the problem
+
+Escalation behavior:
+
+- add an issue comment explaining that work is still `in_progress` but no live run remains
+- keep the assignee unchanged
+- move the issue to `blocked` only if we want strict workflow semantics for "waiting on intervention"
+
+My recommendation is:
+
+- keep the first recovery silent except for activity/run events
+- on exhaustion, add a comment and set `status = blocked`
+
+That creates a visible operator queue instead of leaving the issue silently stranded.
+
+## 5.4 Add a background sweep for legacy stranded issues
+
+Run finalization fixes future cases, but it does not repair issues already stranded in existing data.
+
+Add a periodic sweep, alongside other heartbeat housekeeping, that finds issues already matching the orphaned condition and applies the same recovery path.
+
+This sweep should:
+
+- skip issues that already have a queued continuation wake
+- skip issues whose assignee is paused/terminated/pending approval
+- queue a continuation wake when safe
+- otherwise add a visible escalation comment and/or mark `blocked`
+
+This sweep is the backstop for:
+
+- server restarts
+- historical bugs
+- manual DB inconsistencies
+- cases where a run died outside the normal finalization path
+
+## 5.5 Expose the state to operators
+
+Even with auto-recovery, the UI should make the state visible.
+
+Add a derived flag or state in the issue read model, something like:
+
+- `workState = active | queued | orphaned | blocked`
+
+or:
+
+- `needsRecovery = true`
+
+Use that to surface:
+
+- a badge on issue detail and lists when an issue is `in_progress` with no live run
+- a dashboard/inbox count for orphaned assigned work
+
+This is important because the current state is easy to miss: the issue looks "in progress" even when nobody is actually executing it.
+
+## 6. Suggested Implementation Order
+
+## 6.1 Phase 1: continuity on run finalization
+
+Implement the smallest high-confidence fix in `server/src/services/heartbeat.ts`:
+
+- after a run reaches terminal state and issue execution is released/promoted, detect whether the issue is orphaned
+- queue one continuation wake when needed
+- add tests for success, failure, timeout, and cancelled paths where the issue remains `in_progress`
+
+This prevents new stranded issues created by normal run completion.
+
+## 6.2 Phase 2: background sweep
+
+Add a scheduled sweep for existing orphaned issues and for edge cases that bypass normal finalization.
+
+This repairs the current backlog and makes the system robust across restarts.
+
+## 6.3 Phase 3: operator visibility
+
+Expose the derived recovery state in issue APIs and show it in the UI.
+
+This gives humans a direct answer to "what is assigned but not actually being worked right now?"
+
+## 7. Test Plan For The Implementation
+
+The implementation should add focused server tests for:
+
+- a run that ends successfully while the issue remains `in_progress` and assigned queues one continuation wake
+- a run that ends with failure/timeout and leaves the issue orphaned also queues one continuation wake
+- no continuation wake is queued when a deferred wake already exists
+- no duplicate continuation wake is queued when one is already pending
+- the second orphaning event after a continuation retry produces escalation instead of another infinite retry
+- the background sweep recovers a pre-existing orphaned issue
+- paused or terminated assignees are not auto-woken
+
+## 8. Recommendation
+
+The right fix is not automatic reassignment and not silently leaving the issue alone.
+
+The right fix is:
+
+- preserve ownership
+- auto-resume once
+- escalate visibly if continuity still fails
+
+That matches V1's explicit ownership model while closing the current gap where assigned `in_progress` work can stop forever with no signal.