2026-04-08 In-Progress Issue Recovery Plan

Status: Proposed
Date: 2026-04-08
Audience: Product and engineering
Related:

  • server/src/services/heartbeat.ts
  • server/src/services/issues.ts
  • server/src/services/issue-assignment-wakeup.ts
  • server/src/routes/issues.ts
  • server/src/__tests__/heartbeat-process-recovery.test.ts
  • server/src/__tests__/issues-checkout-wakeup.test.ts
  • PAP-1227

1. Purpose

This note defines how Paperclip should handle an issue that is:

  • still in_progress
  • still assigned
  • but no longer has anyone actively working on it

The problem is not just stale UI. It is a control-plane gap: the issue still looks owned, but no future wake is guaranteed, so work can stop indefinitely.

2. Current Behavior

Paperclip already has several partial protections:

  • checkout adoption when a stale checkoutRunId points at a terminal or missing run in server/src/services/issues.ts
  • execution lock cleanup when executionRunId points at a non-active run in both issues.ts and heartbeat.ts
  • orphaned local process recovery in heartbeat.reapOrphanedRuns()
  • deferred wake promotion in releaseIssueExecutionAndPromote()
  • one follow-up retry when a run ends without posting an issue comment

What is still missing is a continuity rule for the issue itself.

When a heartbeat run finishes and the issue remains in_progress, Paperclip currently clears executionRunId and may promote an already-deferred wake. If there is no deferred wake, the issue is simply left assigned and in_progress.

That means an issue can legitimately end up in this state:

  • status = in_progress
  • assigneeAgentId != null
  • executionRunId = null
  • checkoutRunId points at an old finished run, or is otherwise stale
  • no queued/running wake exists for the issue

At that point, nothing automatically resumes the work.

3. Root Cause

The system enforces comment continuity, but not execution continuity.

Today the lifecycle is effectively:

  1. wake the assignee
  2. run one heartbeat
  3. require a comment
  4. stop unless some other event happens

That is fine for tasks that move themselves to done, blocked, or in_review in one heartbeat. It fails for work that legitimately spans multiple heartbeats but does not produce a new external trigger.

This is why the issue can "just sit there": there is no invariant saying "in_progress must imply an active run, a queued continuation, or an explicit waiting state."

4. Desired Invariant

For an assigned issue, in_progress should mean one of these is true:

  1. there is an active execution run for the issue
  2. there is a queued/deferred wake that will resume the issue soon
  3. the system has exhausted bounded automatic recovery and has surfaced the issue for explicit human/agent intervention

What must not be allowed as a steady state is:

  • assigned
  • in_progress
  • no active run
  • no queued continuation
  • no visible escalation
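
The invariant can be written as a single predicate, which makes it easy to assert in tests or in a sweep. The sketch below is illustrative only: the snapshot shape and its field names (hasActiveRun, hasQueuedOrDeferredWake, hasVisibleEscalation) are assumptions, not existing Paperclip types.

```typescript
// Hypothetical snapshot of the facts the invariant depends on.
// Field names are illustrative and not taken from the Paperclip schema.
interface IssueContinuitySnapshot {
  status: string;
  assigneeAgentId: string | null;
  hasActiveRun: boolean;
  hasQueuedOrDeferredWake: boolean;
  hasVisibleEscalation: boolean;
}

// Returns true when an assigned in_progress issue has no active run,
// no queued continuation, and no visible escalation.
function violatesInProgressInvariant(s: IssueContinuitySnapshot): boolean {
  if (s.status !== "in_progress" || s.assigneeAgentId === null) {
    return false; // the invariant only constrains assigned in_progress issues
  }
  return !s.hasActiveRun && !s.hasQueuedOrDeferredWake && !s.hasVisibleEscalation;
}
```

Any one of the three flags being true is enough to satisfy the invariant, which mirrors the numbered list above.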

5. Proposed Plan

5.1 Add a first-class orphaned-issue detector

Introduce a shared helper that identifies an "orphaned in-progress issue":

  • status === "in_progress"
  • assigneeAgentId is present
  • no queued/running run currently owns the issue
  • no deferred wake already exists for the issue
  • checkoutRunId is null, missing, or points at a terminal/missing run

This should live close to the existing issue/run ownership logic so the rules do not diverge.
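
A minimal sketch of such a helper follows, one check per condition in the list above. The types, the run status names, and the lookup context are all assumptions made for illustration; the real helper would read these facts from the existing issue and run tables rather than from in-memory sets.

```typescript
// Assumed run statuses; the real set in heartbeat.ts may differ.
type RunStatus = "queued" | "running" | "succeeded" | "failed" | "timed_out" | "cancelled";

// Hypothetical minimal issue shape, limited to the fields the detector needs.
interface IssueRow {
  id: string;
  status: string;
  assigneeAgentId: string | null;
  checkoutRunId: string | null;
}

// Hypothetical lookup context, pre-loaded by the caller.
interface OwnershipContext {
  runStatusById: Map<string, RunStatus>; // a run absent from the map is "missing"
  issueIdsWithLiveRun: Set<string>;      // issues owned by a queued/running run
  issueIdsWithDeferredWake: Set<string>; // issues with a deferred wake pending
}

function isLiveStatus(status: RunStatus | undefined): boolean {
  return status === "queued" || status === "running";
}

function isOrphanedInProgressIssue(issue: IssueRow, ctx: OwnershipContext): boolean {
  if (issue.status !== "in_progress") return false;
  if (issue.assigneeAgentId === null) return false;
  if (ctx.issueIdsWithLiveRun.has(issue.id)) return false;
  if (ctx.issueIdsWithDeferredWake.has(issue.id)) return false;
  // A checkout that still points at a live run means the issue is owned.
  if (issue.checkoutRunId !== null && isLiveStatus(ctx.runStatusById.get(issue.checkoutRunId))) {
    return false;
  }
  return true;
}
```

Keeping the function pure (all lookups passed in) would let run finalization and a background sweep share the same rule without diverging.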

5.2 Queue one automatic continuation wake

When a run finishes, after execution-lock release and deferred-wake promotion, check whether the linked issue is now orphaned.

If it is, queue exactly one automatic continuation wake for the same assignee.

Important constraints:

  • do not reassign the issue; V1 explicitly avoids automatic reassignment
  • do not reset the issue back to todo; it is still owned work
  • do not create duplicate queued continuation wakes if one already exists
  • keep using the existing stale-checkout adoption path so the next run can legally reclaim the old checkout

Suggested wake reason:

  • issue_continuation_needed

Suggested payload/context fields:

  • issueId
  • retryOfRunId
  • wakeReason = "issue_continuation_needed"
  • retryReason = "issue_continuation_needed"
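
The constraints and payload above could be combined roughly as follows. This is a sketch under assumptions: the WakeRequest shape and the in-memory queue are hypothetical stand-ins for whatever the existing wakeup service uses, and only the wake reason and payload field names come from this plan.

```typescript
const CONTINUATION_REASON = "issue_continuation_needed";

// Hypothetical wake request shape; only the payload field names come from the plan.
interface WakeRequest {
  agentId: string;
  reason: string;
  payload: {
    issueId: string;
    retryOfRunId: string;
    wakeReason: string;
    retryReason: string;
  };
}

// Queues at most one continuation wake per issue, for the current assignee.
// Returns the queued request, or null when one is already pending.
function queueContinuationWake(
  queue: WakeRequest[],
  issue: { id: string; assigneeAgentId: string },
  finishedRunId: string,
): WakeRequest | null {
  const alreadyQueued = queue.some(
    (w) => w.reason === CONTINUATION_REASON && w.payload.issueId === issue.id,
  );
  if (alreadyQueued) return null; // no duplicate continuation wakes

  const wake: WakeRequest = {
    agentId: issue.assigneeAgentId, // same assignee: no reassignment
    reason: CONTINUATION_REASON,
    payload: {
      issueId: issue.id,
      retryOfRunId: finishedRunId,
      wakeReason: CONTINUATION_REASON,
      retryReason: CONTINUATION_REASON,
    },
  };
  queue.push(wake);
  return wake;
}
```

The function deliberately never touches issue status or assignee, so the issue stays in_progress and owned while the wake is pending.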

5.3 Bound retries and escalate explicitly

The continuation wake must be bounded.

Recommended rule:

  • first orphaning event: queue one automatic continuation wake
  • if the continuation wake also ends and the issue is still orphaned: stop retrying automatically and surface the problem

Escalation behavior:

  • add an issue comment explaining that work is still in_progress but no live run remains
  • keep the assignee unchanged
  • move the issue to blocked only if we want strict workflow semantics for "waiting on intervention"

My recommendation is:

  • keep the first recovery silent except for activity/run events
  • on exhaustion, add a comment and set status = blocked

That creates a visible operator queue instead of leaving the issue silently stranded.
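
One way to express the bound is to key the decision on the wake reason of the run that just finished. The sketch below assumes the finished run's wake reason is available at finalization; the RecoveryAction type and the comment text are illustrative, not an existing API.

```typescript
const CONTINUATION_WAKE_REASON = "issue_continuation_needed";

// Hypothetical result type describing what finalization should do next.
type RecoveryAction =
  | { kind: "queue_continuation"; wakeReason: string }
  | { kind: "escalate"; nextStatus: "blocked"; comment: string };

// Called only after the issue has been found orphaned.
// finishedRunWakeReason is the wake reason of the run that just ended (null if unknown).
function decideRecovery(finishedRunWakeReason: string | null): RecoveryAction {
  if (finishedRunWakeReason === CONTINUATION_WAKE_REASON) {
    // The single automatic retry has already been spent: stop and surface it.
    return {
      kind: "escalate",
      nextStatus: "blocked",
      comment:
        "This issue is still in_progress but no live run remains after an automatic continuation. Intervention is needed.",
    };
  }
  // First orphaning event: retry once, silently.
  return { kind: "queue_continuation", wakeReason: CONTINUATION_WAKE_REASON };
}
```

Deriving the retry count from the finished run's wake reason avoids a separate counter column, at the cost of allowing exactly one retry; a larger bound would need explicit state.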

5.4 Add a background sweep for legacy stranded issues

Run finalization fixes future cases, but it does not repair issues already stranded in existing data.

Add a periodic sweep, alongside other heartbeat housekeeping, that finds issues already matching the orphaned condition and applies the same recovery path.

This sweep should:

  • skip issues that already have a queued continuation wake
  • skip issues whose assignee is paused/terminated/pending approval
  • queue a continuation wake when safe
  • otherwise add a visible escalation comment and/or mark blocked

This sweep is the backstop for:

  • server restarts
  • historical bugs
  • manual DB inconsistencies
  • cases where a run died outside the normal finalization path
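
The per-issue decision inside the sweep might look like the sketch below. It assumes the orphaned candidates have already been selected by the shared detector, and it reads "when safe" as "the single automatic retry has not been used yet", which is an interpretation rather than something the plan states. The agent state names and the candidate shape are hypothetical.

```typescript
// Assumed agent states; the real enum may use different names.
type AgentState = "active" | "paused" | "terminated" | "pending_approval";

// Hypothetical candidate row: an issue the detector already flagged as orphaned.
interface OrphanCandidate {
  issueId: string;
  assigneeState: AgentState;
  hasQueuedContinuation: boolean;
  continuationAlreadyTried: boolean; // assumption: derived from run history
}

type SweepOutcome = "skip" | "queue_continuation" | "escalate";

function planSweepOutcome(c: OrphanCandidate): SweepOutcome {
  if (c.hasQueuedContinuation) return "skip";        // recovery already pending
  if (c.assigneeState !== "active") return "skip";   // never auto-wake paused/terminated/pending agents
  if (!c.continuationAlreadyTried) return "queue_continuation";
  return "escalate";                                  // comment and/or mark blocked
}

// Plans outcomes for a whole batch without performing any side effects.
function planSweep(candidates: OrphanCandidate[]): Map<string, SweepOutcome> {
  const plan = new Map<string, SweepOutcome>();
  for (const c of candidates) {
    plan.set(c.issueId, planSweepOutcome(c));
  }
  return plan;
}
```

Separating planning from execution keeps the sweep easy to test and makes a dry-run mode cheap to add.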

5.5 Expose the state to operators

Even with auto-recovery, the UI should make the state visible.

Add a derived flag or state in the issue read model, something like:

  • workState = active | queued | orphaned | blocked

or:

  • needsRecovery = true

Use that to surface:

  • a badge on issue detail and lists when an issue is in_progress with no live run
  • a dashboard/inbox count for orphaned assigned work

This is important because the current state is easy to miss: the issue looks "in progress" even when nobody is actually executing it.
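
As an illustration of the first option, the derived state could be computed in the read model along these lines. The input shape is an assumption; only the four workState values come from the list above, and returning null for other statuses is a choice made for this sketch.

```typescript
type WorkState = "active" | "queued" | "orphaned" | "blocked";

// Hypothetical read-model input; field names are illustrative.
interface IssueReadInput {
  status: string;
  hasRunningRun: boolean;
  hasQueuedWake: boolean;
}

// Returns null for statuses where workState is not meaningful (todo, done, ...).
function deriveWorkState(issue: IssueReadInput): WorkState | null {
  if (issue.status === "blocked") return "blocked";
  if (issue.status !== "in_progress") return null;
  if (issue.hasRunningRun) return "active";
  if (issue.hasQueuedWake) return "queued";
  return "orphaned";
}

// Count used for a dashboard/inbox badge of orphaned assigned work.
function countOrphaned(issues: IssueReadInput[]): number {
  return issues.filter((i) => deriveWorkState(i) === "orphaned").length;
}
```

With this shape, the needsRecovery alternative is simply `deriveWorkState(issue) === "orphaned"`.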

6. Suggested Implementation Order

6.1 Phase 1: continuity on run finalization

Implement the smallest high-confidence fix in server/src/services/heartbeat.ts:

  • after a run reaches terminal state and issue execution is released/promoted, detect whether the issue is orphaned
  • queue one continuation wake when needed
  • add tests for success, failure, timeout, and cancelled paths where the issue remains in_progress

This prevents new stranded issues created by normal run completion.

6.2 Phase 2: background sweep

Add a scheduled sweep for existing orphaned issues and for edge cases that bypass normal finalization.

This repairs the current backlog and makes the system robust across restarts.

6.3 Phase 3: operator visibility

Expose the derived recovery state in issue APIs and show it in the UI.

This gives humans a direct answer to "what is assigned but not actually being worked on right now?"

7. Test Plan For The Implementation

The implementation should add focused server tests for:

  • a run that ends successfully while the issue remains in_progress and assigned queues one continuation wake
  • a run that ends with failure/timeout and leaves the issue orphaned also queues one continuation wake
  • no continuation wake is queued when a deferred wake already exists
  • no duplicate continuation wake is queued when one is already pending
  • a second orphaning event after a continuation retry produces escalation instead of another retry
  • the background sweep recovers a pre-existing orphaned issue
  • paused or terminated assignees are not auto-woken

8. Recommendation

The right fix is not automatic reassignment and not silently leaving the issue alone.

The right fix is:

  • preserve ownership
  • auto-resume once
  • escalate visibly if continuity still fails

That matches V1's explicit ownership model while closing the current gap where assigned in_progress work can stop forever with no signal.