[codex] Harden execution reliability and heartbeat tooling (#3679)

## Thinking Path

> - Paperclip orchestrates AI agents for zero-human companies
> - Reliable execution depends on heartbeat routing, issue lifecycle
semantics, telemetry, and a fast enough local verification loop to keep
regressions visible
> - The remaining commits on this branch were mostly server/runtime
correctness fixes plus test and documentation follow-ups in that area
> - Those changes are logically separate from the UI-focused
issue-detail and workspace/navigation branches even when they touch
overlapping issue APIs
> - This pull request groups the execution reliability, heartbeat,
telemetry, and tooling changes into one standalone branch
> - The benefit is a focused review of the control-plane correctness
work, including the follow-up fix that restored the implicit
comment-reopen helpers after branch splitting

## What Changed

- Hardened issue/heartbeat execution behavior, including self-review
stage skipping, deferred mention wakes during active execution, stranded
execution recovery, active-run scoping, assignee resolution, and
blocked-to-todo wake resumption
- Reduced noisy polling/logging overhead by trimming issue run payloads,
compacting persisted run logs, silencing high-volume request logs, and
capping heartbeat-run queries in dashboard/inbox surfaces
- Expanded telemetry and status semantics with adapter/model fields on
task completion plus clearer status guidance in docs/onboarding material
- Updated test infrastructure and verification defaults with faster
route-test module isolation, cheaper default `pnpm test`, e2e isolation
from local state, and repo verification follow-ups
- Included docs/release housekeeping from the branch and added a small
follow-up commit restoring the implicit comment-reopen helpers that were
dropped during branch reconstruction

## Verification

- `pnpm vitest run
server/src/__tests__/issue-comment-reopen-routes.test.ts
server/src/__tests__/issue-telemetry-routes.test.ts`
- `pnpm vitest run server/src/__tests__/http-log-policy.test.ts
server/src/__tests__/heartbeat-run-log.test.ts
server/src/__tests__/health.test.ts`
- `server/src/__tests__/activity-service.test.ts`,
`server/src/__tests__/heartbeat-comment-wake-batching.test.ts`, and
`server/src/__tests__/heartbeat-process-recovery.test.ts` were attempted
on this host but the embedded Postgres harness reported
init-script/data-dir problems and skipped or failed to start, so they
are noted as environment-limited

## Risks

- Medium: this branch changes core issue/heartbeat routing and
reopen/wakeup behavior, so regressions would affect agent execution flow
rather than isolated UI polish
- Because it also updates verification infrastructure, reviewers should
pay attention to whether the new tests are asserting the right failure
modes and not just reshaping harness behavior

## Model Used

- OpenAI Codex coding agent (GPT-5-class runtime in Codex CLI; exact
deployed model ID is not exposed in this environment), reasoning
enabled, tool use and local code execution enabled

## Checklist

- [x] I have included a thinking path that traces from project context
to this change
- [x] I have specified the model used (with version and capability
details)
- [ ] I have run tests locally and they pass
- [x] I have added or updated tests where applicable
- [ ] If this change affects the UI, I have included before/after
screenshots
- [x] I have updated relevant documentation to reflect my changes
- [x] I have considered and documented any risks above
- [x] I will address all Greptile and reviewer comments before
requesting merge

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
This commit is contained in:
Dotta 2026-04-14 13:34:52 -05:00 committed by GitHub
parent e89076148a
commit 7f893ac4ec
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
106 changed files with 4682 additions and 713 deletions

View file

@ -79,6 +79,29 @@ Allow additional private hostnames (for example custom Tailscale hostnames):
pnpm paperclipai allowed-hostname dotta-macbook-pro
```
## Test Commands
Use the cheap local default unless you are specifically working on browser flows:
```sh
pnpm test
```
`pnpm test` runs the Vitest suite only. For interactive Vitest watch mode use:
```sh
pnpm test:watch
```
Browser suites stay separate:
```sh
pnpm test:e2e
pnpm test:release-smoke
```
These browser suites are intended for targeted local verification and CI, not the default agent/human test command.
## One-Command Local Run
For a first-time local install, you can bootstrap and run in one command:

View file

@ -395,6 +395,8 @@ Side effects:
- entering `done` sets `completed_at`
- entering `cancelled` sets `cancelled_at`
Detailed ownership, execution, blocker, and crash-recovery semantics are documented in `doc/execution-semantics.md`.
## 8.3 Approval Status
- `pending -> approved | rejected | cancelled`

252
doc/execution-semantics.md Normal file
View file

@ -0,0 +1,252 @@
# Execution Semantics
Status: Current implementation guide
Date: 2026-04-13
Audience: Product and engineering
This document explains how Paperclip interprets issue assignment, issue status, execution runs, wakeups, parent/sub-issue structure, and blocker relationships.
`doc/SPEC-implementation.md` remains the V1 contract. This document is the detailed execution model behind that contract.
## 1. Core Model
Paperclip separates four concepts that are easy to blur together:
1. structure: parent/sub-issue relationships
2. dependency: blocker relationships
3. ownership: who is responsible for the issue now
4. execution: whether the control plane currently has a live path to move the issue forward
The system works best when those are kept separate.
## 2. Assignee Semantics
An issue has at most one assignee.
- `assigneeAgentId` means the issue is owned by an agent
- `assigneeUserId` means the issue is owned by a human board user
- both cannot be set at the same time
This is a hard invariant. Paperclip is single-assignee by design.
## 3. Status Semantics
Paperclip issue statuses are not just UI labels. They imply different expectations about ownership and execution.
### `backlog`
The issue is not ready for active work.
- no execution expectation
- no pickup expectation
- safe resting state for future work
### `todo`
The issue is actionable but not actively claimed.
- it may be assigned or unassigned
- no checkout/execution lock is required yet
- for agent-assigned work, Paperclip may still need a wake path to ensure the assignee actually sees it
### `in_progress`
The issue is actively owned work.
- requires an assignee
- for agent-owned issues, this is a strict execution-backed state
- for user-owned issues, this is a human ownership state and is not backed by heartbeat execution
For agent-owned issues, `in_progress` should not be allowed to become a silent dead state.
### `blocked`
The issue cannot proceed until something external changes.
This is the right state for:
- waiting on another issue
- waiting on a human decision
- waiting on an external dependency or system
- work that automatic recovery could not safely continue
### `in_review`
Execution work is paused because the next move belongs to a reviewer or approver, not the current executor.
### `done`
The work is complete and terminal.
### `cancelled`
The work will not continue and is terminal.
## 4. Agent-Owned vs User-Owned Execution
The execution model differs depending on assignee type.
### Agent-owned issues
Agent-owned issues are part of the control plane's execution loop.
- Paperclip can wake the assignee
- Paperclip can track runs linked to the issue
- Paperclip can recover some lost execution state after crashes/restarts
### User-owned issues
User-owned issues are not executed by the heartbeat scheduler.
- Paperclip can track the ownership and status
- Paperclip cannot rely on heartbeat/run semantics to keep them moving
- stranded-work reconciliation does not apply to them
This is why `in_progress` can be strict for agents without forcing the same runtime rules onto human-held work.
## 5. Checkout and Active Execution
Checkout is the bridge from issue ownership to active agent execution.
- checkout is required to move an issue into agent-owned `in_progress`
- `checkoutRunId` represents issue-ownership lock for the current agent run
- `executionRunId` represents the currently active execution path for the issue
These are related but not identical:
- `checkoutRunId` answers who currently owns execution rights for the issue
- `executionRunId` answers which run is actually live right now
Paperclip already clears stale execution locks and can adopt some stale checkout locks when the original run is gone.
## 6. Parent/Sub-Issue vs Blockers
Paperclip uses two different relationships for different jobs.
### Parent/Sub-Issue (`parentId`)
This is structural.
Use it for:
- work breakdown
- rollup context
- explaining why a child issue exists
- waking the parent assignee when all direct children become terminal
Do not treat `parentId` as execution dependency by itself.
### Blockers (`blockedByIssueIds`)
This is dependency semantics.
Use it for:
- \"this issue cannot continue until that issue changes state\"
- explicit waiting relationships
- automatic wakeups when all blockers resolve
If a parent is truly waiting on a child, model that with blockers. Do not rely on the parent/child relationship alone.
## 7. Consistent Execution Path Rules
For agent-assigned, non-terminal, actionable issues, Paperclip should not leave work in a state where nobody is working it and nothing will wake it.
The relevant execution path depends on status.
### Agent-assigned `todo`
This is dispatch state: ready to start, not yet actively claimed.
A healthy dispatch state means at least one of these is true:
- the issue already has a queued/running wake path
- the issue is intentionally resting in `todo` after a successful agent heartbeat, not after an interrupted dispatch
- the issue has been explicitly surfaced as stranded
### Agent-assigned `in_progress`
This is active-work state.
A healthy active-work state means at least one of these is true:
- there is an active run for the issue
- there is already a queued continuation wake
- the issue has been explicitly surfaced as stranded
## 8. Crash and Restart Recovery
Paperclip now treats crash/restart recovery as a stranded-assigned-work problem, not just a stranded-run problem.
There are two distinct failure modes.
### 8.1 Stranded assigned `todo`
Example:
- issue is assigned to an agent
- status is `todo`
- the original wake/run died during or after dispatch
- after restart there is no queued wake and nothing picks the issue back up
Recovery rule:
- if the latest issue-linked run failed/timed out/cancelled and no live execution path remains, Paperclip queues one automatic assignment recovery wake
- if that recovery wake also finishes and the issue is still stranded, Paperclip moves the issue to `blocked` and posts a visible comment
This is a dispatch recovery, not a continuation recovery.
### 8.2 Stranded assigned `in_progress`
Example:
- issue is assigned to an agent
- status is `in_progress`
- the live run disappeared
- after restart there is no active run and no queued continuation
Recovery rule:
- Paperclip queues one automatic continuation wake
- if that continuation wake also finishes and the issue is still stranded, Paperclip moves the issue to `blocked` and posts a visible comment
This is an active-work continuity recovery.
## 9. Startup and Periodic Reconciliation
Startup recovery and periodic recovery are different from normal wakeup delivery.
On startup and on the periodic recovery loop, Paperclip now does three things in sequence:
1. reap orphaned `running` runs
2. resume persisted `queued` runs
3. reconcile stranded assigned work
That last step is what closes the gap where issue state survives a crash but the wake/run path does not.
## 10. What This Does Not Mean
These semantics do not change V1 into an auto-reassignment system.
Paperclip still does not:
- automatically reassign work to a different agent
- infer dependency semantics from `parentId` alone
- treat human-held work as heartbeat-managed execution
The recovery model is intentionally conservative:
- preserve ownership
- retry once when the control plane lost execution continuity
- escalate visibly when the system cannot safely keep going
## 11. Practical Interpretation
For a board operator, the intended meaning is:
- agent-owned `in_progress` should mean \"this is live work or clearly surfaced as a problem\"
- agent-owned `todo` should not stay assigned forever after a crash with no remaining wake path
- parent/sub-issue explains structure
- blockers explain waiting
That is the execution contract Paperclip should present to operators.

View file

@ -0,0 +1,382 @@
# VS Code Task Interoperability Plan
Status: planning only, no code changes
Date: 2026-04-12
Related issue: `PAP-1377`
## Summary
Paperclip should not replace its workspace runtime service model with VS Code tasks.
It should add a narrow interoperability layer that can discover and adopt supported entries from `.vscode/tasks.json`.
The core product model should stay:
- Paperclip owns long-running workspace services and their desired state
- Paperclip shows operators exactly which named thing they are starting or stopping
- Paperclip distinguishes long-running services from one-shot jobs
VS Code tasks should be treated as:
- an import/discovery format for workspace commands
- a convenience for repos that already maintain `tasks.json`
- a partial compatibility layer, not a full execution model
## Current State
The current implementation is already service-oriented:
- project workspaces and execution workspaces can store `workspaceRuntime` config plus `desiredState` and per-service `serviceStates`
- the UI renders one control row per configured service and persists start/stop intent
- the backend supervises long-running local processes, reuses eligible services, and restores desired services on startup
Relevant files:
- `packages/shared/src/types/workspace-runtime.ts`
- `server/src/services/workspace-runtime.ts`
- `server/src/services/project-workspace-runtime-config.ts`
- `ui/src/components/WorkspaceRuntimeControls.tsx`
- `ui/src/pages/ProjectWorkspaceDetail.tsx`
- `ui/src/pages/ExecutionWorkspaceDetail.tsx`
This is directionally correct for Paperclip because it gives the control plane an explicit model for service lifecycle, health, reuse, and restart behavior.
## Problem To Solve
The current UX is still too raw:
- operators have to hand-author runtime JSON
- a workspace can have multiple attached services, but the higher-level intent is not obvious
- start/stop controls are visible in multiple places, which makes it easy to lose track of what is being controlled
- there is no interoperability with repos that already define useful local workflows in `.vscode/tasks.json`
The issue is not that services are the wrong abstraction.
The issue is that the configuration surface is too low-level and Paperclip does not yet leverage existing workspace metadata.
## Recommendation
Keep Paperclip runtime services as the source of truth for service supervision.
Add a new workspace command model above the raw JSON layer, with VS Code task discovery as one input.
The product model should become:
1. `Workspace command`
A named runnable thing attached to a workspace.
2. `Workspace service`
A workspace command that is expected to stay alive and be supervised.
3. `Workspace job`
A workspace command that runs once and exits.
4. `Runtime service instance`
The live process record that already exists today in Paperclip.
In that model, VS Code tasks are a way to populate workspace commands.
Only commands that map cleanly to Paperclip service or job semantics should become runnable in Paperclip.
## Why Not Fully Adopt VS Code Tasks
VS Code tasks are broader than Paperclip runtime services.
They include shell/process tasks, compound tasks, background/watch tasks, presentation settings, extension/task-provider types, variable substitution, and problem-matcher-driven lifecycle.
That creates a bad fit if Paperclip tries to use `tasks.json` as its only runtime model:
- many tasks are one-shot jobs, not long-running services
- some tasks depend on VS Code task providers or editor-only variable resolution
- compound task graphs are useful, but they are not the same thing as a supervised service
- problem matcher readiness is useful metadata, but it is not enough to replace Paperclip's persisted service lifecycle model
The right boundary is interoperability, not replacement.
## Interoperability Contract
Paperclip should support a conservative subset of VS Code tasks and clearly mark unsupported entries.
### Supported in phase 1
- `shell` and `process` tasks with a concrete command Paperclip can resolve
- optional task `options.cwd`
- optional task environment values that can be flattened safely
- task labels and detail text for naming and display
- `dependsOn` for import-time expansion or display-only dependency hints
- background/watch-oriented tasks that can reasonably be treated as long-running services
### Maybe supported in later phases
- grouping and default task metadata for better UX
- selected variable substitution when Paperclip can resolve it safely from workspace context
- mapping task metadata into Paperclip readiness/expose hints
- limited compound-task launch flows
### Not supported initially
- extension-provided task types Paperclip cannot execute directly
- arbitrary VS Code variable substitution semantics
- problem matcher parsing as the main source of service health
- full parity with VS Code task execution behavior
## Long-Running Service Detection
Paperclip needs an explicit classification layer instead of assuming every VS Code task is a service.
Recommended classification:
- `service`
Explicitly marked by Paperclip metadata, or confidently inferred from background/watch task semantics
- `job`
One-shot command expected to exit
- `unsupported`
Present in `tasks.json`, but not safely runnable by Paperclip
The important product decision is that service classification must be visible and editable by the operator.
Inference can help, but it should not be the only source of truth.
## Proposed Product Shape
### 1. Replace raw-first editing with command-first editing
Project and execution workspace pages should stop making raw runtime JSON the primary editing surface.
Default UI should show:
- workspace commands
- command type: service or job
- source: Paperclip or VS Code
- exact command and cwd
- current state for services
- explicit start, stop, restart, and run-now actions
Raw JSON should remain available behind an advanced section.
### 2. Add VS Code task discovery on workspaces
For a workspace with `cwd`, Paperclip should look for `.vscode/tasks.json`.
The workspace UI should show:
- whether a `tasks.json` file was found
- last parse time
- supported commands discovered
- unsupported tasks with reasons
- whether commands are inherited into execution workspaces
### 3. Make the controlled thing explicit
Start and stop UI should always name the exact entry being controlled.
Examples:
- `Start web`
- `Stop api`
- `Run db:migrate`
Avoid generic workspace-level labels when multiple commands exist.
### 4. Separate services from jobs in the UI
Do not mix one-shot jobs and long-running services into one undifferentiated list.
Recommended sections:
- `Services`
- `Jobs`
- `Unsupported imported tasks`
That resolves the ambiguity called out in the issue.
## Data Model Direction
Do not replace `workspaceRuntime` immediately.
Instead add a higher-level representation that can compile down to the existing runtime-service machinery.
Suggested workspace metadata shape:
```ts
type WorkspaceCommandSource =
| { type: "paperclip" }
| { type: "vscode_task"; taskLabel: string; taskPath: ".vscode/tasks.json" };
type WorkspaceCommandKind = "service" | "job";
type WorkspaceCommandDefinition = {
id: string;
name: string;
kind: WorkspaceCommandKind;
source: WorkspaceCommandSource;
command: string | null;
cwd: string | null;
env?: Record<string, string> | null;
autoStart?: boolean;
serviceConfig?: {
lifecycle?: "shared" | "ephemeral";
reuseScope?: "project_workspace" | "execution_workspace" | "run";
readiness?: Record<string, unknown> | null;
expose?: Record<string, unknown> | null;
} | null;
importWarnings?: string[];
disabledReason?: string | null;
};
```
`workspaceRuntime` can then become a derived or advanced representation for service-type commands until the rest of the system is migrated.
## VS Code Mapping Rules
Paperclip should map imported tasks with explicit, documented rules.
Recommended rules:
1. A task becomes a `job` by default.
2. A task becomes a `service` only when:
- Paperclip metadata marks it as a service, or
- the task clearly represents a background/watch process and the operator confirms the classification.
3. Unsupported tasks stay visible but disabled.
4. Task labels become default command names.
5. `dependsOn` is preserved as metadata, not silently flattened into hidden behavior.
Paperclip-specific metadata can live in a namespaced field on the imported task definition, for example:
```json
{
"label": "web",
"type": "shell",
"command": "pnpm dev",
"isBackground": true,
"paperclip": {
"kind": "service",
"readiness": {
"type": "http",
"urlTemplate": "http://127.0.0.1:${port}"
},
"expose": {
"type": "url",
"urlTemplate": "http://127.0.0.1:${port}"
}
}
}
```
That gives us interoperability without depending on VS Code-only semantics for service readiness and exposure.
## Execution Policy
Project workspaces should be the main place where imported commands are discovered and curated.
Execution workspaces should inherit that curated command set by default, with optional issue-level overrides.
Recommended precedence:
1. execution workspace override
2. project workspace command set
3. imported VS Code tasks from the linked workspace
4. advanced raw runtime fallback
This matches the existing direction in `doc/plans/2026-03-10-workspace-strategy-and-git-worktrees.md`.
## Implementation Plan
### Phase 1: Discovery and read-only visibility
Goal:
show imported VS Code tasks in the workspace UI without changing runtime behavior.
Work:
- parse `.vscode/tasks.json` for project workspaces with local `cwd`
- derive a list of candidate commands plus unsupported items
- show source, label, command, cwd, and classification
- show parse warnings and unsupported reasons
Success condition:
an operator can see what Paperclip would import and why.
### Phase 2: Command model and explicit classification
Goal:
introduce a first-class workspace command layer above raw runtime JSON.
Work:
- add a persisted command definition model in workspace metadata or a dedicated table
- allow operator edits to imported command classification
- separate `service` and `job` in UI
- keep existing runtime-service storage for live supervised processes
Success condition:
the workspace UI is command-first, and raw runtime JSON is advanced-only.
### Phase 3: Service execution backed by existing runtime supervisor
Goal:
run supported imported service commands through the current Paperclip supervisor.
Work:
- compile service commands into the existing runtime service start/stop path
- persist desired state per named command
- keep startup restoration behavior for service commands
- make the active command name explicit everywhere control actions appear
Success condition:
imported service commands behave like native Paperclip services once adopted.
### Phase 4: Job execution and optional dependency handling
Goal:
support one-shot imported commands without pretending they are services.
Work:
- add `Run` actions for jobs
- record output in workspace operations
- optionally support simple `dependsOn` execution for jobs with clear logging
Success condition:
one-shot tasks are runnable, but they are not mixed into the service lifecycle model.
### Phase 5: Adapter and execution workspace integration
Goal:
let agents and issue-scoped workspaces consume the curated command model consistently.
Work:
- expose inherited workspace commands to execution workspaces
- allow issue-level selection of a default service command when relevant
- make service selection explicit in issue and workspace views
Success condition:
agents, operators, and workspaces all refer to the same named commands.
## Non-Goals
- full VS Code task-runner parity
- support for every VS Code task type
- removal of Paperclip's own runtime supervision model
- editor-dependent execution semantics inside the control plane
## Risks
- overfitting Paperclip to VS Code and making the model worse for non-VS-Code repos
- misclassifying watch tasks as durable services
- hiding too much detail and making debugging harder
- allowing imported task graphs to become implicit magic
These risks are manageable if the import layer stays explicit, conservative, and operator-editable.
## Decision
Paperclip should adopt VS Code tasks as an optional workspace command source, not as the canonical runtime model.
The main UX change should be:
- move from raw runtime JSON to named workspace commands
- separate services from jobs
- make the exact controlled command explicit
- let `.vscode/tasks.json` pre-populate those commands when available
## External References
- VS Code tasks documentation: https://code.visualstudio.com/docs/debugtest/tasks
- Existing Paperclip workspace plan: `doc/plans/2026-03-10-workspace-strategy-and-git-worktrees.md`