mirror of
https://github.com/alkimake/paperclip.git
synced 2026-06-14 01:50:39 +09:00
238 lines
8.5 KiB
Markdown
238 lines
8.5 KiB
Markdown
# PAP-1231 Agent Browser Process Cleanup Plan
|
|
|
|
Status: Proposed
|
|
Date: 2026-04-08
|
|
Related issue: `PAP-1231`
|
|
Audience: Engineering
|
|
|
|
## Goal
|
|
|
|
Explain why browser processes accumulate during local agent runs and define a cleanup plan that fixes the general process-ownership problem rather than treating `agent-browser` as a one-off.
|
|
|
|
## Short answer
|
|
|
|
Yes, there is a likely root cause in Paperclip's local execution model.
|
|
|
|
Today, heartbeat-run local adapters persist and manage only the top-level spawned PID. Their timeout/cancel path uses direct `child.kill()` semantics. That is weaker than the runtime-service path, which already tracks and terminates whole process groups.
|
|
|
|
If Codex, Claude, Cursor, or a skill launched through them starts Chrome or Chromium helpers, Paperclip can lose ownership of those descendants even when it still believes it handled the run correctly.
|
|
|
|
## Observed implementation facts
|
|
|
|
### 1. Heartbeat-run local adapters track only one PID
|
|
|
|
`packages/adapter-utils/src/server-utils.ts`
|
|
|
|
- `runChildProcess()` spawns the adapter command and records only `child.pid`
|
|
- timeout handling sends `SIGTERM` and then `SIGKILL` to the direct child
|
|
- there is no process-group creation or process-group kill path there today
|
|
|
|
`packages/db/src/schema/heartbeat_runs.ts`
|
|
|
|
- `heartbeat_runs` stores `process_pid`
|
|
- there is no persisted `process_group_id`
|
|
|
|
`server/src/services/heartbeat.ts`
|
|
|
|
- cancellation logic uses the in-memory child handle and calls `child.kill()`
|
|
- orphaned-run recovery checks whether the recorded direct PID is alive
|
|
- the recovery model is built around one tracked process, not a descendant tree
|
|
|
|
### 2. Workspace runtime already uses stronger ownership
|
|
|
|
`server/src/services/workspace-runtime.ts`
|
|
|
|
- runtime services are spawned with `detached: process.platform !== "win32"`
|
|
- the service record stores `processGroupId`
|
|
- shutdown calls `terminateLocalService()` with group-aware killing
|
|
|
|
`server/src/services/local-service-supervisor.ts`
|
|
|
|
- `terminateLocalService()` prefers `process.kill(-processGroupId, signal)` on POSIX
|
|
- it escalates from `SIGTERM` to `SIGKILL`
|
|
|
|
This is the clearest internal comparison point: Paperclip already has one local-process subsystem that treats process-group ownership as the right abstraction.
|
|
|
|
### 3. The current recovery path explains why leaks would be visible but hard to reason about
|
|
|
|
If the direct adapter process exits, hangs, or is cancelled after launching a browser subtree:
|
|
|
|
- Paperclip may think it cancelled the run because the parent process is gone
|
|
- descendant Chrome helpers may still be running
|
|
- orphan recovery has no persisted process-group identity to reconcile or reap later
|
|
|
|
That makes the failure look like an `agent-browser` problem when the more general bug is "executor descendants are not owned strongly enough."
|
|
|
|
## Why `agent-browser` makes the problem obvious
|
|
|
|
Inference:
|
|
|
|
- Chromium is intentionally multi-process
|
|
- browser automation often leaves a browser process plus renderer, GPU, utility, and crashpad/helper children
|
|
- skills that open browsers repeatedly amplify the symptom because each run can produce several descendant processes
|
|
|
|
So `agent-browser` is probably not the root cause. It is the workload that exposes the weak ownership model fastest.
|
|
|
|
## Success condition
|
|
|
|
This work is successful when Paperclip can:
|
|
|
|
1. start a local adapter run and own the full descendant tree it created
|
|
2. cancel, timeout, or recover that run without leaving Chrome descendants behind on POSIX
|
|
3. detect and clean up stale local descendants after server restarts
|
|
4. expose enough metadata that operators can see which run owns which spawned process tree
|
|
|
|
## Non-goals
|
|
|
|
Do not:
|
|
|
|
- special-case `agent-browser` only
|
|
- depend on manual `pkill chrome` cleanup as the primary fix
|
|
- require every skill author to add bespoke browser teardown logic before Paperclip can clean up correctly
|
|
- change remote/http adapter behavior as part of the first pass
|
|
|
|
## Proposed plan
|
|
|
|
### Phase 0: reproduce and instrument
|
|
|
|
Objective:
|
|
|
|
- make the leak measurable from Paperclip's side before changing execution semantics
|
|
|
|
Work:
|
|
|
|
- add a reproducible local test script or fixture that launches a child process which itself launches descendants and ignores normal parent exit
|
|
- capture parent PID, descendant PIDs, and run ID in logs during local adapter execution
|
|
- document current behavior separately for:
|
|
- normal completion
|
|
- timeout
|
|
- explicit cancellation
|
|
- server restart during run
|
|
|
|
Deliverable:
|
|
|
|
- one short repro note attached to the implementation issue or child issue
|
|
|
|
### Phase 1: give heartbeat-run local adapters process-group ownership
|
|
|
|
Objective:
|
|
|
|
- align adapter-run execution with the stronger runtime-service model
|
|
|
|
Work:
|
|
|
|
- update `runChildProcess()` to create a dedicated process group on POSIX
|
|
- persist both:
|
|
- direct PID
|
|
- process-group ID
|
|
- update the run cancellation and timeout paths to kill the group first, then escalate
|
|
- keep direct-PID fallback behavior for platforms where group kill is not available
|
|
|
|
Likely touched surfaces:
|
|
|
|
- `packages/adapter-utils/src/server-utils.ts`
|
|
- `packages/db/src/schema/heartbeat_runs.ts`
|
|
- `packages/shared/src/types/heartbeat.ts`
|
|
- `server/src/services/heartbeat.ts`
|
|
|
|
Important design choice:
|
|
|
|
- use the same ownership model for all local child-process adapters, not just Codex or Claude
|
|
|
|
### Phase 2: make restart recovery group-aware
|
|
|
|
Objective:
|
|
|
|
- prevent stale descendants from surviving server crashes or restarts indefinitely
|
|
|
|
Work:
|
|
|
|
- teach orphan reconciliation to inspect the persisted process-group ID, not only the direct PID
|
|
- if the direct parent is gone but the group still exists, mark the run as detached-orphaned with clearer metadata
|
|
- decide whether restart recovery should:
|
|
- adopt the still-running group, or
|
|
- terminate it as unrecoverable
|
|
|
|
Recommendation:
|
|
|
|
- for heartbeat runs, prefer terminating unrecoverable orphan groups rather than adopting them unless we can prove the adapter session remains safe and observable
|
|
|
|
Reason:
|
|
|
|
- runtime services are long-lived and adoptable
|
|
- heartbeat runs are task executions with stricter audit and cancellation semantics
|
|
|
|
### Phase 3: add operator-visible cleanup tools
|
|
|
|
Objective:
|
|
|
|
- make the system diagnosable when ownership still fails
|
|
|
|
Work:
|
|
|
|
- surface the tracked process metadata in run details or debug endpoints
|
|
- add a control-plane cleanup action or CLI utility for stale local run processes owned by Paperclip
|
|
- scope cleanup by run/agent/company instead of broad browser-name matching
|
|
|
|
This should replace ad hoc scripts as the general-purpose escape hatch.
|
|
|
|
### Phase 4: cover platform and regression cases
|
|
|
|
Objective:
|
|
|
|
- keep the fix from regressing and define platform behavior explicitly
|
|
|
|
Tests to add:
|
|
|
|
- unit tests around process-group-aware cancellation in adapter execution utilities
|
|
- heartbeat recovery tests for:
|
|
- surviving descendant tree after parent loss
|
|
- timeout cleanup
|
|
- cancellation cleanup
|
|
- platform-conditional behavior notes for Windows, where negative-PID group kill does not apply
|
|
|
|
## Recommended first implementation slice
|
|
|
|
The first shipping slice should be narrow:
|
|
|
|
1. introduce process-group ownership for local heartbeat-run adapters on POSIX
|
|
2. persist group metadata on `heartbeat_runs`
|
|
3. switch timeout/cancel paths from direct-child kill to group kill
|
|
4. add one regression test that proves descendants die with the parent run
|
|
|
|
That should address the main Chrome accumulation path without taking on the full restart-recovery design in the same patch.
|
|
|
|
## Risks
|
|
|
|
### 1. Over-killing unrelated processes
|
|
|
|
If process-group boundaries are created incorrectly, cleanup could terminate more than the run owns.
|
|
|
|
Mitigation:
|
|
|
|
- create a fresh process group only for the spawned adapter command
|
|
- persist and target that exact group
|
|
|
|
### 2. Cross-platform differences
|
|
|
|
Windows does not support the POSIX negative-PID kill pattern used elsewhere in the repo.
|
|
|
|
Mitigation:
|
|
|
|
- ship POSIX-first
|
|
- keep direct-child fallback on Windows
|
|
- document Windows as partial until job-object or equivalent handling is designed
|
|
|
|
### 3. Session recovery complexity
|
|
|
|
Adopting a still-running orphaned group may look attractive but can break observability if stdout/stderr pipes are already gone.
|
|
|
|
Mitigation:
|
|
|
|
- default to deterministic cleanup for heartbeat runs unless adoption is explicitly proven safe
|
|
|
|
## Recommendation
|
|
|
|
Treat this as a Paperclip executor ownership bug, not an `agent-browser` bug.
|
|
|
|
`agent-browser` should remain a useful repro case, but the implementation should be shared across all local child-process adapters so any descendant process tree spawned by Codex, Claude, Cursor, Gemini, Pi, or OpenCode is owned and cleaned up consistently.
|