
# Token Optimization Plan

Date: 2026-03-13  
Related discussion: https://github.com/paperclipai/paperclip/discussions/449

## Goal

Reduce token consumption materially without reducing agent capability, control-plane visibility, or task completion quality.

This plan is based on:

- the current V1 control-plane design
- the current adapter and heartbeat implementation
- the linked user discussion
- local runtime data from the default Paperclip instance on 2026-03-13

## Executive Summary

The discussion is directionally right about two things:

1. We should preserve session and prompt-cache locality more aggressively.
2. We should separate stable startup instructions from per-heartbeat dynamic context.

But that is not enough on its own.

After reviewing the code and local run data, the token problem appears to have four distinct causes:

1. **Measurement inflation on sessioned adapters.** Some token counters, especially for `codex_local`, appear to be recorded as cumulative session totals instead of per-heartbeat deltas.
2. **Avoidable session resets.** Task sessions are intentionally reset on timer wakes and manual wakes, which destroys cache locality for common heartbeat paths.
3. **Repeated context reacquisition.** The `paperclip` skill tells agents to re-fetch assignments, issue details, ancestors, and full comment threads on every heartbeat. The API does not currently offer efficient delta-oriented alternatives.
4. **Large static instruction surfaces.** Agent instruction files and globally injected skills are reintroduced at startup even when most of that content is unchanged and not needed for the current task.

The correct approach is:

1. fix telemetry so we can trust the numbers
2. preserve reuse where it is safe
3. make context retrieval incremental
4. add session compaction/rotation so long-lived sessions do not become progressively more expensive

## Validated Findings

### 1. Token telemetry is at least partly overstated today

Observed from the local default instance:

- `heartbeat_runs`: 11,360 runs between 2026-02-18 and 2026-03-13
- summed `usage_json.inputTokens`: `2,272,142,368,952`
- summed `usage_json.cachedInputTokens`: `2,217,501,559,420`

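Those totals are not credible as true per-heartbeat usage for the observed prompt sizes: they imply an average of roughly 200 million input tokens per run, with about 97.6% of that input reported as cached.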

Supporting evidence:

- `adapter.invoke.payload.prompt` averages were small:
  - `codex_local`: ~193 chars average, 6,067 chars max
  - `claude_local`: ~160 chars average, 1,160 chars max
- despite that, many `codex_local` runs report millions of input tokens
- one reused Codex session in local data spans 3,607 runs and recorded `inputTokens` growing up to `1,155,283,166`

Interpretation:

- for sessioned adapters, especially Codex, we are likely storing usage reported by the runtime as a **session total**, not a **per-run delta**
- this distorts trend reporting, makes optimization work hard to evaluate, and undermines customer trust

This does **not** mean there is no real token problem. It means we need a trustworthy baseline before we can judge optimization impact.

### 2. Timer wakes currently throw away reusable task sessions

In `server/src/services/heartbeat.ts`, `shouldResetTaskSessionForWake(...)` returns `true` for:

- `wakeReason === "issue_assigned"`
- `wakeSource === "timer"`
- manual on-demand wakes

That means many normal heartbeats skip saved task-session resume even when the workspace is stable.

Local data supports the impact:

- `timer/system` runs: 6,587 total (about 58% of the 11,360 runs above)
- only 976 (about 14.8%) had a previous session
- only 963 (about 14.6%) ended with the same session

So timer wakes are the largest heartbeat path, and roughly 85% of them do not resume prior task state.

### 3. We repeatedly ask agents to reload the same task context

The `paperclip` skill currently tells agents to do this on essentially every heartbeat:

- fetch assignments
- fetch issue details
- fetch ancestor chain
- fetch full issue comments

Current API shape reinforces that pattern:

- `GET /api/issues/:id/comments` returns the full thread
- there is no `since`, cursor, digest, or summary endpoint for heartbeat consumption
- `GET /api/issues/:id` returns full enriched issue context, not a minimal delta payload

This is safe but expensive. It forces the model to repeatedly consume unchanged information.

### 4. Static instruction payloads are not separated cleanly from dynamic heartbeat prompts

The user discussion suggested a bootstrap prompt. That is the right direction.

Current state:

- the UI exposes `bootstrapPromptTemplate`
- adapter execution paths do not currently use it
- several adapters prepend `instructionsFilePath` content directly into the per-run prompt or system prompt

Result:

- stable instructions are re-sent or re-applied in the same path as dynamic heartbeat content
- we are not deliberately optimizing for provider prompt caching

### 5. We inject more skill surface than most agents need

Local adapters inject repo skills into runtime skill directories.

Current repo skill sizes:

- `skills/paperclip/SKILL.md`: 17,441 bytes
- `.agents/skills/create-agent-adapter/SKILL.md`: 31,832 bytes
- `skills/paperclip-create-agent/SKILL.md`: 4,718 bytes
- `skills/para-memory-files/SKILL.md`: 3,978 bytes

That is 57,969 bytes (nearly 58 KB) of skill markdown before any company-specific instructions.

Not all of that is necessarily loaded into model context every run, but it increases startup surface area and should be treated as a token budget concern.

## Principles

We should optimize tokens under these rules:

1. **Do not lose functionality.** Agents must still be able to resume work safely, understand why tasks exist, and act within governance rules.
2. **Prefer stable context over repeated context.** Unchanged instructions should not be resent through the most expensive path.
3. **Prefer deltas over full reloads.** Heartbeats should consume only what changed since the last useful run.
4. **Measure normalized deltas, not raw adapter claims.** Especially for sessioned CLIs.
5. **Keep escape hatches.** Board/manual runs may still want a forced fresh session.

## Plan

## Phase 1: Make token telemetry trustworthy

This should happen first.

### Changes

- Store both:
  - raw adapter-reported usage
  - Paperclip-normalized per-run usage
- For sessioned adapters, compute normalized deltas against prior usage for the same persisted session.
- Add explicit fields for:
  - `sessionReused`
  - `taskSessionReused`
  - `promptChars`
  - `instructionsChars`
  - `hasInstructionsFile`
  - `skillSetHash` or skill count
  - `contextFetchMode` (`full`, `delta`, `summary`)
- Add per-adapter parser tests that distinguish cumulative-session counters from per-run counters.
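
The delta computation for sessioned adapters could look roughly like the sketch below. It is illustrative only: `UsageTotals`, `UsageKind`, and `normalizeUsage` are hypothetical names, not existing Paperclip code, and only the `inputTokens` and `cachedInputTokens` fields are taken from `usage_json` as quoted in this plan. The fallback for a counter that moves backwards is an assumption about how a runtime might reset its totals.

```typescript
// Hypothetical types: only `inputTokens` and `cachedInputTokens` appear in this plan.
interface UsageTotals {
  inputTokens: number;
  cachedInputTokens: number;
}

// Whether an adapter reports counters per run or as running session totals.
type UsageKind = "per_run" | "session_cumulative";

/**
 * Convert adapter-reported usage into per-run usage.
 * `previousRaw` is the raw usage stored for the prior run of the same
 * persisted session, or null when this is the first run of the session.
 */
function normalizeUsage(
  raw: UsageTotals,
  kind: UsageKind,
  previousRaw: UsageTotals | null,
): UsageTotals {
  // Per-run counters and the first run of a session need no adjustment.
  if (kind === "per_run" || previousRaw === null) {
    return { ...raw };
  }
  // Assumption: a counter that went backwards means the runtime reset its
  // totals, so the raw value is already the usage for this run.
  const delta = (current: number, before: number): number =>
    current >= before ? current - before : current;
  return {
    inputTokens: delta(raw.inputTokens, previousRaw.inputTokens),
    cachedInputTokens: delta(raw.cachedInputTokens, previousRaw.cachedInputTokens),
  };
}

// Example: two consecutive runs on one reused session (made-up numbers).
const firstRunRaw: UsageTotals = { inputTokens: 12000, cachedInputTokens: 0 };
const secondRunRaw: UsageTotals = { inputTokens: 27500, cachedInputTokens: 11000 };
console.log(normalizeUsage(secondRunRaw, "session_cumulative", firstRunRaw));
```

With these made-up inputs the second run is recorded as 15,500 input tokens (11,000 of them cached) instead of the cumulative 27,500. Storing the raw values alongside the normalized ones keeps the computation auditable.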

### Why

Without this, we cannot tell whether a reduction came from a real optimization or a reporting artifact.

### Success criteria

- per-run token totals stop exploding on long-lived sessions
- a resumed session’s cumulative usage grows monotonically and plausibly, while each run records only its own delta rather than the running session total
- cost pages can show both raw and normalized numbers while we migrate

## Phase 2: Preserve safe session reuse by default

This is the highest-leverage behavior change.

### Changes

- Stop resetting task sessions on ordinary timer wakes.
- Keep resetting on:
  - explicit manual “fresh run” invocations
  - assignment changes
  - workspace mismatch
  - model mismatch / invalid resume errors
- Add an explicit wake flag like `forceFreshSession: true` when the board wants a reset.
- Record why a session was reused or reset in run metadata.
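
One possible shape for the revised reset decision is sketched below. It is a sketch under assumptions: `WakeInput`, `SessionDecision`, and `decideTaskSession` are hypothetical, and the real arguments of `shouldResetTaskSessionForWake(...)` may look different. Only `wakeSource === "timer"`, `wakeReason === "issue_assigned"`, and the proposed `forceFreshSession` flag come from this plan; the reason strings are placeholders.

```typescript
// Hypothetical input shape; the real arguments of
// shouldResetTaskSessionForWake(...) in heartbeat.ts may differ.
interface WakeInput {
  wakeSource: string; // e.g. "timer"
  wakeReason?: string; // e.g. "issue_assigned"
  forceFreshSession?: boolean; // proposed explicit flag for board/manual resets
  workspaceMatches: boolean; // saved session workspace equals the current one
  modelMatches: boolean; // saved session model equals the configured one
}

interface SessionDecision {
  reset: boolean;
  reason: string; // intended to be recorded in run metadata
}

function decideTaskSession(wake: WakeInput): SessionDecision {
  if (wake.forceFreshSession === true) {
    return { reset: true, reason: "force_fresh_session" };
  }
  if (wake.wakeReason === "issue_assigned") {
    return { reset: true, reason: "assignment_changed" };
  }
  if (!wake.workspaceMatches) {
    return { reset: true, reason: "workspace_mismatch" };
  }
  if (!wake.modelMatches) {
    return { reset: true, reason: "model_mismatch" };
  }
  // Timer wakes and other ordinary wakes fall through to reuse.
  return { reset: false, reason: "reuse_default" };
}

// An ordinary timer wake on a stable workspace keeps its session.
console.log(
  decideTaskSession({ wakeSource: "timer", workspaceMatches: true, modelMatches: true }),
);
```

Invalid-resume errors are not covered here because they only surface when the adapter attempts the resume; that path would need its own fallback to a fresh session. Returning a reason with every decision is what makes the "record why" item above cheap to implement.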

### Why

Timer wakes are the dominant heartbeat path. Resetting them destroys both session continuity and prompt cache reuse.

### Success criteria

- timer wakes resume the prior task session in the large majority of stable-workspace cases
- no increase in stale-session failures
- lower normalized input tokens per timer heartbeat

## Phase 3: Separate static bootstrap context from per-heartbeat context

This is the right version of the discussion’s bootstrap idea.

### Changes

- Implement `bootstrapPromptTemplate` in adapter execution paths.
- Use it only when starting a fresh session, not on resumed sessions.
- Keep `promptTemplate` intentionally small and stable:
  - who I am
  - what triggered this wake
  - which task/comment/approval to prioritize
- Move long-lived setup text out of recurring per-run prompts where possible.
- Add UI guidance and warnings when `promptTemplate` contains high-churn or large inline content.
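
The split between bootstrap and per-heartbeat text can be sketched as follows. This is a minimal illustration, not the adapter implementation: `PromptConfig`, `renderTemplate`, and `buildRunPrompt` are hypothetical, and the `{{name}}` placeholder syntax is assumed purely for the example. Only the `bootstrapPromptTemplate` and `promptTemplate` field names come from this plan.

```typescript
// Hypothetical config shape; only the two template field names come from this plan.
interface PromptConfig {
  bootstrapPromptTemplate?: string;
  promptTemplate: string;
}

// Assumed `{{name}}` placeholder syntax, used here only for illustration.
// Unknown placeholders are left untouched.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value === undefined ? match : value;
  });
}

function buildRunPrompt(
  config: PromptConfig,
  vars: Record<string, string>,
  isFreshSession: boolean,
): string {
  const sections: string[] = [];
  // Bootstrap text is sent only when a session starts, never on resume.
  if (isFreshSession && config.bootstrapPromptTemplate) {
    sections.push(renderTemplate(config.bootstrapPromptTemplate, vars));
  }
  // The per-heartbeat template stays small: identity, trigger, priority.
  sections.push(renderTemplate(config.promptTemplate, vars));
  return sections.join("\n\n");
}

const exampleConfig: PromptConfig = {
  bootstrapPromptTemplate: "You are {{agentName}}. Follow the governance rules.",
  promptTemplate: "Wake: {{wakeReason}}. Prioritize {{issueIdentifier}}.",
};
const exampleVars = { agentName: "Builder", wakeReason: "timer", issueIdentifier: "PAP-12" };

console.log(buildRunPrompt(exampleConfig, exampleVars, true)); // fresh session
console.log(buildRunPrompt(exampleConfig, exampleVars, false)); // resumed session
```

On a resumed session only the short wake line is sent, which keeps the resumed prompt structurally stable from one heartbeat to the next.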

### Why

Static instructions and dynamic wake context have different cache behavior and should be modeled separately.

### Success criteria

- fresh-session prompts can remain richer without inflating every resumed heartbeat
- resumed prompts become short and structurally stable
- cache hit rates improve for session-preserving adapters

## Phase 4: Make issue/task context incremental

This is the biggest product change and likely the biggest real token saver after session reuse.

### Changes

Add heartbeat-oriented endpoints and skill behavior:

- `GET /api/agents/me/inbox-lite`
  - minimal assignment list
  - issue id, identifier, status, priority, updatedAt, lastExternalCommentAt
- `GET /api/issues/:id/heartbeat-context`
  - compact issue state
  - parent-chain summary
  - latest execution summary
  - change markers
- `GET /api/issues/:id/comments?after=<cursor>` or `?since=<timestamp>`
  - return only new comments
- optional `GET /api/issues/:id/context-digest`
  - server-generated compact summary for heartbeat use
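
To illustrate how a compact inbox could let an agent skip unchanged work, the sketch below filters inbox items by change time. The field names follow the `inbox-lite` list above, but the types, the ISO-8601 timestamp format, and the `itemsChangedSince` helper are assumptions rather than an agreed API contract.

```typescript
// Hypothetical response item for the proposed inbox-lite endpoint.
// Field names follow the list above; types and timestamp format are assumed.
interface InboxLiteItem {
  id: string;
  identifier: string;
  status: string;
  priority: string;
  updatedAt: string; // ISO-8601, assumed
  lastExternalCommentAt: string | null; // ISO-8601 or null, assumed
}

/**
 * Return only the items that changed after the last heartbeat.
 * With no recorded heartbeat, everything counts as changed.
 */
function itemsChangedSince(
  items: InboxLiteItem[],
  lastHeartbeatAt: string | null,
): InboxLiteItem[] {
  if (lastHeartbeatAt === null) {
    return items.slice();
  }
  const cutoff = Date.parse(lastHeartbeatAt);
  return items.filter((item) => {
    const updated = Date.parse(item.updatedAt);
    const commented =
      item.lastExternalCommentAt === null
        ? Number.NEGATIVE_INFINITY
        : Date.parse(item.lastExternalCommentAt);
    return Math.max(updated, commented) > cutoff;
  });
}

// Made-up inbox: only the second issue has activity after 10:00.
const exampleInbox: InboxLiteItem[] = [
  { id: "a", identifier: "PAP-1", status: "in_progress", priority: "high",
    updatedAt: "2026-03-13T09:00:00Z", lastExternalCommentAt: null },
  { id: "b", identifier: "PAP-2", status: "blocked", priority: "medium",
    updatedAt: "2026-03-13T08:00:00Z", lastExternalCommentAt: "2026-03-13T10:30:00Z" },
];
console.log(itemsChangedSince(exampleInbox, "2026-03-13T10:00:00Z").map((i) => i.identifier));
```

The point of the small payload is that this comparison can run before any issue detail or comment thread is fetched.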

Update the `paperclip` skill so the default pattern becomes:

1. fetch compact inbox
2. fetch compact task context
3. fetch only new comments unless this is the first read, a mention-triggered wake, or a cache miss
4. fetch full thread only on demand
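
Step 3 of this pattern can be sketched as a small decision function. The `after` query parameter is the one proposed above and does not exist yet; `CommentReadState`, `CommentFetchPlan`, and `planCommentFetch` are hypothetical names, and treating a missing read state as a cache miss is an assumption.

```typescript
// Mirrors two of the proposed `contextFetchMode` values.
type CommentFetchMode = "full" | "delta";

// Hypothetical per-issue state an agent or the server keeps between heartbeats.
interface CommentReadState {
  lastSeenCursor: string | null; // null means the thread was never read
}

interface CommentFetchPlan {
  mode: CommentFetchMode;
  path: string;
}

/**
 * Decide whether to request the full thread or only new comments.
 * `state === null` is treated as a cache miss (assumption).
 */
function planCommentFetch(
  issueId: string,
  state: CommentReadState | null,
  mentionTriggered: boolean,
): CommentFetchPlan {
  const base = `/api/issues/${encodeURIComponent(issueId)}/comments`;
  const cursor = state === null ? null : state.lastSeenCursor;
  // First read, cache miss, or mention-triggered wake: load the full thread.
  if (cursor === null || mentionTriggered) {
    return { mode: "full", path: base };
  }
  // Otherwise ask only for comments after the last one we saw.
  return { mode: "delta", path: `${base}?after=${encodeURIComponent(cursor)}` };
}

console.log(planCommentFetch("iss_1", { lastSeenCursor: "c_42" }, false));
console.log(planCommentFetch("iss_1", null, false));
```

The returned `mode` is the kind of value that could feed the `contextFetchMode` telemetry field from Phase 1, so the share of full-thread reloads becomes directly measurable.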

### Why

Today we are using full-fidelity board APIs as heartbeat APIs. That is convenient but token-inefficient.

### Success criteria

- after first task acquisition, most heartbeats consume only deltas
- repeated blocked-task or long-thread work no longer replays the whole comment history
- mention-triggered wakes still have enough context to respond correctly

## Phase 5: Add session compaction and controlled rotation

This protects against long-lived session bloat.

### Changes

- Add rotation thresholds per adapter/session:
  - turns
  - normalized input tokens
  - age
  - cache hit degradation
- Before rotating, produce a structured carry-forward summary:
  - current objective
  - work completed
  - open decisions
  - blockers
  - files/artifacts touched
  - next recommended action
- Persist that summary in task session state or runtime state.
- Start the next session with:
  - bootstrap prompt
  - compact carry-forward summary
  - current wake trigger
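
A rough sketch of the rotation check and the carry-forward handoff follows. All names here (`SessionStats`, `RotationLimits`, `rotationReasons`, `CarryForwardSummary`, `buildRotatedSessionPrompt`) are hypothetical, and the threshold values in the example are arbitrary placeholders, not recommendations. The summary fields mirror the list above.

```typescript
interface SessionStats {
  turns: number;
  normalizedInputTokens: number;
  ageHours: number;
  cacheHitRatio: number; // 0..1
}

interface RotationLimits {
  maxTurns: number;
  maxNormalizedInputTokens: number;
  maxAgeHours: number;
  minCacheHitRatio: number; // rotate when the ratio drops below this
}

// Return every threshold the session has crossed; an empty list means "keep".
function rotationReasons(stats: SessionStats, limits: RotationLimits): string[] {
  const reasons: string[] = [];
  if (stats.turns >= limits.maxTurns) reasons.push("turns");
  if (stats.normalizedInputTokens >= limits.maxNormalizedInputTokens) reasons.push("input_tokens");
  if (stats.ageHours >= limits.maxAgeHours) reasons.push("age");
  if (stats.cacheHitRatio < limits.minCacheHitRatio) reasons.push("cache_hit_degradation");
  return reasons;
}

interface CarryForwardSummary {
  currentObjective: string;
  workCompleted: string[];
  openDecisions: string[];
  blockers: string[];
  artifactsTouched: string[];
  nextRecommendedAction: string;
}

// Assemble the first prompt of the replacement session:
// bootstrap prompt, compact carry-forward summary, current wake trigger.
function buildRotatedSessionPrompt(
  bootstrap: string,
  summary: CarryForwardSummary,
  wakeTrigger: string,
): string {
  const list = (title: string, entries: string[]): string =>
    `${title}: ${entries.length > 0 ? entries.join("; ") : "none"}`;
  return [
    bootstrap,
    "Carry-forward summary:",
    `Objective: ${summary.currentObjective}`,
    list("Completed", summary.workCompleted),
    list("Open decisions", summary.openDecisions),
    list("Blockers", summary.blockers),
    list("Artifacts touched", summary.artifactsTouched),
    `Next action: ${summary.nextRecommendedAction}`,
    `Wake trigger: ${wakeTrigger}`,
  ].join("\n");
}

// Placeholder limits for illustration only.
const exampleLimits: RotationLimits = {
  maxTurns: 200, maxNormalizedInputTokens: 2000000, maxAgeHours: 72, minCacheHitRatio: 0.5,
};
console.log(rotationReasons(
  { turns: 250, normalizedInputTokens: 100000, ageHours: 5, cacheHitRatio: 0.9 },
  exampleLimits,
));
```

Returning the list of crossed thresholds, rather than a bare boolean, lets run metadata record why a rotation happened.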

### Why

Even when reuse is desirable, some sessions become too expensive to keep alive indefinitely.

### Success criteria

- very long sessions stop growing without bound
- rotating a session does not cause loss of task continuity
- successful task completion rate stays flat or improves

## Phase 6: Reduce unnecessary skill surface

### Changes

- Move from “inject all repo skills” to an allowlist per agent or per adapter.
- Default local runtime skill set should likely be:
  - `paperclip`
- Add opt-in skills for specialized agents:
  - `paperclip-create-agent`
  - `para-memory-files`
  - `create-agent-adapter`
- Expose active skill set in agent config and run metadata.
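
The allowlist resolution could be as simple as the sketch below. The skill names and byte sizes are the ones listed under Validated Findings; `resolveSkills`, `ResolvedSkills`, and the idea of reporting unknown names separately are hypothetical.

```typescript
// Skill sizes in bytes, taken from the repo skill list earlier in this plan.
const SKILL_BYTES: { [name: string]: number } = {
  "paperclip": 17441,
  "create-agent-adapter": 31832,
  "paperclip-create-agent": 4718,
  "para-memory-files": 3978,
};

// Proposed default: only the core skill is injected.
const DEFAULT_SKILLS: readonly string[] = ["paperclip"];

interface ResolvedSkills {
  active: string[]; // skills that will be injected
  unknown: string[]; // requested names that do not exist
  totalBytes: number; // markdown size of the active set
}

function isKnownSkill(name: string): boolean {
  return Object.prototype.hasOwnProperty.call(SKILL_BYTES, name);
}

function resolveSkills(optIn: readonly string[]): ResolvedSkills {
  // Defaults first, then opt-ins, without duplicates.
  const requested: string[] = [];
  for (const name of DEFAULT_SKILLS.concat(optIn)) {
    if (requested.indexOf(name) === -1) requested.push(name);
  }
  const active = requested.filter(isKnownSkill);
  const unknown = requested.filter((name) => !isKnownSkill(name));
  const totalBytes = active.reduce((sum, name) => sum + (SKILL_BYTES[name] ?? 0), 0);
  return { active, unknown, totalBytes };
}

console.log(resolveSkills([])); // default agent
console.log(resolveSkills(["create-agent-adapter"])); // adapter-authoring specialist
```

With the sizes above, a default agent would carry 17,441 bytes of skill markdown instead of the full 57,969. The `active` list is what would be exposed in agent config and run metadata.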

### Why

Most agents do not need adapter-authoring or memory-system skills on every run.

### Success criteria

- smaller startup instruction surface
- no loss of capability for specialist agents that explicitly need extra skills

## Rollout Order

Recommended order:

1. telemetry normalization
2. timer-wake session reuse
3. bootstrap prompt implementation
4. heartbeat delta APIs + `paperclip` skill rewrite
5. session compaction/rotation
6. skill allowlists

## Acceptance Metrics

We should treat this plan as successful only if we improve both efficiency and task outcomes.

Primary metrics:

- normalized input tokens per successful heartbeat
- normalized input tokens per completed issue
- cache-hit ratio for sessioned adapters
- session reuse rate by invocation source
- fraction of heartbeats that fetch full comment threads

Guardrail metrics:

- task completion rate
- blocked-task rate
- stale-session failure rate
- manual intervention rate
- issue reopen rate after agent completion

Initial targets:

- 30% to 50% reduction in normalized input tokens per successful resumed heartbeat
- 80%+ session reuse on stable timer wakes
- 80%+ reduction in full-thread comment reloads after first task read
- no statistically significant regression in completion rate or failure rate
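
As a rough illustration of how the first three targets could be checked against a baseline, consider the sketch below. `HeartbeatMetrics`, `relativeReduction`, and `evaluateTargets` are hypothetical, the metric values in the example are made up, and the statistical guardrail in the last target is deliberately left out because it needs a proper significance test rather than a threshold.

```typescript
// Hypothetical snapshot of the primary metrics for one measurement window.
interface HeartbeatMetrics {
  normalizedInputTokensPerResumedHeartbeat: number;
  timerSessionReuseRate: number; // 0..1, stable timer wakes that reused a session
  fullThreadReloadRate: number; // 0..1, heartbeats after first read that fetched the full thread
}

// Fraction by which `current` is lower than `baseline`.
function relativeReduction(baseline: number, current: number): number {
  return baseline === 0 ? 0 : (baseline - current) / baseline;
}

interface TargetReport {
  tokenReductionMet: boolean; // at least 30% fewer normalized input tokens
  timerReuseMet: boolean; // at least 80% session reuse on timer wakes
  fullThreadReductionMet: boolean; // at least 80% fewer full-thread reloads
}

function evaluateTargets(baseline: HeartbeatMetrics, current: HeartbeatMetrics): TargetReport {
  return {
    tokenReductionMet:
      relativeReduction(
        baseline.normalizedInputTokensPerResumedHeartbeat,
        current.normalizedInputTokensPerResumedHeartbeat,
      ) >= 0.3,
    timerReuseMet: current.timerSessionReuseRate >= 0.8,
    fullThreadReductionMet:
      relativeReduction(baseline.fullThreadReloadRate, current.fullThreadReloadRate) >= 0.8,
  };
}

// Made-up numbers for illustration only.
const exampleBaseline: HeartbeatMetrics = {
  normalizedInputTokensPerResumedHeartbeat: 40000,
  timerSessionReuseRate: 0.15,
  fullThreadReloadRate: 1.0,
};
const exampleCurrent: HeartbeatMetrics = {
  normalizedInputTokensPerResumedHeartbeat: 26000,
  timerSessionReuseRate: 0.85,
  fullThreadReloadRate: 0.15,
};
console.log(evaluateTargets(exampleBaseline, exampleCurrent));
```

A check like this is only meaningful once Phase 1 has landed, since it depends on normalized rather than raw token counts.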

## Concrete Engineering Tasks

1. Add normalized usage fields and migration support for run analytics.
2. Patch sessioned adapter accounting to compute deltas from prior session totals.
3. Change `shouldResetTaskSessionForWake(...)` so timer wakes do not reset by default.
4. Implement `bootstrapPromptTemplate` end-to-end in adapter execution.
5. Add compact heartbeat context and incremental comment APIs.
6. Rewrite `skills/paperclip/SKILL.md` around delta-fetch behavior.
7. Add session rotation with carry-forward summaries.
8. Replace global skill injection with explicit allowlists.

## Recommendation

Treat this as a two-track effort:

- **Track A: correctness and no-regret wins**
  - telemetry normalization
  - timer-wake session reuse
  - bootstrap prompt implementation
- **Track B: structural token reduction**
  - delta APIs
  - skill rewrite
  - session compaction
  - skill allowlists

If we only do Track A, we will improve things, but agents will still re-read too much unchanged task context.

If we only do Track B without fixing telemetry first, we will not be able to prove the gains cleanly.