paperclip/doc/plans/2026-04-06-smart-model-routing.md

# 2026-04-06 Smart Model Routing

Status: Proposed
Date: 2026-04-06
Audience: Product and engineering
Related:
- `doc/SPEC-implementation.md`
- `doc/PRODUCT.md`
- `doc/plans/2026-03-14-adapter-skill-sync-rollout.md`

## 1. Purpose

This document defines a V1 plan for "smart model routing" in Paperclip.

The goal is not to build a generic cross-provider router in the server. The goal is:

- let supported adapters use a cheaper model for lightweight heartbeat orchestration work
- keep the main task execution on the adapter's normal primary model
- preserve Paperclip's existing task, session, and audit invariants
- report cost and model usage truthfully when more than one model participates in a single heartbeat

The motivating use case is a local coding adapter where a cheap model can handle the first fast pass:

- read the wake context
- orient to the task and workspace
- leave an immediate progress comment when appropriate
- perform bounded lightweight triage

Then the primary model does the substantive work.

## 2. Hermes Findings

Hermes does have a real "smart model routing" feature, but it is narrower than the name suggests.

Observed behavior:

- `agent/smart_model_routing.py` implements a conservative classifier for "simple" turns
- the cheap path only triggers for short, single-line, non-code, non-URL, non-tool-heavy messages
- complexity is detected with hardcoded thresholds plus a keyword denylist like `debug`, `implement`, `test`, `plan`, `tool`, `docker`, and similar terms
- if the cheap route cannot be resolved, Hermes silently falls back to the primary model

Important architectural detail:

- Hermes applies this routing before constructing the agent for that turn
- the route is resolved in `cron/scheduler.py` and passed into agent creation as the active provider/model/runtime

More useful than the routing heuristic itself is Hermes' broader model-slot design:

- main conversational model
- fallback model for failover
- auxiliary model slots for side tasks like compression and classification

That separation is a better fit for Paperclip than copying Hermes' exact keyword heuristic.

## 3. Current Paperclip State

Paperclip already has the right execution shape for adapter-specific routing, but it currently assumes one model per heartbeat run.

Current implementation facts:

- `server/src/services/heartbeat.ts` builds rich run context, including `paperclipWake`, workspace metadata, and session handoff context
- each adapter receives a single resolved `config` object and executes once
- built-in local adapters read one `config.model` and pass it directly to the underlying CLI
- UI config today exposes one main `model` field plus adapter-specific thinking-effort controls
- cost accounting currently records one provider/model tuple per run via `AdapterExecutionResult`

What this means:

- there is no shared routing layer in the server today
- model choice already lives at the adapter boundary, which is good
- multi-model execution in a single heartbeat needs explicit contract work or cost reporting will become misleading

## 4. Product Decision

Paperclip should implement smart model routing as an adapter-local, opt-in execution pattern.

V1 decision:

1. Do not add a global server-side router that tries to understand every adapter.
2. Do not copy Hermes' prompt-keyword classifier as Paperclip's default routing policy.
3. Add an adapter-specific "cheap preflight" phase for supported adapters.
4. Keep the primary model as the canonical work model.
5. Persist only the primary session unless an adapter can prove that cross-model session resume is safe.

Rationale:

- Paperclip heartbeats are structured, issue-scoped, and already include wake metadata
- routing by execution phase is more reliable than routing by free-text prompt complexity
- session semantics differ by adapter, so resume behavior must stay adapter-owned

## 5. Proposed V1 Behavior

## 5.1 Config shape

Supported adapters should add an optional routing block to `adapterConfig`.

Proposed shape:

```ts
smartModelRouting?: {
  enabled: boolean;
  cheapModel: string;
  cheapThinkingEffort?: string;
  maxPreflightTurns?: number;
  allowInitialProgressComment?: boolean;
}
```

Notes:

- keep existing `model` as the primary model
- `cheapModel` is adapter-specific, not global
- adapters that cannot safely support this block simply ignore it

For adapters with provider-specific model fields later, the shape can expand to include provider/base-url overrides. V1 should start simple.

## 5.2 Routing policy

Supported adapters should run cheap preflight only when all are true:

- `smartModelRouting.enabled` is true
- `cheapModel` is configured
- the run is issue-scoped
- the adapter is starting a fresh session, not resuming a persisted one
- the run is expected to do real task work rather than just resume an existing thread

Supported adapters should skip cheap preflight when any are true:

- a persisted task session already exists
- the adapter cannot safely isolate preflight from the primary session
- the issue or wake type implies the task is already mid-flight and continuity matters more than first-response speed

This is intentionally phase-based, not text-heuristic-based.

## 5.3 Cheap preflight responsibilities

The cheap phase should be narrow and bounded.

Allowed responsibilities:

- ingest wake context and issue summary
- inspect the workspace at a shallow level
- leave a short "starting investigation" style comment when appropriate
- collect a compact handoff summary for the primary phase

Not allowed in V1:

- long tool loops
- risky file mutations
- being the canonical persisted task session
- deciding final completion without either explicit adapter support or a trivial success case

Implementation detail:

- the adapter should inject an explicit preflight prompt telling the model this is a bounded orchestration pass
- preflight should use a very small turn budget, for example 1-2 turns

## 5.4 Primary execution responsibilities

After preflight, the adapter launches the normal primary execution using the existing prompt and primary model.

The primary phase should receive:

- the normal Paperclip prompt
- any preflight-generated handoff summary
- normal workspace and wake context

The primary phase remains the source of truth for:

- persisted session state
- final task completion
- most file changes
- most cost

## 6. Required Contract Changes

The current `AdapterExecutionResult` is too narrow for truthful multi-model accounting.

Add an optional segmented execution report, for example:

```ts
executionSegments?: Array<{
  phase: "cheap_preflight" | "primary";
  provider?: string | null;
  biller?: string | null;
  model?: string | null;
  billingType?: AdapterBillingType | null;
  usage?: UsageSummary;
  costUsd?: number | null;
  summary?: string | null;
}>
```

V1 server behavior:

- if `executionSegments` is absent, keep current single-result behavior unchanged
- if present, write one `cost_events` row per segment that has cost or token usage
- store the segment array in run usage/result metadata for later UI inspection
- keep the existing top-level `provider` / `model` fields as a summary, preferably the primary phase when present

This avoids breaking existing adapters while giving routed adapters truthful reporting.

## 7. Adapter Rollout Plan

## 7.1 Phase 1: contract and server plumbing

Work:

1. Extend adapter result types with segmented execution metadata.
2. Update heartbeat cost recording to emit multiple cost events when segments are present.
3. Include segment summaries in run metadata for transcript/debug views.

Success criteria:

- existing adapters behave exactly as before
- a routed adapter can report cheap plus primary usage without collapsing them into one fake model

## 7.2 Phase 2: `codex_local`

Why first:

- Codex already has rich prompt/handoff handling
- the adapter already injects Paperclip skills and workspace metadata cleanly
- the current implementation already distinguishes bootstrap, wake delta, and handoff prompt sections

Implementation work:

1. Add config support for `smartModelRouting`.
2. Add a cheap-preflight prompt builder.
3. Run cheap preflight only on fresh sessions.
4. Pass a compact preflight handoff note into the primary prompt.
5. Report segmented usage and model metadata.

Important guardrail:

- do not resume the cheap-model session as the primary session in V1

## 7.3 Phase 3: `claude_local`

Implementation work is similar, but the session model-switch risk is even less attractive.

Same rule:

- cheap preflight is ephemeral
- primary Claude session remains canonical

## 7.4 Phase 4: other adapters

Candidates:

- `cursor`
- `gemini_local`
- `opencode_local`
- external plugin adapters through `createServerAdapter()`

These should come later because each runtime has different session and model-switch semantics.

## 8. UI and Config Changes

For supported built-in adapters, the agent config UI should expose:

- `model` as the primary model
- `smart model routing` toggle
- `cheap model`
- optional cheap thinking effort
- optional `allow initial progress comment` toggle

The run detail UI should also show when routing occurred, for example:

- cheap preflight model
- primary model
- token/cost split

This matters because Paperclip's board UI is supposed to make cost and behavior legible.

## 9. Why Not Copy Hermes Exactly

Hermes' cheap-route heuristic is useful precedent, but Paperclip should not start there.

Reasons:

- Hermes is optimizing free-form conversational turns
- Paperclip agents run structured, issue-scoped heartbeats with explicit task and workspace context
- Paperclip already knows whether a run is fresh vs resumed, issue-scoped vs approval follow-up, and what workspace/session exists
- those execution facts are stronger routing signals than prompt keyword matching

If Paperclip later wants a cheap-only completion path for trivial runs, that can be a second-stage feature built on observed run data, not the first implementation.

## 10. Risks

## 10.1 Duplicate or noisy comments

If the cheap phase posts an update and the primary phase posts another near-identical update, the issue thread gets worse.

Mitigation:

- keep cheap comments optional
- make the preflight prompt explicitly avoid repeating status if a useful comment was already posted

## 10.2 Misleading cost reporting

If we only record the primary model, the board loses visibility into the routing cost tradeoff.

Mitigation:

- add segmented execution reporting before shipping adapter behavior

## 10.3 Session corruption

Cross-model session reuse may fail or degrade context quality.

Mitigation:

- V1 does not persist or resume cheap preflight sessions

## 10.4 Cheap model overreach

A cheap model with full tools and permissions may do too much low-quality work.

Mitigation:

- hard cap preflight turns
- use an explicit orchestration-only prompt
- start with supported adapters where we can test the behavior well

## 11. Verification Plan

Required tests:

- adapter unit tests for route eligibility
- adapter unit tests for "fresh session -> cheap preflight + primary"
- adapter unit tests for "resumed session -> primary only"
- heartbeat tests for segmented cost-event creation
- UI tests for config save/load of cheap-model fields

Manual checks:

- create a fresh issue for a routed Codex or Claude agent
- verify the run metadata shows both phases
- verify only the primary session is persisted
- verify cost rows reflect both models
- verify the issue thread does not get duplicate kickoff comments

## 12. Recommended Sequence

1. Add segmented execution reporting to the adapter/server contract.
2. Implement `codex_local` cheap preflight.
3. Validate cost visibility and transcript UX.
4. Implement `claude_local` cheap preflight.
5. Decide later whether any adapters need Hermes-style text heuristics in addition to phase-based routing.

## 13. Recommendation

Paperclip should ship smart model routing as:

- adapter-specific
- opt-in
- phase-based
- session-safe
- cost-truthful

The right V1 is not "choose the cheapest model for simple prompts." The right V1 is "use a cheap model for bounded orchestration work on fresh runs, then hand off to the primary model for the real task."