[codex] Document terminal bench dispatch config (#4961)

## Thinking Path

> - Paperclip agents rely on skills for repeatable operating procedures
> - The Terminal-Bench loop skill needs to preserve enough dispatch configuration to reproduce real heartbeat behavior
> - A bare benchmark command can create unassigned work with no heartbeat-enabled agent, which is a harness setup failure rather than product evidence
> - The Paperclip heartbeat skill also needs to keep escalation biased toward agent-owned follow-through
> - This pull request documents dispatch runner config requirements and strengthens the agent follow-through rule
> - The benefit is fewer misleading benchmark loops and clearer agent operating guidance

## What Changed

- Documented `PAPERCLIP_HARBOR_RUNNER_CONFIG` / runner dispatch config as required Terminal-Bench loop input.
- Updated the Terminal-Bench loop smoke check to require the dispatch config mention.
- Added stronger Paperclip skill guidance to avoid asking humans for work an agent can perform.

## Verification

- `pnpm smoke:terminal-bench-loop-skill`

## Risks

- Low risk: documentation and smoke expectation changes only. The stricter smoke assertion is intentional so future edits do not drop the dispatch config requirement.

> For core feature work, check [`ROADMAP.md`](ROADMAP.md) first and discuss it in `#dev` before opening the PR. Feature PRs that overlap with planned core work may need to be redirected; check the roadmap first. See `CONTRIBUTING.md`.

## Model Used

- OpenAI Codex, GPT-5 coding agent, tool use and local command execution. Exact context window was not exposed in the runtime.
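The thinking path notes that a bare benchmark command with no heartbeat-enabled agent is a harness setup failure, not product evidence. The sketch below shows how a pre-flight check could make that distinction explicit. It is not the skill's actual logic. The function name `classifyDispatchPreflight` and its input shape are hypothetical, and the sketch assumes the dispatch config arrives through the `PAPERCLIP_HARBOR_RUNNER_CONFIG` environment variable. It also assumes the caller already knows how many heartbeat-enabled agents exist.

```javascript
// Hypothetical pre-flight check; the name and the input/output shapes are illustrative only.
// Assumption: dispatch config is supplied via the PAPERCLIP_HARBOR_RUNNER_CONFIG env var.
function classifyDispatchPreflight({ env, heartbeatEnabledAgents }) {
  const reasons = [];

  // Without dispatch config, a bare benchmark command cannot reproduce heartbeat behavior.
  const runnerConfig = env.PAPERCLIP_HARBOR_RUNNER_CONFIG;
  if (typeof runnerConfig !== "string" || runnerConfig.trim() === "") {
    reasons.push("missing PAPERCLIP_HARBOR_RUNNER_CONFIG");
  }

  // Work created while no heartbeat-enabled agent exists stays unassigned.
  if (!Number.isInteger(heartbeatEnabledAgents) || heartbeatEnabledAgents < 1) {
    reasons.push("no heartbeat-enabled agents");
  }

  // Any reason above marks the run as a harness/setup failure rather than product evidence.
  return reasons.length > 0
    ? { ok: false, kind: "harness_setup_failure", reasons }
    : { ok: true, kind: "ready", reasons };
}

// Usage: a bare command with no config and no agents is classified as a setup failure.
const bare = classifyDispatchPreflight({ env: {}, heartbeatEnabledAgents: 0 });
console.log(bare.kind, bare.reasons.join("; "));
```

The check collects every reason instead of stopping at the first one. A diagnosis note can then list the missing config and the missing agents together.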
## Checklist

- [x] I have included a thinking path that traces from project context to this change
- [x] I have specified the model used (with version and capability details)
- [x] I have checked ROADMAP.md and confirmed this PR does not duplicate planned core work
- [x] I have run tests locally and they pass
- [x] I have added or updated tests where applicable
- [x] If this change affects the UI, I have included before/after screenshots
- [x] I have updated relevant documentation to reflect my changes
- [x] I have considered and documented any risks above
- [x] I will address all Greptile and reviewer comments before requesting merge

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
2026-06-14 01:50:39 +09:00 · 2026-05-01 12:00:47 -05:00 · 2026-05-01 12:00:47 -05:00 · 685ee84e4a
commit 685ee84e4a
parent d7719423e9
3 changed files with 18 additions and 2 deletions
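The smoke changes in the following diff do two things. They require the skill markdown to mention `PAPERCLIP_HARBOR_RUNNER_CONFIG`, and they require the mocked run doc to record the dispatch config even during the no-dispatch smoke. The sketch below shows that record-and-verify pattern. `renderDispatchLines` and `missingMentions` are hypothetical helpers, not code from the script. The `NAME=value` format for a provided config is also an assumption, because the smoke only shows the omitted case. In the omitted case the line still names the variable, so a substring check on the run doc passes without a real dispatch.

```javascript
// Hypothetical helpers mirroring the smoke script's checks; not the script's actual code.
const DISPATCH_VAR = "PAPERCLIP_HARBOR_RUNNER_CONFIG";

// Render the run-doc lines that record dispatch setup.
// Assumption: a provided config is recorded as NAME=value.
function renderDispatchLines({ runnerConfig, heartbeatEnabledAgents }) {
  if (!runnerConfig) {
    // Same wording the smoke writes for its no-dispatch run.
    return [
      `- Dispatch config: ${DISPATCH_VAR}=<omitted - harness/setup no-dispatch smoke>`,
      `- Heartbeat-enabled agents: ${heartbeatEnabledAgents} (harness/setup no-dispatch; not a product signal)`,
    ];
  }
  return [
    `- Dispatch config: ${DISPATCH_VAR}=${runnerConfig}`,
    `- Heartbeat-enabled agents: ${heartbeatEnabledAgents}`,
  ];
}

// Return every expected token that the text does not mention.
function missingMentions(text, expectedTokens) {
  return expectedTokens.filter((token) => !text.includes(token));
}

// Usage: the no-dispatch run doc still names the variable, so the check passes.
const runDoc = renderDispatchLines({ runnerConfig: undefined, heartbeatEnabledAgents: 0 }).join("\n");
const missing = missingMentions(runDoc, [DISPATCH_VAR, "Heartbeat-enabled agents"]);
if (missing.length > 0) {
  throw new Error(`Run doc is missing: ${missing.join(", ")}`);
}
```

The check reports every missing token at once. The smoke script instead asserts each expected token inside a loop.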
--- a/scripts/smoke/terminal-bench-loop-skill-smoke.mjs
+++ b/scripts/smoke/terminal-bench-loop-skill-smoke.mjs
@@ -91,6 +91,7 @@ async function assertLocalSkillPackage() {
    "diagnosis",
    "blockedByIssueIds",
    "PAPERCLIPAI_CMD",
+    "PAPERCLIP_HARBOR_RUNNER_CONFIG",
  ]) {
    assert(markdown.includes(expected), `Skill smoke expected ${skillPath} to mention ${expected}`);
  }
@@ -194,6 +195,8 @@ async function main() {
        `- Manifest: ${artifactRoot}/manifest.json`,
        `- Results JSONL: ${artifactRoot}/results.jsonl`,
        `- Harbor raw job folder: ${artifactRoot}/harbor/raw-job`,
+        "- Dispatch config: PAPERCLIP_HARBOR_RUNNER_CONFIG=<omitted - harness/setup no-dispatch smoke>",
+        "- Heartbeat-enabled agents: 0 (harness/setup no-dispatch; not a product signal)",
        "",
        "No benchmark process, Harbor job, model call, or provider call was started.",
      ].join("\n"),
@@ -309,6 +312,7 @@ async function main() {
    `Expected iteration issue to be in_review, got ${verifiedIteration.status}`,
  );
  assert(verifiedRunDoc.body.includes(`${artifactRoot}/results.jsonl`), "Expected run doc to include mocked results path");
+  assert(verifiedRunDoc.body.includes("PAPERCLIP_HARBOR_RUNNER_CONFIG"), "Expected run doc to record dispatch config");
  assert(
    verifiedDiagnosisDoc.body.includes("Exact stop point") && verifiedDiagnosisDoc.body.includes("Next-action owner"),
    "Expected diagnosis doc to include exact stop point and next-action owner",