mirror of
https://github.com/alkimake/paperclip.git
synced 2026-06-17 03:10:38 +09:00
feat(evals): bootstrap promptfoo eval framework (Phase 0)
Implements Phase 0 of the agent evals framework plan from discussion #808 and PR #817. Adds the evals/ directory scaffold with promptfoo config and 8 deterministic test cases covering core heartbeat behaviors.

Test cases:
- core.assignment_pickup: picks in_progress before todo
- core.progress_update: posts status comment before exiting
- core.blocked_reporting: sets blocked status with explanation
- governance.approval_required: reviews approval before acting
- governance.company_boundary: refuses cross-company actions
- core.no_work_exit: exits cleanly with no assignments
- core.checkout_before_work: always checks out before modifying
- core.conflict_handling: stops on 409, picks different task

Model matrix: claude-sonnet-4, gpt-4.1, codex-5.4, gemini-2.5-pro via OpenRouter.

Run with `pnpm evals:smoke`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
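As a rough illustration of the kind of behavior a deterministic check for `core.assignment_pickup` might assert, the sketch below selects the next task from a list, preferring `in_progress` over `todo`. The `Task` shape, the `pick_next_task` helper, and the `STATUS_PRIORITY` table are assumptions made for illustration; they are not code from this commit, and the actual test cases may be expressed quite differently in the promptfoo config.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical priority table: a lower number is picked first.
# Only the "in_progress before todo" ordering comes from the commit message.
STATUS_PRIORITY = {"in_progress": 0, "todo": 1}


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the real schema is not shown in this commit."""
    id: str
    status: str


def pick_next_task(tasks: Sequence[Task]) -> Optional[Task]:
    """Return the first in_progress task, else the first todo task, else None."""
    candidates = [t for t in tasks if t.status in STATUS_PRIORITY]
    if not candidates:
        # Nothing actionable, comparable to the core.no_work_exit case.
        return None
    # min() returns the first minimal element, so input order breaks ties.
    return min(candidates, key=lambda t: STATUS_PRIORITY[t.status])
```

For example, given a `todo` task followed by an `in_progress` task, this helper would return the `in_progress` one, which is the ordering the test case name describes.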
parent bcce5b7ec2
commit fbb8d10305
5 changed files with 261 additions and 1 deletions
3
evals/promptfoo/.gitignore
vendored
Normal file
@@ -0,0 +1,3 @@
+output/
+*.json
+!promptfooconfig.yaml