harness
A domain-agnostic plugin for running planner / generator / evaluator loops on complex tasks. Built around the lessons from Anthropic's long-running agent harness work, but stripped of coding-specific framing so the same shape works for research briefs, outreach batches, data deliverables, strategy memos, design, and code.
What it does
You give it a project — any kind. It scaffolds a .harness/ folder with three roles:
- Planner — turns a short prompt into a high-level spec (intentionally not granular).
- Generator — builds the artifact in iterations.
- Evaluator — operates the artifact (drives the live thing, walks the citations, dry-runs the email) and grades it against a weighted rubric you wrote.
The roles communicate through JSON files on disk. Every event appends to progress.jsonl. When the generator gets stuck hill-climbing on one criterion, the runtime archives the attempt and pivots to a fresh approach. When a new model lands, you re-assess and strip the scaffolding that's no longer load-bearing.
Install
This is a Claude Code plugin. Drop it into a marketplace or install locally:
git clone git@github.com:kamilseghrouchni/harness.git ~/.claude/plugins/harness
Restart Claude Code and the eight skills under skills/ become invokable.
First run
/instantiate-harness
Walks an AskUserQuestion wizard:
1. Domain (code · research · writing · outreach · data · design · other) 2. The user prompt that triggers a run 3. The verification action the evaluator will perform 4. Top 3–5 weighted criteria 5. Model per role, session strategy, budget cap, pivot policy
Writes .harness/config.json and four prompt files. Drop 3–5 calibration examples into .harness/calibration/, then:
/run-harness
How it runs
Claude Code is the orchestrator. When you type /run-harness, the skill dispatches each role (planner, generator, evaluator) as a fresh subagent via the Task tool — each gets its own context window. State lives on disk under .harness/state/. The Python under runtime/cli/ is utility-only (validate JSON, append to JSONL, decide pivots, archive stuck attempts) — called from the skill via Bash, not the other way around.
A headless runtime/runner.py exists for automation (cron, CI), but the primary path is Claude Code.
What's in this repo
| Path | Purpose | |---|---| | .claude-plugin/plugin.json | Plugin manifest | | CLAUDE.md | Auto-loaded principles digest | | docs/principles.html | Full architecture spec — the source of truth | | skills/ | Eight skills (instantiate · run · 4× prompt-writers · tune · re-assess) | | templates/ | Files copied into .harness/ on instantiation | | runtime/cli/ | Bash-callable utilities the run-harness skill invokes | | runtime/schemas/ | JSON Schemas for plan, contract, verdict, handoff, progress | | runtime/runner.py | Optional headless orchestrator (non-interactive use) | | examples/outreach-batch/ | Worked instantiation |
Principles in one paragraph
Split work across three roles, each with its own context window. Use the file system as state, JSON as the message format. Make the evaluator skeptical and calibrated, not just told to be harsh. Have generator and evaluator negotiate a contract before any work starts. Verify by operating the artifact, not by reading a description of it. Allow throwing the work away when stuck. Read the traces — that is the tuning loop. Re-examine the harness every model release and strip what no longer earns its cost.
See docs/principles.html for the full sixteen.
License
MIT.




