agent-harness-kit

The infrastructure layer that makes AI agents production-ready.

Solo-dev harness engineering kit for Claude Code, with an experimental Codex-readable runtime surface. One command, ~30 minutes, and your hobby project gets the patterns that took OpenAI from prototype to 1M lines of agent-generated code: layered architecture, structural tests, garbage collection, review subagents, JSON feature tracking, and pre-completion checklists — without the enterprise overhead.

[npm version](https://www.npmjs.com/package/agent-harness-kit) [License: MIT](https://opensource.org/licenses/MIT)

---

The Harness Engineering Shift

February 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world" documenting how their Frontier Product Exploration team built an internal product with ~1 million lines of code over 5 months — with zero lines manually written by humans.

The results:

3 engineers → 7 engineers
~1,500 PRs merged (3.5 PRs per engineer per day)
Each engineer operating at 3-10x capacity through agent delegation
Agents running autonomously for 6+ hours per task
~1 billion tokens processed per day

The insight: The work shifted from writing code to engineering the harness — the infrastructure, constraints, and feedback loops that make agents reliable at scale.

March 2026: LangChain demonstrated this principle empirically. By improving their agent harness alone (no model changes), they jumped from 52.8% → 66.5% on Terminal-Bench 2.0, climbing 25 spots on the leaderboard.

The pattern is clear: Harness quality matters more than model choice for production outcomes.

---

Why This Kit Exists

You're a solo developer or small team. You don't have OpenAI's infrastructure budget or Stripe's agent platform team, but you can adopt the same patterns at hobby-project scale.

What you get:

Proven patterns from production harnesses — OpenAI's two-fold initializer/coding-agent split, Anthropic's CLAUDE.md table-of-contents approach, Mitchell Hashimoto's "engineer the harness" discipline
33 skills that codify rituals from teams shipping agent-generated code at scale (/add-feature, /context-query, /garbage-collection, /remember-project, /project-status, /review-this-pr, etc.)
10 read-only review subagents for cheap second-opinion passes and mandatory done-claim advice (advisor, architecture, security, reliability, performance, API consistency, trace failure, eval rubric, adapter compatibility, release readiness)
Structural enforcement via TypeScript, Python, Go, Rust, Swift, and Kotlin adapters — catch layer violations and high-risk shortcuts before they compound
Architecture fitness plugins — repo-local JSON rules for env, DB, provider, validation, and public-API boundaries with reviewer routing
Policy packs — stack-specific governance defaults for nextjs-saas, api-backend, and python-data
Cost guardrails and attribution — default budget plus provider-call cost by skill, task, and cache read/write bucket
Model routing evidence — lane-level model usage report so cheap explore lanes and stronger implementation/review lanes are measured, not guessed
Sanitized trace corpus — public success/failure traces for tiny, normal, high-risk, false-done, overbroad-edit, reviewer-gap, replay, bypass, and runtime-parity cases
JSON feature tracking (not Markdown) — Anthropic's pattern for machine-readable planning
Task contracts + evidence bundles — a feature can only move to passes: true when the current diff has machine-readable proof, concrete checks, and a diff summary
SQLite operational state — local harness.db records intake, stories, decisions, backlog, traces, friction, and trace quality without hand-editing Markdown tables
Context rules + trace scoring — phase-by-lane retrieval guidance plus minimal/standard/detailed trace quality gates for tiny, normal, and high-risk work
Orchestration contracts — multi-agent runs bind lanes, tool policy, required reviewers, task ids, and output artifacts to a checked workflow contract
Failure-to-rule records — every recurring agent miss can be captured as JSON and promoted into a durable harness prevention
Adversarial eval suite — deterministic red-team probes for fake evidence, missing high-risk attestation, protected-path bypasses, unsafe eval commands, unreviewed bypasses, and prompt-level hook bypass attempts
Pre-completion checklists — OpenAI's golden-principles garbage collection ritual, scaled to top-3 fixes per week

What this kit does NOT claim:

Structural tests don't differentiate on happy-path 1-shot tasks. When seed code shows the pattern, Claude follows it — we measured 0/6 layer violations across bare and kit arms on our ts-layered fixture (5 consecutive null benches, May 2026).
The value is in long sessions, adversarial pressure, greenfield code, and weaker models — where pattern context drifts and shortcuts become tempting. Use the lint as a safety net, not as the reason you adopted the kit.

Project Direction

The future destination is tracked in docs/HARNESS_INTELLIGENCE_ROADMAP.md. The roadmap treats the current task/evidence/review/runtime/state foundation as the baseline and lays out the next product lanes: explainable gates, replayable evidence, compiled permissions, bypass governance, adversarial evals, runtime parity scorecards, policy packs, dashboards, and PR-level enforcement.

---

Installation

Option A: One-line install (recommended)

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash

If the interactive prompt exits with "aborted by user" at the "Project name" question when run in a piped shell, rerun with defaults:

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --yes

Or run the initializer directly so the prompt owns the terminal input:

npx agent-harness-kit init

Upgrade existing installation:

curl -sL https://raw.githubusercontent.com/tuanle96/agent-harness-kit/main/install.sh | bash -s -- --upgrade

Upgrade is non-destructive: user-edited managed files get sidecars, and user-owned config is only patched for missing compatibility defaults such as project memory, task/evidence contracts, failure-learning signal bundles, orchestration/model-routing policy, and release-readiness review promotion gates. Use npx agent-harness-kit upgrade --plan to write an impact artifact under .harness/upgrades/<runId>/plan.json before applying, and npx agent-harness-kit upgrade --explain <changeId> to inspect a single planned change with category, diff summary, and rollback notes.

To delegate install or upgrade to an AI agent, generate a versioned onboarding prompt instead of hand-writing a checklist:

npx agent-harness-kit prompt install --runtime codex
npx agent-harness-kit prompt upgrade --runtime claude,codex

The prompt tells the agent to inspect the repo, run doctor, apply init/upgrade, verify runtime hooks, and prove .harness/memory/current-summary.md is injected by SessionStart. The matching machine gate is:

node .harness/scripts/check-runtime-surface.mjs --runtime=codex

Option B: Scaffold into existing repo

npx agent-harness-kit init

For the experimental Codex surface:

npx agent-harness-kit init --runtime codex
npx agent-harness-kit init --runtime claude,codex

Option C: Install as Claude Code plugin

/plugin marketplace add tuanle96/agent-harness-kit
/plugin install agent-harness-kit@agent-harness-kit-marketplace

---

What Ships

Agent Runtime Support

| Runtime | Generated surface | Status | Notes |
| ------- | ----------------- | ------ | ----- |
| Claude Code | CLAUDE.md, .claude/settings.json, .claude/skills, .claude/agents, .claude/hooks/hooks.json | Supported | Default target. Existing installs remain Claude-first. |
| Codex CLI/App | AGENTS.md, .codex/hooks.json, .codex/agents/.toml, .agents/skills, shared .harness/ files | Experimental | Renders native Codex instructions, project hooks, project TOML agents, and Codex-loadable skills without .claude/; real Codex smoke covers the surface and probes native hook artifacts. |
| Kiro CLI | .kiro/steering/harness.md, .kiro/skills//SKILL.md, .kiro/agents/.json (read-only reviewers) + primary .kiro/agents/harness.json (tools, skill:///steering resources, 5-trigger hooks), shared .harness/ files | Experimental | Renders steering rules, skill:// skills, and a per-agent hook block without .claude/. Guard enforcement works via Kiro tool-name normalization (AHK_RUNTIME=kiro). Kiro exposes only 5 of 9 lifecycle triggers and its Stop hook is advisory (cannot hard-block), so precompletion is warn-only. Use --runtime kiro. |
| Antigravity IDE | GEMINI.md, .agents/skills, .agents/hooks.json, shared .harness/ files | Experimental | Renders native Antigravity instructions, project hooks (PreToolUse, PostToolUse, PreInvocation, Stop), and shared .agents/skills/. Guard enforcement works via AHK_RUNTIME=antigravity. Antigravity exposes ~5 of 9 Claude lifecycle triggers; PreCompact, SubagentStop, SessionEnd, UserPromptSubmit have no equivalent. Use --runtime antigravity. |
| Dual/Multi target | Any combination of CLAUDE.md, AGENTS.md, GEMINI.md plus shared .harness/* files | Experimental | Use --runtime claude,codex, --runtime claude,antigravity, etc. |

Run npm run report:runtime-parity in this repository, or node .harness/scripts/runtime-parity-report.mjs in a generated project, to publish the evidence-gated scorecard. Current release evidence scores Claude at 11/11 capabilities and Codex at 11/12 weighted capabilities. Codex is passing for skill rendering, hook availability, mutation guards including apply_patch, evidence/advisor/task-contract gates, transcript parsing, telemetry, and orchestration run support. Codex remains experimental for native hook-fire conformance and native subagent/reviewer artifact capture.

Add --fail-partial when a release lane should fail on any partial runtime capability instead of reporting it as a warning. Partial rows include machine-readable promotion criteria and next steps so the experimental surface is explicit instead of buried in prose.

Set AHK_E2E_CODEX_REQUIRE_HOOKS=1 when running the Codex E2E driver to promote the native hook probe into a blocking check. The driver initializes a real git workspace and prints Codex feature/JSONL diagnostics when lifecycle hook artifacts are missing. Set AHK_E2E_CODEX_REQUIRE_REVIEWER_ARTIFACT=1 to make the Codex reviewer decision artifact probe blocking as well. For kit release lanes, use npm run check:codex-parity-probes -- --hooks, npm run check:codex-parity-probes -- --reviewer, or npm run check:codex-parity-probes -- --all --json to run the strict probes through one repeatable wrapper.
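The difference between the default mode and --fail-partial can be pictured with a small sketch. The row shape (capability, status) and the function name parityExitCode are assumptions made for illustration; the kit's actual report format may differ.

```javascript
// Hypothetical scorecard rows: { capability: string, status: "pass" | "partial" | "fail" }.
// This is an illustrative sketch, not the kit's real report schema.
function parityExitCode(rows, { failPartial = false } = {}) {
  const failed = rows.filter((row) => row.status === "fail");
  const partial = rows.filter((row) => row.status === "partial");
  // A hard failure always fails the lane.
  if (failed.length > 0) return 1;
  // Partial capabilities are warnings unless the caller opts into strict mode.
  if (failPartial && partial.length > 0) return 1;
  return 0;
}
```

With one partial row and no failures, the sketch returns 0 by default and 1 when failPartial is set, which mirrors the warning-versus-blocking behavior described above.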

Run npm run report:runtime-conformance in this repository, or node .harness/scripts/runtime-conformance.mjs in a generated project, to publish adapter conformance results. The suite checks that each advertised runtime target renders its install surface, loads skills, wires hooks, blocks protected mutations, discovers task contracts, gates false done claims, captures review/advisor artifacts, records telemetry, and supports bounded orchestration.

Run npm run check:trace-corpus in this repository, or node .harness/scripts/check-trace-corpus.mjs in a generated project, to validate the sanitized trace corpus used by eval tasks and model-routing outcome summaries.

Skills (33)

Slash commands that codify production harness rituals:

| Command | Purpose |
| -------------------------------- | -------------------------------------------------------- |
| /add-feature <description> | Implement one item from .harness/feature_list.json |
| /add-adr | Add a numbered Architecture Decision Record |
| /benchmark-suite | Run Mini SWE-bench style harness regression tasks |
| /context-health | Inspect context usage, token budget, and compaction risk |
| /context-query <question> | Build a source-linked context packet for codebase questions |
| /create-story | Create an acceptance-tested Story Packet |
| /debug-flow | Run the failing flow before fixing it |
| /deliver-html | Ship an analysis/audit/plan as a self-contained HTML |
| /doc-drift-scan | Find stale path/command references in docs/ |
| /eval-rubric-author | Add deterministic checks plus evidence-backed rubrics |
| /eval-runner | Regression-test the harness itself |
| /feature-intake | Classify new work before implementation |
| /garbage-collection | Friday cleanup (top-3 fixes only at solo scale) |
| /harness-improvement-loop | Turn trace-backed failures into measured harness changes |
| /i18n-add-locale <code> | Scaffold a new translation locale for skills + CLAUDE.md |
| /inspect-app | Boot dev server + drive the failing flow before edits |
| /inspect-module <path> | Map a module before editing |
| /map-domain | Render layer config + flag config-vs-filesystem drift |
| /middleware-pipeline | Use retry/cache/timeout/telemetry/budget middleware |
| /model-profile | Compare model profiles by pass rate, cost, and latency |
| /orchestrate | Select or run a multi-agent workflow pattern |
| /propose-harness-improvement | Convert an agent failure into a permanent prevention |
| /project-status | Render project state, memory, and harness health to HTML |
| /refactor-feature | Restructure .harness/feature_list.json with proof gate |
| /remember-project | Store durable decisions, risks, scope, and handoff notes |
| /regression-benchmark | Run Tier 2 isolated and multi-session regression benchmarks |
| /review-this-pr | Deterministic diff review against the current base |
| /setup-nightly-eval | Enable the nightly eval GitHub Actions workflow |
| /skill-discovery | Index skills and load full instructions on demand |
| /structural-test-author | Codify a new architectural rule mechanically |
| /trace-analyzer | Classify eval/session failures from trace evidence |
| /verify-ui | Run browser validation with screenshots and network logs |
| /write-skill | Create a new SKILL.md with valid frontmatter |

Review Subagents (10)

Read-only personas for second-opinion passes. Required reviewers emit structured JSON decisions matching .harness/schemas/review-decision.schema.json; the evidence gate blocks passes: true when a required reviewer has no passing decision proof, and the review coverage gate checks checkedInvariants, diffCoverage, confidence, and empty unreviewedRiskAreas.

advisor — mandatory advisor gate before done claims and high-risk triggers
architecture-reviewer — layering, coupling, cohesion
adapter-compatibility-reviewer — adapter claims, render paths, tests
api-consistency-reviewer — naming, versioning, breaking changes
eval-rubric-reviewer — deterministic checks and evidence-backed rubrics
security-reviewer — OWASP Top 10, auth, secrets
reliability-reviewer — error handling, retries, observability
performance-reviewer — N+1 queries, caching, indexing
release-harness-reviewer — package, installer, npm, and release truth
trace-failure-analyst — eval, regression, hook, and session failure triage
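
To make the review coverage gate described above more concrete, here is a minimal sketch of the kind of check it could perform on one reviewer decision. The field names checkedInvariants, diffCoverage, confidence, and unreviewedRiskAreas come from this README; their types, the decision field, the numeric confidence scale, and the 0.7 threshold are assumptions, and the real schema lives in .harness/schemas/review-decision.schema.json.

```javascript
// Illustrative only: returns a list of human-readable problems, empty when the
// decision would pass this hypothetical coverage check.
function reviewCoverageProblems(decision, changedFiles, minConfidence = 0.7) {
  const problems = [];
  if (decision.decision !== "pass") {
    problems.push("decision is not a pass");
  }
  if (!Array.isArray(decision.checkedInvariants) || decision.checkedInvariants.length === 0) {
    problems.push("no checked invariants recorded");
  }
  // Every changed file should appear in the reviewer's diff coverage.
  const covered = new Set(decision.diffCoverage || []);
  const missing = changedFiles.filter((file) => !covered.has(file));
  if (missing.length > 0) {
    problems.push(`diffCoverage misses: ${missing.join(", ")}`);
  }
  // A missing or non-numeric confidence fails the comparison and is reported.
  if (!(decision.confidence >= minConfidence)) {
    problems.push("confidence below threshold");
  }
  if ((decision.unreviewedRiskAreas || []).length > 0) {
    problems.push("unreviewed risk areas remain");
  }
  return problems;
}
```

Returning a list of problems rather than a boolean keeps the block actionable: a caller can print each reason next to the reviewer that produced it.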

Hooks (9 event groups)

SessionStart: Inject compact project context on startup/resume/compact.
UserPromptSubmit: Block prompt patterns that bypass harness safety.
PreToolUse: Guard risky Bash/edit operations and enforce per-skill permission policy before tools run.
Notification: Notify on blocking states.
PostToolUse: Run structural checks after edits and record skill telemetry.
PreCompact: Snapshot state before context compaction.
Stop: Pre-completion checklist with stop_hook_active loop guard, including task/evidence validation for new passes: true claims and verbal done claims on the active task.
SubagentStop: Re-check structural state after subagent work.
SessionEnd: Roll up session telemetry and append a semantic project-memory summary.
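
The Stop hook's loop guard can be sketched as a small decision function. This assumes the Claude Code hook convention in which the Stop hook receives a JSON payload with a stop_hook_active flag and can answer with a decision of "block" plus a reason; the function name decideStop and the failures array are hypothetical, and the kit's shipped hook is a shell script (precompletion-checklist.sh), not this code.

```javascript
// Illustrative sketch of a Stop-hook decision with a loop guard.
// hookInput: parsed JSON payload from the runtime (assumed to carry stop_hook_active).
// failures: checklist items that did not pass (hypothetical input).
function decideStop(hookInput, failures) {
  // If an earlier Stop hook already forced the agent to continue, do not block
  // again; otherwise the agent could be held in an endless stop/continue loop.
  if (hookInput.stop_hook_active) return {};
  if (failures.length === 0) return {};
  return {
    decision: "block",
    reason: "Pre-completion checklist failed:\n- " + failures.join("\n- "),
  };
}
```

A real hook would read the payload from stdin and print the returned object as JSON on stdout; an empty object lets the agent stop normally.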

Adapters (6)

TypeScript adapter: ts-morph + eslint-plugin-boundaries + dependency-cruiser
Python adapter: libcst + import-linter
Go adapter: go-parser structural checks + shared eval runner
Rust adapter: rust-lexer structural checks + shared eval runner
Swift adapter: swift-lexer structural checks + shared eval runner
Kotlin adapter: kotlin-lexer structural checks + shared eval runner

Ownership policy

User-owned files are never clobbered on init or upgrade: CLAUDE.md, AGENTS.md, .kiro/steering/harness.md, .harness/docs/architecture.md, .harness/docs/core-beliefs.md, .harness/docs/golden-principles.md, .harness/docs/tech-debt-tracker.md, .harness/feature_list.json, .harness/project/state.json, .harness/memory/ledger.jsonl, .harness/config.json.

Generated runtime files are kit-owned but existing local copies are preserved on first install: .codex/hooks.json, .codex/agents/, .agents/skills/, .kiro/agents/, .kiro/skills/.

Generated mutable files are kit-owned but expected to change at runtime: .harness/installed.json, .harness/PROGRESS.md, .harness/structural-baseline.json, .harness/memory/current-summary.md, .harness/project/status.html, .harness/project/handoff.json, .harness/state/harness.db, .harness/state/harness.db-, .codex/state/.

Projects can extend that protected set in .harness/config.json:

{
  "ownership": {
    "userOwnedFiles": [".harness/docs/local-runbook.md"],
    "generatedMutableFiles": [".harness/custom-state.json"]
  }
}
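
As a rough illustration of how such an extension could combine with the built-in lists, the sketch below classifies a path using exact matches only. The DEFAULT_USER_OWNED array is a small subset of the user-owned files listed above, and classifyPath and the "kit-managed" label are hypothetical names; the kit's real resolution logic may handle globs and directories differently.

```javascript
// Subset of the documented user-owned files, for illustration only.
const DEFAULT_USER_OWNED = [
  "CLAUDE.md",
  "AGENTS.md",
  ".harness/feature_list.json",
  ".harness/config.json",
];

// Classify a repo-relative path given a parsed .harness/config.json object.
function classifyPath(path, config = {}) {
  const ownership = config.ownership || {};
  const userOwned = new Set([...DEFAULT_USER_OWNED, ...(ownership.userOwnedFiles || [])]);
  const mutable = new Set(ownership.generatedMutableFiles || []);
  if (userOwned.has(path)) return "user-owned"; // never overwritten on init or upgrade
  if (mutable.has(path)) return "generated-mutable"; // expected to change at runtime
  return "kit-managed"; // hypothetical label for everything else
}
```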

Eval Harness

Deterministic checks first, rubric second. The default runner grades: acceptance commands, outcome, process, style, and efficiency. New eval tasks should put product truth in expected.acceptanceChecks[]; model-assisted rubrics are only for evidence that commands cannot judge. npm run check:eval-tasks validates every eval task has at least one deterministic truth signal or an explicit justification for why it cannot. Acceptance commands must be concrete and non-destructive, and expected.requiredFiles[] must use repo-relative local paths. A rubric alone is not product truth.
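
The two rules in the paragraph above (at least one deterministic truth signal or a justification, and repo-relative required files) can be sketched as follows. The fields expected.acceptanceChecks and expected.requiredFiles come from this README; the justification field name noDeterministicSignalReason and both function names are hypothetical.

```javascript
// True when a path is relative, stays inside the repo, and is not a URL.
function isRepoRelative(p) {
  return (
    typeof p === "string" &&
    p.length > 0 &&
    !p.startsWith("/") && // POSIX absolute path
    !/^[A-Za-z]:[\\/]/.test(p) && // Windows drive path
    !/^[a-z][a-z0-9+.-]*:\/\//i.test(p) && // URL with a scheme
    !p.split(/[\\/]/).includes("..") // parent-directory traversal
  );
}

// Returns a list of problems for one eval task; empty means the sketch accepts it.
function evalTaskProblems(task) {
  const expected = task.expected || {};
  const checks = expected.acceptanceChecks || [];
  const files = expected.requiredFiles || [];
  const reason = (task.noDeterministicSignalReason || "").trim(); // hypothetical field
  const problems = [];
  if (checks.length === 0 && files.length === 0 && reason.length === 0) {
    problems.push("no deterministic truth signal and no justification");
  }
  for (const file of files) {
    if (!isRepoRelative(file)) problems.push(`requiredFiles entry is not repo-relative: ${file}`);
  }
  return problems;
}
```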

npm run check:adversarial runs deterministic red-team probes against the installed/template harness surface. The suite proves known failure modes still block: placeholder evidence, high-risk evidence without attestation, direct passes: true edits, hook deletion, apply_patch moves into protected paths, direct baseline truncation, command substitution hidden inside read commands, unsafe eval acceptance commands, unreviewed bypass logs, reviewer coverage gaps, ADR gate gaps, stale current-diff evidence, prompt-level hook bypass attempts, and high-risk worktree isolation gaps. Generated projects get the same gate as npm run harness:adversarial.

npm run check:skill-examples validates golden and anti-example traces for the core skills. Each covered skill carries examples/good.trace.jsonl, examples/good.evidence.json, examples/bad.false-done.trace.jsonl, and examples/bad.overbroad-edit.trace.jsonl; skill-discovery indexes those paths, and skill-load <skill> --examples loads them only when needed.

npm run check:review-coverage validates structured reviewer pass decisions against task scope and changed source/config files. It rejects stale diffCoverage, high-risk reviewer passes with low confidence, security passes that skip auth/session trust-boundary files, and architecture passes that claim high confidence while skipping impacted layers.

npm run check:architecture-fitness loads deterministic rule plugins from .harness/fitness/rules/*.json. The shipped rules cover raw env access, DB-in-UI shortcuts, provider SDK bypasses, unvalidated request boundaries, and internal/deep imports. Each failing finding names the file, rule id, owner reviewer, failure taxonomy class, and prevention category so the block is actionable instead of opaque.

agent-harness-kit init --pack nextjs-saas applies stack-specific policy packs. The shipped packs cover Next.js SaaS apps, TypeScript/JavaScript API backends, and Python data pipelines. Selected packs render under .harness/policy-packs/<id>/, wire their fitness rules into architectureFitness.rulePaths, and validate with agent-harness-kit pack validate or .harness/scripts/check-policy-packs.mjs. Third-party packs can also be validated from a custom --packs-dir without copying them into core templates. agent-harness-kit pack publish --dry-run builds a bounded publish plan, hashes the pack files, and rejects symlinks, scripts, oversized bundles, unreferenced rules, and unsafe paths before any registry upload is implemented.
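
One part of that dry-run, the path filtering, can be sketched as a pure function. The allow-list mirrors the policyPackPublishing.allowedFiles values shown in the configuration section of this README; the function names and the rule that * matches within a single path segment are assumptions, and symlink, size, and hash checks are left out because they need filesystem access.

```javascript
// Allow-list copied from the documented policyPackPublishing.allowedFiles values.
const ALLOWED_FILES = ["pack.json", "README.md", "LICENSE", "LICENSE.md", "fitness-rules/*.json"];

// Convert a simple pattern to a RegExp; "*" matches any characters except "/".
function patternToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, "[^/]*");
  return new RegExp(`^${escaped}$`);
}

// Returns null when the pack-relative path is acceptable, else a short reason.
function rejectReason(relPath, allowed = ALLOWED_FILES) {
  if (relPath.startsWith("/") || relPath.split("/").includes("..")) {
    return "unsafe path";
  }
  if (!allowed.some((pattern) => patternToRegExp(pattern).test(relPath))) {
    return "not in allowedFiles";
  }
  return null;
}
```

Under these assumptions, a script such as scripts/run.sh is rejected simply because it matches no allowed pattern, which is one way an allow-list can enforce the "rejects scripts" rule without a separate deny-list.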

.harness/schema-policy.json records the stable-schema contract for durable harness artifacts. check-stable-schemas.mjs validates schema versions, migration/deprecation rules, changelog coverage, and generated-install compatibility so task/evidence/review/failure records remain readable across upgrades.

---

Directory Structure

your-repo/
├── CLAUDE.md                          # 50–80 line table of contents
├── AGENTS.md                          # native Codex entrypoint when Codex target is enabled
├── .claude/
│   ├── settings.json
│   ├── skills/                        # 33 skills with SKILL.md + skill.json contracts
│   ├── agents/                        # 10 reviewer personas
│   └── hooks/hooks.json
├── .harness/
│   ├── config.json
│   ├── permissions.json               # per-skill tool allow/deny matrix
│   ├── skill-registry.json            # version/capability registry
│   ├── feature_list.json              # JSON, not Markdown — Anthropic pattern
│   ├── feature-list.schema.json        # feature state schema used by task evidence checks
│   ├── task-contracts/                # task contracts for agent-sized work
│   ├── orchestration/
│   │   ├── contracts/                 # checked workflow contracts + default example
│   │   └── <run-id>/                  # manifests, transcripts, summaries
│   ├── evidence/                      # done-proof bundles for completed work
│   ├── reviews/                       # structured reviewer decisions
│   ├── failures/
│   │   ├── taxonomy.json              # failure-to-rule learning loop
│   │   └── records/                   # promoted failure records
│   ├── schemas/
│   │   ├── task-contract.schema.json
│   │   ├── orchestration-contract.schema.json
│   │   ├── evidence-bundle.schema.json
│   │   ├── review-decision.schema.json
│   │   ├── eval-task.schema.json
│   │   ├── failure-record.schema.json
│   │   └── policy-pack.schema.json
│   ├── project/
│   │   ├── state.json                 # phases, MVP scope, risks, decisions, checklists
│   │   ├── status.html                # generated human dashboard
│   │   └── handoff.json               # optional portable team handoff
│   ├── memory/
│   │   ├── ledger.jsonl               # append-only shared project memory
│   │   └── current-summary.md         # compact SessionStart memory summary
│   ├── state/
│   │   └── schema.sql                 # SQLite operational-state schema; harness.db is local runtime state
│   ├── docs/
│   │   ├── architecture.md
│   │   ├── context-rules.md
│   │   ├── core-beliefs.md
│   │   ├── operational-state.md
│   │   ├── architecture-fitness.md
│   │   ├── strictness-ladder.md
│   │   ├── policy-pack-authoring.md
│   │   ├── runtime-parity-scorecard.md
│   │   ├── team-pr-adoption.md
│   │   ├── golden-principles.md
│   │   ├── trace-quality.md
│   │   ├── telemetry-schema.md
│   │   ├── tech-debt-tracker.md
│   │   └── adr/
│   │       └── 0001-use-agent-harness-kit.md
│   ├── installed.json                 # kit lockfile (sha-tracked)
│   ├── PROGRESS.md                    # session log
│   ├── scripts/
│   │   ├── structural-test-on-edit.sh # PostToolUse hook target
│   │   ├── precompletion-checklist.sh # Stop hook target
│   │   ├── pretooluse-skill-permission-guard.mjs
│   │   ├── check-structural-baseline.mjs
│   │   ├── check-hook-integrity.mjs
│   │   ├── check-skill-contracts.mjs
│   │   ├── check-skill-examples.mjs
│   │   ├── check-trace-corpus.mjs
│   │   ├── check-review-coverage.mjs
│   │   ├── check-architecture-fitness.mjs
│   │   ├── check-policy-packs.mjs
│   │   ├── check-stable-schemas.mjs
│   │   ├── policy-pack-publish.mjs
│   │   ├── bypass.mjs
│   │   ├── check-bypass-audit.mjs
│   │   ├── report-harness-noise.mjs
│   │   ├── pr-annotations.mjs
│   │   ├── check-eval-tasks.mjs
│   │   ├── check-adversarial-suite.mjs
│   │   ├── check-failure-records.mjs
│   │   ├── improvement-bundle.mjs
│   │   ├── check-orchestration-contracts.mjs
│   │   ├── orchestration-contract-from-task.mjs
│   │   ├── record-failure.mjs
│   │   ├── record-review-failures.mjs
│   │   ├── task-evidence-check.mjs
│   │   ├── check-evidence-attestation.mjs
│   │   ├── harness-readiness.mjs
│   │   ├── harness-state.mjs
│   │   ├── strictness.mjs
│   │   ├── orchestration-schema-check.mjs
│   │   ├── session-replay.mjs
│   │   ├── model-routing-report.mjs
│   │   ├── runtime-parity-report.mjs
│   │   ├── runtime-conformance.mjs
│   │   ├── project-memory.mjs
│   │   ├── project-status-report.mjs
│   │   ├── cost-tracker.mjs
│   │   ├── dev-up.sh
│   │   ├── pre-push.sh
│   │   └── install-git-hooks.sh
│   ├── fitness/
│   │   └── rules/
│   │       ├── no-raw-env-outside-config.json
│   │       ├── no-db-in-ui.json
│   │       ├── no-provider-bypass.json
│   │       ├── require-validation-at-boundary.json
│   │       └── no-internal-deep-imports.json
│   ├── policy-packs/
│   │   ├── api-backend/
│   │   ├── nextjs-saas/
│   │   └── python-data/
│   ├── trace-corpus/
│   │   ├── success-tiny-doc-fix.json
│   │   └── runtime-parity-failure.json
│   └── structural-baseline.json       # existing-violation baseline

---

Configuration (`.harness/config.json`)

{
  "version": "0.1.0",
  "language": "typescript",
  "framework": "nextjs",
  "preset": "nextjs",
  "domains": [
    {
      "name": "default",
      "root": "src",
      "layerDirPattern": "{layer}",
      "useIdentPattern": "{layer}",
      "layers": ["types", "config", "repo", "service", "runtime", "ui"]
    }
  ],
  "providers": ["auth", "telemetry", "feature-flags"],
  "structuralTest": {
    "engine": "ts-morph",
    "blockOnViolation": true,
    "rules": [
      {
        "id": "no-raw-env-outside-config",
        "kind": "no-raw-env",
        "allowLayers": ["config"]
      },
      {
        "id": "no-db-in-ui",
        "kind": "no-db-in-ui",
        "uiLayers": ["ui"],
        "dbLayers": ["repo"]
      },
      {
        "id": "no-provider-bypass",
        "kind": "no-provider-bypass",
        "allowLayers": ["config"],
        "allowPaths": ["src/providers/**", "src/telemetry/**", "src/instrumentation.*"]
      },
      {
        "id": "no-dynamic-import-in-layered-code",
        "kind": "no-dynamic-import",
        "enabled": false
      }
    ]
  },
  "architectureFitness": {
    "enabled": true,
    "rulesDir": ".harness/fitness/rules",
    "checker": ".harness/scripts/check-architecture-fitness.mjs",
    "blockOnViolation": true,
    "includeExamples": true
  },
  "policyPacks": {
    "enabled": true,
    "packsDir": ".harness/policy-packs",
    "schemaPath": ".harness/schemas/policy-pack.schema.json",
    "validator": ".harness/scripts/check-policy-packs.mjs",
    "selected": []
  },
  "policyPackPublishing": {
    "publisher": ".harness/scripts/policy-pack-publish.mjs",
    "dryRunOnly": true,
    "allowedFiles": [
      "pack.json",
      "README.md",
      "LICENSE",
      "LICENSE.md",
      "fitness-rules/*.json"
    ]
  },
  "schemaPolicy": {
    "policyPath": ".harness/schema-policy.json",
    "checker": ".harness/scripts/check-stable-schemas.mjs",
    "schemasDir": ".harness/schemas",
    "changelogSection": "Schema Compatibility",
    "minimumDeprecationDays": 90,
    "minimumMinorReleases": 2
  },
  "structuralBaseline": {
    "baselinePath": ".harness/structural-baseline.json",
    "checker": ".harness/scripts/check-structural-baseline.mjs",
    "decreasingOnly": true,
    "compareRef": "HEAD",
    "maxEntries": null
  },
  "hookIntegrity": {
    "checker": ".harness/scripts/check-hook-integrity.mjs"
  },
  "evals": {
    "tasksDir": ".harness/eval/tasks",
    "schemaPath": ".harness/schemas/eval-task.schema.json",
    "checker": ".harness/scripts/check-eval-tasks.mjs",
    "scheduleCron": "0 6 * * *",
    "dimensions": ["outcome", "process", "style", "efficiency"],
    "truthSignals": ["acceptanceChecks", "structuralTest", "requiredFiles", "skillsInvoked", "rubric"]
  },
  "garbageCollection": {
    "frequency": "weekly",
    "maxFixesPerRun": 3,
    "scope": ["dead-imports", "duplicate-utils", "layer-violations", "doc-drift"]
  },
  "agentRuntime": {
    "targets": ["claude"],
    "primary": "claude",
    "claude": {
      "instructionFile": "CLAUDE.md",
      "hooks": true,
      "skills": true,
      "agents": true
    },
    "codex": {
      "instructionFile": "AGENTS.md",
      "hooks": false,
      "skills": false,
      "agents": false
    }
  },
  "ownership": {
    "userOwnedFiles": [],
    "generatedMutableFiles": []
  },
  "goldenPrinciples": ".harness/docs/golden-principles.md",
  "projectMemory": {
    "enabled": true,
    "ledgerPath": ".harness/memory/ledger.jsonl",
    "summaryPath": ".harness/memory/current-summary.md",
    "maxSummaryEvents": 5,
    "redactSecrets": true
  },
  "projectManagement": {
    "enabled": true,
    "statePath": ".harness/project/state.json",
    "statusReportPath": ".harness/project/status.html",
    "handoffPath": ".harness/project/handoff.json"
  },
  "operationalState": {
    "enabled": true,
    "dbPath": ".harness/state/harness.db",
    "schemaPath": ".harness/state/schema.sql",
    "script": ".harness/scripts/harness-state.mjs",
    "retention": {
      "maxAgeDays": 30,
      "redactExports": true
    },
    "traceQuality": {
      "checker": ".harness/scripts/harness-state.mjs trace-quality --strict",
      "minimumByLane": { "tiny": "minimal", "normal": "standard", "high-risk": "detailed" }
    }
  },
  "failureLearning": {
    "enabled": true,
    "taxonomyPath": ".harness/failures/taxonomy.json",
    "recordsDir": ".harness/failures/records",
    "recordSchemaPath": ".harness/schemas/failure-record.schema.json",
    "signalBundler": ".harness/scripts/improvement-bundle.mjs",
    "checker": ".harness/scripts/check-failure-records.mjs",
    "recorder": ".harness/scripts/record-failure.mjs",
    "reviewRecorder": ".harness/scripts/record-review-failures.mjs",
    "maxProposedAgeDays": 14
  },
  "taskContracts": {
    "enabled": true,
    "contractsDir": ".harness/task-contracts",
    "schemaPath": ".harness/schemas/task-contract.schema.json",
    "evidenceDir": ".harness/evidence",
    "reviewsDir": ".harness/reviews",
    "evidenceSchemaPath": ".harness/schemas/evidence-bundle.schema.json",
    "reviewDecisionSchemaPath": ".harness/schemas/review-decision.schema.json",
    "checker": ".harness/scripts/task-evidence-check.mjs",
    "reviewCoverageChecker": ".harness/scripts/check-review-coverage.mjs",
    "stopActiveEvidence": "on-claim",
    "requireActiveTaskForMutationTargets": true
  },
  "evidenceAttestation": {
    "checker": ".harness/scripts/check-evidence-attestation.mjs",
    "verifyHashes": true,
    "replayPlan": true,
    "requireForPassingEvidence": true
  },
  "sessionIsolation": {
    "enabled": true,
    "checker": ".harness/scripts/check-session-isolation.mjs",
    "preparer": ".harness/scripts/prepare-session-worktree.mjs",
    "activeTaskEnv": "AHK_ACTIVE_TASK",
    "activeTaskPath": ".harness/active-task.json",
    "activeTaskEnvPath": ".harness/active-task.env",
    "worktreesDir": "../.agent-worktrees",
    "manifestDir": ".harness/sessions",
    "protectedBranches": ["main", "master", "develop", "release/*"],
    "branchPrefixes": ["agent/", "codex/"],
    "requireLinkedWorktree": true,
    "requireForRiskTiers": ["high-risk"],
    "requireForMutationTargets": true,
    "cleanupOnSessionEnd": false
  },
  "orchestration": {
    "enabled": true,
    "contractsDir": ".harness/orchestration/contracts",
    "runsDir": ".harness/orchestration",
    "schemaPath": ".harness/schemas/orchestration-contract.schema.json",
    "checker": ".harness/scripts/check-orchestration-contracts.mjs",
    "runtimeValidator": ".harness/scripts/orchestration-schema-check.mjs",
    "maxConcurrency": 3,
    "maxAgents": 6,
    "allowedPatterns": ["pipeline", "fanout", "fanin", "expert-pool", "red-team", "supervisor"],
    "requireTaskForMutation": true,
    "requireReviewerLanes": true
  },
  "models": {
    "main": "claude-sonnet-4-6",
    "reviewers": "claude-sonnet-4-6",
    "explore": "claude-haiku-4-5"
  },
  "modelRouting": {
    "reporter": ".harness/scripts/model-routing-report.mjs",
    "lanes": [
      { "id": "review", "expectedModel": "claude-sonnet-4-6", "matchSkills": ["review-this-pr", "*-reviewer"] },
      { "id": "high-risk", "expectedModel": "claude-sonnet-4-6", "riskTiers": ["high-risk"] },
      { "id": "explore", "expectedModel": "claude-haiku-4-5", "matchSkills": ["inspect-*", "map-domain", "context-query"] }
    ]
  },
  "traceCorpus": {
    "enabled": true,
    "corpusDir": ".harness/trace-corpus",
    "schemaPath": ".harness/schemas/trace-corpus-entry.schema.json",
    "checker": ".harness/scripts/check-trace-corpus.mjs",
    "requiredCases": ["success-tiny", "success-normal", "success-high-risk", "false-done", "overbroad-edit", "missing-reviewer", "evidence-replay-failure", "bypass-approval-flow", "runtime-parity-failure"],
    "redactSecrets": true
  },
  "readiness": {
    "reporter": ".harness/scripts/harness-readiness.mjs",
    "strictRelease": true,
    "strictDoctor": false,
    "reviewPromotion": "fail",
    "gates": [
      { "id": "structural-baseline", "command": "node .harness/scripts/check-structural-baseline.mjs", "required": true },
      { "id": "hook-integrity", "command": "node .harness/scripts/check-hook-integrity.mjs", "required": true },
      { "id": "structural", "command": "npm run --silent harness:check", "required": true },
      { "id": "skill-contracts", "command": "node .harness/scripts/check-skill-contracts.mjs", "required": true },
      { "id": "skill-examples", "command": "node .harness/scripts/check-skill-examples.mjs", "required": true },
      { "id": "trace-corpus", "command": "node .harness/scripts/check-trace-corpus.mjs", "required": true },
      { "id": "review-coverage", "command": "node .harness/scripts/check-review-coverage.mjs --strict", "required": true },
      { "id": "architecture-fitness", "command": "node .harness/scripts/check-architecture-fitness.mjs --strict", "required": true },
      { "id": "policy-packs", "command": "node .harness/scripts/check-policy-packs.mjs", "required": true },
      { "id": "permissions-drift", "command": "node .harness/scripts/check-permissions-drift.mjs", "required": true },
      { "id": "bypass-audit", "command": "node .harness/scripts/check-bypass-audit.mjs --strict", "required": true },
      { "id": "eval-tasks", "command": "node .harness/scripts/check-eval-tasks.mjs", "required": true },
      { "id": "adversarial-suite", "command": "node .harness/scripts/check-adversarial-suite.mjs", "required": true },
      { "id": "failure-records", "command": "node .harness/scripts/check-failure-records.mjs", "required": true },
      { "id": "operational-state", "command": "node .harness/scripts/harness-state.mjs check --strict", "required": true },
      { "id": "harness-report", "command": "node .harness/scripts/harness-report.mjs --json --fail-on=fail --review-promotion=fail", "required": true },
      { "id": "orchestration-contracts", "command": "node .harness/scripts/check-orchestration-contracts.mjs --strict", "required": true },
      { "id": "session-isolation", "command": "node .harness/scripts/check-session-isolation.mjs --strict", "required": true },
      { "id": "task-evidence", "command": "node .harness/scripts/task-evidence-check.mjs --strict", "required": true },
      { "id": "evidence-attestation", "command": "node .harness/scripts/check-evidence-attestation.mjs --strict", "required": true },
      { "id": "model-routing", "command": "node .harness/scripts/model-routing-report.mjs --strict", "required": false },
      { "id": "runtime-parity", "command": "node .harness/scripts/runtime-parity-report.mjs --strict", "required": false }
    ]
  },
  "budgets": { "perRunUsd": 2.0, "perDayUsd": 10.0 }
}

---

Philosophy (5 Axioms)

1. CLAUDE.md is a table of contents, not an encyclopedia

HumanLayer measured ~150–200 instructions as the reliable cap; OpenAI's own root file is ~100 lines. This kit's CLAUDE.md is 50–80 lines.

2. Every agent failure becomes a permanent harness change

Mitchell Hashimoto's "engineer the harness" discipline. The /propose-harness-improvement skill enforces this. Failures are classified with .harness/failures/taxonomy.json and promoted through .harness/failures/records/*.json so repeated mistakes become a skill, hook, structural rule, eval task, permission policy, or docs patch. Each taxonomy class has a preferred prevention target; records that choose a different target must include preventionJustification so the learning loop does not drift into arbitrary cleanup. The checker also verifies that proposedPrevention.path matches the selected target: for example, eval-task points at eval/regression tasks and permission-policy points at permissions, task contracts, or the permission guard.

Use .harness/scripts/record-failure.mjs to write the record directly, or .harness/scripts/record-review-failures.mjs to convert block and needs-human review decision artifacts under .harness/reviews/ into proposed records with source=review. The package CLI wraps the same flow: npx agent-harness-kit failure propose --from-review <path>, npx agent-harness-kit failure promote <recordId>, and npx agent-harness-kit failure verify <recordId>. The verify command runs the record's stored proposedPrevention.verificationCommand when no observed result is provided.

Proposed records now receive a concrete proposedPrevention template for docs, skill, hook, structural rule, eval task, permission policy, or reviewer/subagent updates, so the follow-up is explicit: inspect the record, implement the prevention artifact, promote it, then rerun the checker and strict report. Before release, npm run check:failure-records validates the taxonomy and every promotion record. Records in the applied state must already include the deterministic verification command that will prove the prevention; records in the verified state add the observed result from running it. Proposed records older than maxProposedAgeDays fail the checker so the learning loop cannot become a silent backlog.
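The staleness rule can be sketched roughly as follows. This is a minimal illustration, not the shipped checker: the record fields status and createdAt and the function name findStaleProposed are hypothetical, and only maxProposedAgeDays comes from the config.

```javascript
// Hypothetical record shape: { id, status, createdAt }. Only maxProposedAgeDays
// is taken from .harness/config.json; everything else is assumed for illustration.
const DAY_MS = 24 * 60 * 60 * 1000;

function findStaleProposed(records, maxProposedAgeDays, now = new Date()) {
  return records.filter((record) => {
    if (record.status !== "proposed") return false;
    const ageDays = (now.getTime() - new Date(record.createdAt).getTime()) / DAY_MS;
    return ageDays > maxProposedAgeDays;
  });
}

const records = [
  { id: "f-001", status: "proposed", createdAt: "2026-01-01T00:00:00Z" },
  { id: "f-002", status: "proposed", createdAt: "2026-01-20T00:00:00Z" },
  { id: "f-003", status: "verified", createdAt: "2025-12-01T00:00:00Z" },
];

// With a 14-day limit and "now" fixed at 25 January, only f-001 is stale.
const stale = findStaleProposed(records, 14, new Date("2026-01-25T00:00:00Z"));
console.log(stale.map((record) => record.id));
```

Passing the clock in as a parameter keeps the rule deterministic, which matters for a check that gates releases.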

3. Computational sensors as safety net

Fowler/Böckeler's architectural fitness functions. The TypeScript, Python, Go, Rust, Swift, and Kotlin adapters ship deterministic structural checks; LLM subagents are reserved for semantic judgment.

TypeScript scaffolds include a small governance rule pack: env access must go through the config layer, UI cannot import repository/database clients directly, provider SDKs must go through provider boundary modules, and strict projects can enable a dynamic-import ban so the structural test can see dependencies statically.
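As a rough illustration of the first rule (env access must go through the config layer), the sketch below flags files outside an assumed src/config/ directory that read process.env. The shipped TypeScript runner is built on ts-morph; a regular expression stands in for it here, and the directory prefix, file shape, and function name are all assumptions.

```javascript
// Assumption: the config layer lives under src/config/. The real runner uses
// ts-morph; this regular expression is a simplified stand-in for illustration.
const ENV_ACCESS = /\bprocess\.env\b/;

function findEnvAccessViolations(files, configLayerPrefix = "src/config/") {
  return files
    .filter((file) => !file.path.startsWith(configLayerPrefix))
    .filter((file) => ENV_ACCESS.test(file.source))
    .map((file) => `${file.path}: read env through the config layer`);
}

const files = [
  { path: "src/config/env.ts", source: "export const port = process.env.PORT;" },
  { path: "src/ui/Header.tsx", source: "const key = process.env.API_KEY;" },
  { path: "src/service/user.ts", source: "import { port } from '../config/env';" },
];

// Only the UI file reads process.env outside the config layer.
console.log(findEnvAccessViolations(files));
```

A text match like this would also flag comments and strings, which is one reason an AST-based runner is the better fit for the real gate.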

Note: In our 1-shot bench (n=3, ts-layered), the agent already followed visible seed patterns and produced 0 boundary violations without enforcement. Treat structural tests as a safety net for drift in long sessions, not as a happy-path differentiator.

4. Garbage collection over Friday cleanup, scaled to solo

OpenAI's golden-principles ritual, shrunk to top-3 fixes per week.

4b. Model choices are lane-scoped measurements

Use /model-profile and .harness/scripts/model-routing-report.mjs before changing model defaults. Cheap models belong on read-only exploration only when telemetry confirms the lane, and high-risk task contracts should not silently run on the explore model.
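The lane check can be pictured with a small sketch. The lanes are copied from the modelRouting block in the config; first-match resolution, the globToRegExp helper, and the shape of a run are assumptions made for illustration, not the reporter's actual logic.

```javascript
// Lanes copied from modelRouting in the config above. First-match resolution
// and the run shape { skill, riskTier, model } are assumptions.
const lanes = [
  { id: "review", expectedModel: "claude-sonnet-4-6", matchSkills: ["review-this-pr", "*-reviewer"] },
  { id: "high-risk", expectedModel: "claude-sonnet-4-6", riskTiers: ["high-risk"] },
  { id: "explore", expectedModel: "claude-haiku-4-5", matchSkills: ["inspect-*", "map-domain", "context-query"] },
];

// Convert a simple glob (only * is special) into an anchored regular expression.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`);
}

function resolveLane(run) {
  return lanes.find((lane) => {
    const skillHit = (lane.matchSkills ?? []).some((glob) => globToRegExp(glob).test(run.skill));
    const tierHit = (lane.riskTiers ?? []).includes(run.riskTier);
    return skillHit || tierHit;
  });
}

function checkRun(run) {
  const lane = resolveLane(run);
  if (!lane) return { lane: null, ok: true };
  return { lane: lane.id, ok: run.model === lane.expectedModel };
}

// A high-risk task that ran an inspect-* skill on the explore model is flagged.
console.log(checkRun({ skill: "inspect-routes", riskTier: "high-risk", model: "claude-haiku-4-5" }));
```

Because the high-risk lane is listed before the explore lane, a high-risk run resolves to the stricter lane even when its skill name also matches inspect-*.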

4c. Release readiness is one gate, not scattered memory

Run node .harness/scripts/harness-readiness.mjs --strict in generated projects (npm run harness:readiness is added for TypeScript projects) or npm run check:readiness in this repository before release. The readiness runner aggregates structural baseline debt, hook integrity, structural checks, skill contracts, skill example traces, review coverage, architecture fitness rules, eval task truth, adversarial red-team probes, failure records, review promotion state, orchestration contracts, task evidence, bypass audit, model-routing policy, policy pack validity, and runtime parity so “ready” has one auditable command.
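The aggregation can be sketched as a loop over readiness.gates. The gate objects below mirror the config; treating a failed non-required gate as a warning is an assumption, and the run function is injected so the sketch does not spawn real processes.

```javascript
// Gate objects mirror readiness.gates in the config above. Treating a failed
// non-required gate as a warning is an assumption about the runner's behavior.
function evaluateReadiness(gates, run) {
  const failed = [];
  const warnings = [];
  for (const gate of gates) {
    const exitCode = run(gate.command);
    if (exitCode === 0) continue;
    (gate.required ? failed : warnings).push(gate.id);
  }
  return { ready: failed.length === 0, failed, warnings };
}

const gates = [
  { id: "hook-integrity", command: "node .harness/scripts/check-hook-integrity.mjs", required: true },
  { id: "failure-records", command: "node .harness/scripts/check-failure-records.mjs", required: true },
  { id: "model-routing", command: "node .harness/scripts/model-routing-report.mjs --strict", required: false },
];

// Stub runner: pretend only the model-routing report exits non-zero.
const stubRun = (command) => (command.includes("model-routing") ? 1 : 0);

console.log(evaluateReadiness(gates, stubRun));
```

A real runner could wrap spawnSync from node:child_process with shell: true and return its status; the stub keeps the example free of side effects.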

Readiness can also compile its effective gate list from the strictness ladder. Generated installs set strictness.tier plus readiness.compileFromStrictness: true; existing installs can opt in with node .harness/scripts/strictness.mjs set strict. Tiers are starter, standard, strict, release, and team, so projects can move from warn-first adoption to release-grade enforcement without hand-editing every gate. Repo-local custom gates stay additive during compilation; set readiness.compileFromStrictness: false when a project needs the raw gate list unchanged.
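One way to picture tier compilation is a catalog in which each gate becomes required from some tier upward. The tier order comes from the text; the requiredFrom thresholds, the catalog contents, and the license-scan custom gate are invented purely for illustration and do not reflect the real ladder.

```javascript
// Tier order comes from the text. The per-gate requiredFrom thresholds are
// invented for illustration; the real ladder is not described in this README.
const TIERS = ["starter", "standard", "strict", "release", "team"];

const catalog = [
  { id: "structural", requiredFrom: "starter" },
  { id: "review-coverage", requiredFrom: "strict" },
  { id: "evidence-attestation", requiredFrom: "release" },
];

function compileGates(tier, customGates = []) {
  const level = TIERS.indexOf(tier);
  if (level === -1) throw new Error(`unknown strictness tier: ${tier}`);
  const compiled = catalog.map((gate) => ({
    id: gate.id,
    required: TIERS.indexOf(gate.requiredFrom) <= level,
  }));
  // Repo-local custom gates stay additive: they are appended, never dropped.
  const known = new Set(compiled.map((gate) => gate.id));
  return [...compiled, ...customGates.filter((gate) => !known.has(gate.id))];
}

// "license-scan" is a hypothetical repo-local gate.
console.log(compileGates("strict", [{ id: "license-scan", required: true }]));
```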

agent-harness-kit doctor runs a fast non-structural readiness preflight by default; use agent-harness-kit doctor --strict or set readiness.strictDoctor: true when the full structural release gate should be part of doctor itself.

When CI files are installed, .github/workflows/harness.yml and .github/workflows/eval-nightly.yml run the readiness gate and `git diff --check` before agent/eval work. The nightly eval workflow conditionally installs Node and Python dependencies so generated projects keep the same preflight shape across supported stacks.

Bypass escape hatches (AHK_ALLOW_BYPASS=1 and permission warn mode) append to .harness/bypass.log. The release gate runs node .harness/scripts/check-bypass-audit.mjs --strict; any bypass record must be reviewed in .harness/bypass-audit.json or covered by an approved, unexpired .harness/bypass-requests/*.json request before readiness passes.
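The coverage rule reads naturally as set membership. In the sketch below the field names (fingerprint, status, expiresAt, reviewed) are assumptions; the text only states that a bypass must be reviewed or covered by an approved, unexpired request.

```javascript
// Field names are assumptions. The rule itself follows the paragraph above:
// a bypass passes if it is reviewed or covered by an approved, unexpired request.
function findUncoveredBypasses(bypasses, audit, requests, now = new Date()) {
  const reviewed = new Set(audit.reviewed);
  const approved = new Set(
    requests
      .filter((req) => req.status === "approved" && new Date(req.expiresAt) > now)
      .map((req) => req.fingerprint),
  );
  return bypasses.filter((b) => !reviewed.has(b.fingerprint) && !approved.has(b.fingerprint));
}

const bypasses = [{ fingerprint: "a1" }, { fingerprint: "b2" }, { fingerprint: "c3" }];
const audit = { reviewed: ["a1"] };
const requests = [
  { fingerprint: "b2", status: "approved", expiresAt: "2026-06-01T00:00:00Z" },
  { fingerprint: "c3", status: "approved", expiresAt: "2026-01-01T00:00:00Z" },
];

// c3 is uncovered because its approval expired before the check date.
const uncovered = findUncoveredBypasses(bypasses, audit, requests, new Date("2026-03-01T00:00:00Z"));
console.log(uncovered.map((b) => b.fingerprint));
```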

npm run report:harness-noise ranks noisy rules from block telemetry, bypasses, false-positive acknowledgements, review latency, and loop-guard activations so maintainers can tune rules instead of disabling them.

Structural baselines are also part of readiness. Generated projects ship .harness/scripts/check-structural-baseline.mjs, which validates the baseline is an array of unique violation keys and blocks growth versus HEAD when a previous baseline exists. Non-empty baselines show up as dashboard debt in harness-report; fixes should shrink the file rather than append new entries.
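A minimal sketch of the two baseline rules follows. It interprets "growth" as any key that was absent from the previous baseline, which is an assumption; the violation key format used in the example is also hypothetical.

```javascript
// Sketch of the baseline rules named above. Interpreting "growth" as any key
// missing from the previous baseline is an assumption.
function checkBaseline(current, previous = null) {
  if (!Array.isArray(current)) return ["baseline must be an array"];
  const errors = [];
  if (current.some((key) => typeof key !== "string")) errors.push("violation keys must be strings");
  if (new Set(current).size !== current.length) errors.push("violation keys must be unique");
  if (previous) {
    const before = new Set(previous);
    const added = current.filter((key) => !before.has(key));
    if (added.length > 0) errors.push(`baseline grew: ${added.join(", ")}`);
  }
  return errors;
}

// Hypothetical key format: "<rule>:<file>".
const headBaseline = ["ui-imports-db:src/ui/List.tsx", "env-access:src/ui/Header.tsx"];
const shrunk = ["ui-imports-db:src/ui/List.tsx"];
const grown = [...headBaseline, "env-access:src/service/mail.ts"];

console.log(checkBaseline(shrunk, headBaseline)); // shrinking is allowed
console.log(checkBaseline(grown, headBaseline)); // a new key is reported
```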

Hook integrity is a release gate too. Generated projects ship .harness/scripts/check-hook-integrity.mjs, which verifies enabled Claude and Codex hook surfaces route to shipped scripts, those scripts still exist and are executable, Codex hook commands carry AHK_RUNTIME=codex, and Claude hooks are merged into .claude/settings.json because that is the file Claude Code actually reads.

Session/worktree isolation is also part of readiness. Generated projects ship .harness/scripts/check-session-isolation.mjs, which is idle when no active task is set, but fails strict checks when an active task contract can mutate source/config on a protected branch or outside a linked git worktree. High-risk and mutating task contracts should run from an agent/ or codex/ branch in a linked worktree so long-running agents cannot trample the primary checkout.

Use .harness/scripts/prepare-session-worktree.mjs --task=<id> to create that linked worktree from the task contract, write .harness/active-task.json and .harness/active-task.env, record the session manifest under the new worktree's .harness/sessions/ directory, and best-effort record the worktree in SQLite operational state. The checker also warns on stale session manifests and generated worktrees that lack a manifest, with cleanup commands pointing back to the preparer.

SessionEnd writes .harness/session-cleanup.jsonl on every teardown with cleanupStatus (not-needed, skipped, succeeded, or failed). Actual worktree removal stays opt-in via AHK_SESSION_CLEANUP=1 or sessionIsolation.cleanupOnSessionEnd: true.

Orchestration contracts are part of the same release gate. Store them under .harness/orchestration/contracts/<id>.json, then run node .harness/scripts/check-orchestration-contracts.mjs --strict or npm run harness:orchestration:check. The checker fails mutating lanes without a task contract and evidence requirement, reviewer requirements without matching reviewer lanes, unsafe output paths, and recorded run manifests that no longer match their contract.
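Two of those failure conditions can be sketched against a hypothetical contract shape. The fields used here (pattern, lanes, mutates, taskContract, evidenceRequired, requiresReviewer, role) are assumptions, as is counting lanes against maxAgents; the limits themselves are copied from the orchestration block in the config.

```javascript
// The contract shape is hypothetical. Limits mirror the orchestration block
// in the config above; counting lanes against maxAgents is an assumption.
const limits = {
  maxAgents: 6,
  allowedPatterns: ["pipeline", "fanout", "fanin", "expert-pool", "red-team", "supervisor"],
};

function checkContract(contract) {
  const errors = [];
  if (!limits.allowedPatterns.includes(contract.pattern)) errors.push(`pattern not allowed: ${contract.pattern}`);
  if (contract.lanes.length > limits.maxAgents) errors.push("too many agents");
  for (const lane of contract.lanes) {
    if (lane.mutates && !(lane.taskContract && lane.evidenceRequired)) {
      errors.push(`mutating lane ${lane.id} needs a task contract and evidence requirement`);
    }
  }
  const hasReviewer = contract.lanes.some((lane) => lane.role === "reviewer");
  if (contract.requiresReviewer && !hasReviewer) errors.push("reviewer required but no reviewer lane");
  return errors;
}

const contract = {
  pattern: "fanout",
  requiresReviewer: true,
  lanes: [
    { id: "impl", role: "worker", mutates: true, taskContract: "task-42", evidenceRequired: false },
    { id: "scan", role: "worker", mutates: false },
  ],
};

// Reports the mutating lane without evidence and the missing reviewer lane.
console.log(checkContract(contract));
```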

check-skill-contracts also validates runtime skill surfaces for installed projects: every enabled .claude/skills or .agents/skills surface must carry the registry skills for that runtime. Claude-only installs are not forced to ship Codex skills, and Codex-only installs are not forced to ship Claude skills. It also keeps the skill permission spine honest: skill.json is the source of truth, registry/policy entries must match it, frontmatter cannot grant outside it, and sensitive Bash lanes cannot use broad grants such as Bash(git*), Bash(gh), Bash(node), or Bash(*).
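Two of the permission-spine rules can be sketched as simple set checks. The broad-grant list is taken verbatim from the sentence above; treating it as an exact-match set, the sensitive flag, and the skill name release-notes are assumptions for illustration.

```javascript
// The deny list is copied from the sentence above. Exact-match comparison and
// the skill.json fields (sensitive, allow) are assumptions.
const BROAD_BASH_GRANTS = new Set(["Bash(git*)", "Bash(gh)", "Bash(node)", "Bash(*)"]);

function checkSkillPermissions(skillJson, frontmatterAllow) {
  const errors = [];
  const declared = new Set(skillJson.allow);
  // Frontmatter may only repeat grants that skill.json already declares.
  for (const grant of frontmatterAllow) {
    if (!declared.has(grant)) errors.push(`frontmatter grants outside skill.json: ${grant}`);
  }
  if (skillJson.sensitive) {
    for (const grant of skillJson.allow) {
      if (BROAD_BASH_GRANTS.has(grant)) errors.push(`broad grant in sensitive lane: ${grant}`);
    }
  }
  return errors;
}

// "release-notes" is a hypothetical skill name.
const skillJson = { name: "release-notes", sensitive: true, allow: ["Read", "Bash(git log)", "Bash(*)"] };
console.log(checkSkillPermissions(skillJson, ["Read", "Write"]));
```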

For the human-readable dashboard, run node .harness/scripts/harness-report.mjs --html or agent-harness-kit report --html. The output is static, self-contained HTML written to .harness/reports/harness-dashboard.html by default. Use node .harness/scripts/harness-report.mjs --json --fail-on=fail when CI needs a machine-readable pass | warn | fail health payload. Release gates run it with --review-promotion=fail, so actionable block or needs-human review artifacts must be promoted into failure records before readiness passes.

The report summarizes eval drift, task/evidence health, UI evidence quality, review decision health, failure-learning records, stale proposed failures, orchestration contracts/runs, session isolation manifests, failure prevention target mix, applied-but-unverified preventions, applied preventions missing a verification command, model-routing telemetry, bypass-audit review state, structural baseline debt, hook integrity, and skill permission health so policy gaps are visible before they become agent behavior.
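The pass | warn | fail payload suggests a worst-status rollup with a threshold. The sketch below assumes that ordering; the section names and the shape of the input are hypothetical, and the README only shows --fail-on=fail, so accepting warn as a threshold is an assumption as well.

```javascript
// Severity ordering pass < warn < fail is implied by the payload description.
// Section names and the input shape are hypothetical.
const SEVERITY = { pass: 0, warn: 1, fail: 2 };

function summarize(sections, failOn = "fail") {
  const worst = Object.values(sections).reduce(
    (acc, status) => (SEVERITY[status] > SEVERITY[acc] ? status : acc),
    "pass",
  );
  return { health: worst, exitCode: SEVERITY[worst] >= SEVERITY[failOn] ? 1 : 0 };
}

const sections = { evalDrift: "pass", bypassAudit: "warn", hookIntegrity: "pass" };

console.log(summarize(sections)); // a warning does not fail the default threshold
console.log(summarize(sections, "warn")); // a stricter threshold turns it into a failure
```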

For PR-facing output, run node .harness/scripts/pr-annotations.mjs --github-annotations. It reuses the readiness, task/evidence, bypass, architecture fitness, and runtime parity gates, then writes .harness/reports/pr-annotations.md plus SARIF so forked PRs still get a Markdown summary while GitHub Actions can show inline errors.

4d. Task contracts can narrow tool permissions

The PreToolUse guard now enforces both skill permissions and task-contract permissions when an active task is known. Deny rules always win, task allow lists restrict the skill further, and high-risk contracts must declare a non-wildcard permissions.allow list plus scope.allowedLayers; otherwise the runtime guard keeps the task read-only until the contract is narrowed. Generated projects also default taskContracts.requireActiveTaskForMutationTargets to true: Edit, Write, and MultiEdit against configured source roots or technical config files require an active task contract with explicit permissions.allow. Harness proof artifacts are exempt so agents can create the contract and evidence trail before touching implementation code.
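The precedence described above can be sketched as a small decision function. Exact-string tool matching, the READ_ONLY_TOOLS set, and the skill and task object shapes are assumptions; only the ordering (deny wins, task allow narrows skill allow, un-narrowed high-risk contracts stay read-only) follows the text.

```javascript
// Exact-string tool matching and the READ_ONLY_TOOLS set are assumptions.
// The precedence order follows the paragraph above.
const READ_ONLY_TOOLS = new Set(["Read", "Grep", "Glob"]);

function isNarrowed(task) {
  const allow = task.permissions?.allow ?? [];
  const layers = task.scope?.allowedLayers ?? [];
  return allow.length > 0 && !allow.includes("*") && layers.length > 0;
}

function decide(tool, skill, task) {
  // 1. Deny rules from either source always win.
  const denied = [...(skill.deny ?? []), ...(task.permissions?.deny ?? [])];
  if (denied.includes(tool)) return "deny";
  // 2. High-risk contracts that are not narrowed stay read-only.
  if (task.riskTier === "high-risk" && !isNarrowed(task)) {
    return READ_ONLY_TOOLS.has(tool) && skill.allow.includes(tool) ? "allow" : "deny";
  }
  // 3. Otherwise the task allow list narrows the skill allow list.
  const taskAllows = (task.permissions?.allow ?? []).includes(tool);
  return skill.allow.includes(tool) && taskAllows ? "allow" : "deny";
}

const skill = { allow: ["Read", "Edit", "Bash"], deny: [] };
const normalTask = {
  riskTier: "normal",
  permissions: { allow: ["Read", "Edit"], deny: ["Bash"] },
  scope: { allowedLayers: ["service"] },
};
const wildcardHighRisk = {
  riskTier: "high-risk",
  permissions: { allow: ["*"] },
  scope: { allowedLayers: ["service"] },
};

console.log(decide("Edit", skill, normalTask));
console.log(decide("Bash", skill, normalTask));
console.log(decide("Edit", skill, wildcardHighRisk));
```

In this sketch the wildcard high-risk contract can still read but cannot edit, which matches the "read-only until the contract is narrowed" behavior.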

4e. Session isolation makes autonomous runs auditable

Long-running agents should not work directly in the human's primary checkout. Use node .harness/scripts/check-session-isolation.mjs --strict or npm run harness:session:check in generated TypeScript projects to verify that the active task is on a non-protected agent branch and, when isolation is required, inside a linked git worktree. The default policy requires isolation for high-risk contracts and any contract whose permissions allow Edit, Write, MultiEdit, or apply_patch.
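The branch side of that check can be sketched with the values from the sessionIsolation config. How release/* is interpreted (here, as a prefix match on a trailing asterisk) and the input shape are assumptions; detecting a linked worktree is left as a boolean input rather than a git call.

```javascript
// Patterns come from sessionIsolation in the config above. Treating a
// trailing * as a prefix wildcard is an assumption.
const protectedBranches = ["main", "master", "develop", "release/*"];
const branchPrefixes = ["agent/", "codex/"];

function matchesPattern(branch, pattern) {
  return pattern.endsWith("*") ? branch.startsWith(pattern.slice(0, -1)) : branch === pattern;
}

function checkIsolation({ branch, inLinkedWorktree, requiresIsolation }) {
  const errors = [];
  if (protectedBranches.some((pattern) => matchesPattern(branch, pattern))) {
    errors.push(`protected branch: ${branch}`);
  }
  if (!branchPrefixes.some((prefix) => branch.startsWith(prefix))) {
    errors.push(`branch lacks an agent prefix: ${branch}`);
  }
  if (requiresIsolation && !inLinkedWorktree) errors.push("linked git worktree required");
  return errors;
}

// A mutating task on release/1.2 in the primary checkout fails all three checks.
console.log(checkIsolation({ branch: "release/1.2", inLinkedWorktree: false, requiresIsolation: true }));
console.log(checkIsolation({ branch: "agent/task-42", inLinkedWorktree: true, requiresIsolation: true }));
```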

To create the isolated workspace, run:

node .harness/scripts/prepare-session-worktree.mjs --task=<task-id>

The script refuses protected branch names, refuses nested worktrees inside the primary checkout, creates an agent/<task-id> branch by default, and prints the active-task env file needed by the checker and permission guard. Use node .harness/scripts/prepare-session-worktree.mjs --cleanup --task=<task-id> to remove a prepared worktree after the task is closed. SessionEnd records cleanup status on every teardown and only removes worktrees when AHK_SESSION_CLEANUP=1 or sessionIsolation.cleanupOnSessionEnd is enabled.

5. HTML for human deliverables, Markdown for agent files

Markdown is the right format for files an agent reads-and-edits (CLAUDE.md, SKILL.md, ADRs)
HTML is the right format for documents a HUMAN reads-and-decides (audit reports, analyses, plans, decision docs)

A long Markdown deliverable invites the human to scroll, miss the conclusion, and ask the agent to clarify — burning more tokens than the HTML markup costs. The /deliver-html skill writes self-contained HTML at repo root with a shared dark-theme CSS; the rule is documented in golden principle #11 and ADR-0002.

---

CLI Commands

agent-harness-kit init        # scaffold a repo (interactive)
agent-harness-kit init --yes  # accept all detected defaults
agent-harness-kit init --runtime codex       # render AGENTS.md + .codex/.agents + .harness/* only
agent-harness-kit init --runtime claude,codex # render both runtime instruction files
agent-harness-kit init --runtime antigravity  # render GEMINI.md + .agents + .harness/* only
agent-harness-kit init --runtime claude,antigravity # render both Claude and Antigravity surfaces
agent-harness-kit init --pack nextjs-saas # apply Next.js SaaS governance defaults
agent-harness-kit init --pack api-backend,python-data # apply multiple policy packs
agent-harness-kit upgrade     # non-destructive upgrade, preserves user edits
agent-harness-kit upgrade --plan # write .harness/upgrades/<runId>/plan.json without applying managed file changes
agent-harness-kit upgrade --apply --yes # explicitly apply the planned upgrade behavior
agent-harness-kit upgrade --explain <changeId> # explain one change from the latest plan artifact
agent-harness-kit doctor      # diagnose runtime + run harness preflight checkers
agent-harness-kit doctor --strict # also run the full readiness gate
agent-harness-kit doctor --runtime codex # diagnose the Codex surface
agent-harness-kit prompt install --runtime codex # print an AI-agent onboarding prompt
agent-harness-kit prompt upgrade --runtime claude,codex # prompt an agent to upgrade + verify memory/hooks
agent-harness-kit explain --bypass <fingerprint> # explain bypass audit coverage and repair path
agent-harness-kit context-query "How does task evidence validation work?" --scope scripts --lane normal --json
agent-harness-kit state doctor # diagnose SQLite state health
agent-harness-kit state migrate --dry-run # preview state schema migrations
agent-harness-kit state export --redact # export shareable redacted state JSON
agent-harness-kit state prune --older-than=30d --dry-run # preview retention cleanup
agent-harness-kit state explain <runId> # inspect matching trace/story/session rows
agent-harness-kit report --html # write .harness/reports/harness-dashboard.html
agent-harness-kit strictness set strict # migrate readiness gates to a stricter tier
agent-harness-kit pack validate --pack nextjs-saas # validate one policy pack manifest + rule examples
agent-harness-kit pack publish --pack nextjs-saas --dry-run --json # review a publishable pack bundle
agent-harness-kit --version
node .harness/scripts/harness-readiness.mjs --strict # generated project release/readiness gate
node .harness/scripts/harness-state.mjs init # initialize SQLite operational state
node .harness/scripts/harness-state.mjs trace-quality --strict # score trace/friction quality
node .harness/scripts/strictness.mjs plan --tier=release # preview compiled readiness gates
node .harness/scripts/runtime-parity-report.mjs # publish Claude/Codex parity scorecard
node .harness/scripts/runtime-conformance.mjs # publish runtime adapter conformance results
node .harness/scripts/check-trace-corpus.mjs # validate sanitized public trace corpus
node .harness/scripts/pr-annotations.mjs --github-annotations # write PR annotations, Markdown, and SARIF
node .harness/scripts/check-policy-packs.mjs # validate installed policy packs
node .harness/scripts/check-stable-schemas.mjs --strict # validate schema compatibility policy
node .harness/scripts/check-evidence-attestation.mjs --strict # validate replayable evidence sidecars
node .harness/scripts/policy-pack-publish.mjs --pack nextjs-saas --dry-run # create pack publish plan
node .harness/scripts/check-session-isolation.mjs --strict # active-task worktree isolation gate
node .harness/scripts/prepare-session-worktree.mjs --task=<id> # create isolated task worktree
node .harness/scripts/orchestration-contract-from-task.mjs <id> # derive workflow contract from task contract

context-query works without external services. Install srcwalk for stronger structural matches:

npm install -g srcwalk

Use --require-srcwalk only when your dev image or CI should fail closed if that binary is missing.

---

Token / Cost Expectations

A typical day with the default model split (Sonnet 4.6 main + Haiku 4.5 explore + Sonnet 4.6 reviewers) stays under ~$2 of API traffic for a single developer.

The eval-runner skill enforces the per-run budget set under budgets.perRunUsd in .harness/config.json (2.0 USD in the default config shown above).
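A budget guard of this kind can be sketched in a few lines. The limits mirror the budgets block in the config; how the eval-runner skill actually tracks spend is not described here, so the inputs and the function name are hypothetical.

```javascript
// Limits mirror budgets in .harness/config.json. The spend inputs are
// hypothetical; the README does not describe how spend is tracked.
const budgets = { perRunUsd: 2.0, perDayUsd: 10.0 };

function checkBudget(runCostUsd, spentTodayUsd) {
  if (runCostUsd > budgets.perRunUsd) return { ok: false, reason: "per-run budget exceeded" };
  if (spentTodayUsd + runCostUsd > budgets.perDayUsd) return { ok: false, reason: "per-day budget exceeded" };
  return { ok: true, reason: null };
}

console.log(checkBudget(1.5, 3.0)); // within both limits
console.log(checkBudget(2.5, 0)); // over the per-run limit
console.log(checkBudget(1.5, 9.0)); // would push the day over its limit
```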

OpenAI's harness processed ~1 billion tokens per day with 7 engineers. At solo scale, you're looking at ~10-50M tokens/day depending on session intensity.

---

Support Matrix

| Stack | Adapter | Preset | Dev command | Status |
| ----------------------- | ---------- | -------- | ----------------------------- | ------ |
| Next.js 14 + TypeScript | typescript | nextjs | npm run dev | v0.1 |
| Express | typescript | node-api | node ./src/server.js | v0.1 |
| Fastify | typescript | node-api | node ./src/server.js | v0.1 |
| NestJS | typescript | node-api | npm run start:dev | v0.1 |
| FastAPI | python | fastapi | uvicorn app.main:app --reload | v0.1 |
| Django | python | django | python manage.py runserver | v0.1 |
| Flask | python | flask | flask --app app run --debug | v0.1 |
| Go | go | none | go run ./cmd/... | v0.4 |
| Rust | rust | none | cargo run | v0.4 |
| Swift | swift | none | swift run | v0.7 |
| Kotlin | kotlin | none | ./gradlew run | v0.7 |

---

Dependency Footprint

Runtime dependencies are intentionally split by surface:

| Dependency | Why it is present | Impact if missing |
| ---------- | ----------------- | ----------------- |
| commander | CLI command routing (init, upgrade, doctor) | CLI cannot start |
| @inquirer/prompts | Interactive init/upgrade prompts | Interactive mode fails; --yes paths still avoid most prompts |
| @clack/prompts | Polyglot setup selector with cancel handling | Polyglot-root setup falls back poorly |
| react + ink | Rich polyglot onboarding renderer only, not the hot scaffold path | Smart setup loses the app map UI; core render/upgrade logic still does not depend on React state |
| handlebars | Template rendering | init/upgrade cannot render scaffold files |
| picocolors | CLI diagnostics | Output loses structured color but behavior is otherwise unchanged |

Optional peer dependencies are adapter tooling, not core runtime:

| Peer dependency | Used by | When missing |
| --------------- | ------- | ------------ |
| ts-morph | TypeScript structural runner | npm run harness:check fails with an explicit install message |
| eslint-plugin-boundaries | TypeScript ESLint defense-in-depth config | ESLint boundary config cannot run, but the ts-morph runner remains the primary gate |
| dependency-cruiser | Optional TypeScript dependency graph checks | Dependency-cruiser reports are unavailable; structural runner still enforces layer direction |

The TypeScript init path patches these peer tools into the target repo's devDependencies non-destructively.

---

CI: Real-Claude E2E Test (v0.7+)

The kit ships a CI job that spawns the real claude binary against a fresh init of itself and asserts that the SessionStart hook actually fires (with the expected additionalContext payload).

This catches the class of bug that v0.6's silent-no-op hooks fell into — every synthetic test passed for seven releases while not a single hook ever triggered inside a real Claude Code session.

When real-Claude auth is available, the release gate also runs a real /orchestrate --run E2E against a freshly initialized kit. That path verifies fanout/fanin runtime output, schema validation, transcript capture, telemetry export, session replay, cost attribution, and cache read/write bucket closure.

Behavior:

Locally: npm test runs the real-Claude E2E case. The machine must have the claude binary installed and authenticated through either local Claude Code auth or ANTHROPIC_API_KEY.
CI: the norma
