AI Review Arena
AI Review Arena is a CLI-first review harness for running multiple AI code reviewers, attaching local RAG evidence, comparing results with deterministic benchmarks, and exporting integration files for Claude Code, Codex, and Gemini CLI workflows.
The project does not depend on hosted provider APIs. Provider execution is intentionally CLI-based so it can run with the same tools a developer already uses locally.
What it does
- Runs Codex, Gemini, and Claude-oriented review flows through a typed Python runtime.
- Retrieves local evidence chunks with BM25, symbol, import, changed-file, and file-hint scoring.
- Attaches evidence chunks to findings so reports can show why a finding was produced.
- Benchmarks severity calibration, retrieval recall, harness ablations, and live provider samples.
- Emits structured harness events for tracing, cost, latency, tool calls, RAG, debate, aggregation, report generation, benchmark, and auto-fix phases.
- Exports OpenTelemetry-compatible JSONL or OTLP JSON, and can push OTLP JSON to an HTTP collector endpoint.
- Provides a policy-gated MCP runtime, including a JSON-RPC stdio server for configured MCP tools.
- Installs Claude Code hooks and agent files for project-local integration when explicitly enabled.
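The retrieval bullet above combines BM25 with several hint signals. As a rough illustration only, the sketch below scores documents with a standard Okapi BM25 formula and adds a flat bonus for changed files. The function names, the `k1`/`b` defaults, and the `changed_boost` weight are assumptions made for this example; the actual scoring in arena_runtime/rag_runtime.py may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter, digit, or underscore.
    return [t for t in re.split(r"[^a-z0-9_]+", text.lower()) if t]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (name -> text) against `query` with Okapi BM25."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def rank(query, docs, changed_files=(), changed_boost=0.5):
    # Hypothetical hint: files touched by the change under review get a flat bonus.
    scores = bm25_scores(query, docs)
    for name in changed_files:
        if name in scores:
            scores[name] += changed_boost
    return sorted(scores, key=lambda name: scores[name], reverse=True)
```

Calling `rank("credential handling", docs, changed_files=["auth.py"])` returns file names ordered from most to least relevant. Symbol, import, and file-hint signals could be added as further bonuses in the same way.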
Architecture
flowchart LR
A[User /<br/>Claude Code Hook] --> B[Runner<br/>state machine]
B --> C[RAG Engine<br/>BM25 + symbol/import]
C --> D{Provider Runner}
D --> E[Claude CLI]
D --> F[Codex CLI]
D --> G[Gemini CLI]
E --> H[Aggregator<br/>+ Debate]
F --> H
G --> H
H --> I[Report Gen<br/>+ Auto-fix]
I --> J[OTel Export<br/>JSONL/OTLP/HTTP]
B -.->|policy gate| K[MCP Runtime<br/>tool allowlist]
style A fill:#dbeafe,stroke:#2563eb
style D fill:#fef3c7,stroke:#d97706
style H fill:#dcfce7,stroke:#16a34a
style J fill:#f3e8ff,stroke:#9333ea
The runner is a typed Python state machine that owns phases, retries, timeouts, and error recovery. CLI providers are isolated adapters. RAG runs locally over the project tree (no external vector store). MCP tool calls pass through a policy gate with allowlist and side-effect approval.
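To make the state-machine idea concrete, here is a minimal sketch of a phase runner with per-phase retries. The phase names, the linear transition table, and `run_phases` are hypothetical and chosen for illustration; the real runner also handles timeouts and richer recovery, which this sketch omits.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Phase(Enum):
    RETRIEVE = "retrieve"
    REVIEW = "review"
    AGGREGATE = "aggregate"
    REPORT = "report"
    DONE = "done"
    FAILED = "failed"


# Assumed linear flow; a real runner would likely branch (debate, auto-fix).
NEXT_PHASE = {
    Phase.RETRIEVE: Phase.REVIEW,
    Phase.REVIEW: Phase.AGGREGATE,
    Phase.AGGREGATE: Phase.REPORT,
    Phase.REPORT: Phase.DONE,
}


def run_phases(
    handlers: Dict[Phase, Callable[[], None]], max_retries: int = 2
) -> Tuple[Phase, List[str]]:
    """Run each phase handler in order, retrying a failing phase up to max_retries times."""
    phase = Phase.RETRIEVE
    log: List[str] = []
    while phase in NEXT_PHASE:
        for attempt in range(1, max_retries + 2):
            try:
                handlers[phase]()
            except Exception as exc:
                log.append(f"{phase.value}:attempt{attempt}:error:{exc}")
            else:
                log.append(f"{phase.value}:attempt{attempt}:ok")
                phase = NEXT_PHASE[phase]
                break
        else:
            # Every attempt raised, so the run stops in a terminal failure state.
            phase = Phase.FAILED
    return phase, log
```

Keeping retries inside the runner, rather than inside each provider adapter, means an adapter only has to raise on failure and the recovery policy stays in one place.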
Runtime model
The modern runtime lives in arena_runtime/ and is entered through:
python3 scripts/arena-runtime.py <command> [args]
Shell is not used as the orchestration layer. Remaining shell files are tests or developer support scripts, not the review runtime.
Core areas:
- arena_runtime/entrypoint.py: command dispatch.
- arena_runtime/provider_runner.py: CLI provider adapters.
- arena_runtime/rag_runtime.py: evidence retrieval.
- arena_runtime/benchmarking.py: deterministic and live benchmark commands.
- arena_runtime/harness.py: event bus and OTel export/push.
- arena_runtime/mcp_runtime.py: MCP tool-call wrapper and stdio server.
- arena_runtime/exporters.py: Claude, Codex, and Gemini integration exports.
- config/default-config.json: default model, policy, RAG, benchmark, and MCP settings.
Quick start
Install the package from a checkout, then confirm that the arena command is available:
python3 -m pip install -e .
arena validate-config config/default-config.json
Validate the local configuration:
python3 scripts/arena-runtime.py validate-config config/default-config.json
Run CLI diagnostics without calling live models:
arena cli-diagnostics --config config/default-config.json
arena provider-smoke --models codex,gemini,claude --timeout 30
Index and retrieve local RAG evidence:
python3 scripts/arena-runtime.py rag-indexer . --config config/default-config.json
python3 scripts/arena-runtime.py rag-evidence . security "credential handling" --config config/default-config.json --top-k 5
Run deterministic benchmark paths:
python3 scripts/arena-runtime.py retrieval-benchmark --config config/default-config.json --max-cases 3
python3 scripts/arena-runtime.py benchmark-harness-ablation --config config/default-config.json --max-cases 3
Run a bounded live provider sample when the CLIs are installed and authenticated:
arena benchmark-models --category security --models codex,gemini --live --smoke --timeout 90 --preflight-timeout 30 --require-live-success
Product docs
- Installation
- First review in 5 minutes
- Provider setup
- Security model
- Troubleshooting
- Dashboard
- Release process
Claude Code integration
Claude Code integration is project-local and explicit. Install the hook and agent files into the current project with:
python3 scripts/arena-runtime.py install-claude-integration --project-root .
This writes .claude/settings.json and .claude/agents/*.md. After that, Claude Code can trigger Arena through the configured hooks for this project. It is not a global background service and it does not intercept Claude Code sessions unless the project hook configuration is present and loaded by Claude Code.
Agent file policy:
- .claude/agents/ is the Claude Code runtime install target.
- .codex/agents/ is the Codex-oriented runtime install target.
- agents/ is treated as shared or historical reviewer material only when explicitly referenced by exporters or docs.
- Avoid hand-editing generated files in .claude/agents/ if the same change should apply to future exports. Update the exporter source or shared reviewer material instead.
Codex and Gemini integration
Export integration files for supported CLI ecosystems:
python3 scripts/arena-runtime.py export-extension all --output-dir ./dist/extensions
Exports are configuration files and command definitions. Actual review execution still depends on the local Codex, Gemini, or Claude tooling being installed, authenticated, and available in PATH.
MCP runtime
Direct policy-gated tool call:
python3 scripts/arena-runtime.py mcp-tool-call --config config/default-config.json --server local --tool echo --input-json '{"ok":true}'
JSON-RPC stdio server mode:
python3 scripts/arena-runtime.py mcp-stdio-server --config config/default-config.json
The MCP runtime only exposes tools configured under mcp.servers and constrained by mcp.allowed_tools / mcp.side_effect_tools. Side-effect tools require explicit approval.
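The gate described above can be pictured as a small decision function. This is a sketch under assumptions: it treats `mcp.allowed_tools` and `mcp.side_effect_tools` as plain lists of tool names, and `check_tool_call` and the `write_file` tool are hypothetical names, not part of the Arena API.

```python
def check_tool_call(config, tool, approved=False):
    """Return 'allow', 'needs_approval', or 'deny' for a requested MCP tool call."""
    mcp = config.get("mcp", {})
    # Anything not explicitly allowlisted is rejected outright.
    if tool not in mcp.get("allowed_tools", []):
        return "deny"
    # Allowlisted tools with side effects still need an explicit approval flag.
    if tool in mcp.get("side_effect_tools", []) and not approved:
        return "needs_approval"
    return "allow"


example_config = {
    "mcp": {
        "allowed_tools": ["echo", "write_file"],  # write_file is a made-up tool
        "side_effect_tools": ["write_file"],
    }
}
print(check_tool_call(example_config, "echo"))        # allow
print(check_tool_call(example_config, "write_file"))  # needs_approval
print(check_tool_call(example_config, "shell"))       # deny
```

Denying by default keeps an empty or missing `mcp` section safe: no tool is callable until it is listed.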
Observability
Emit a harness event:
python3 scripts/arena-runtime.py harness-event review.started --phase review --field provider=codex
Export events as JSONL or OTLP JSON:
python3 scripts/arena-runtime.py otel-export --run-dir cache/runs/<run-id> --output trace.jsonl
python3 scripts/arena-runtime.py otel-export --run-dir cache/runs/<run-id> --format otlp-json --output trace.otlp.json
Push OTLP JSON to a collector endpoint:
python3 scripts/arena-runtime.py otel-push --run-dir cache/runs/<run-id> --endpoint http://127.0.0.1:4318/v1/traces
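For readers unfamiliar with OTLP JSON, the sketch below shows roughly what such a payload looks like and how it could be posted to a collector. The input event shape (`name`, `start_ns`, `end_ns`, `fields`), the service name, and the scope name are assumptions for illustration; the files Arena actually writes may be organized differently.

```python
import json
import secrets
import urllib.request


def to_otlp_json(events, service_name="ai-review-arena"):
    """Wrap simple event dicts in the OTLP/JSON trace envelope (one span per event)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters, shared by all spans
    spans = []
    for event in events:
        spans.append({
            "traceId": trace_id,
            "spanId": secrets.token_hex(8),  # 16 hex characters
            "name": event["name"],
            "kind": 1,  # SPAN_KIND_INTERNAL
            "startTimeUnixNano": str(event["start_ns"]),
            "endTimeUnixNano": str(event["end_ns"]),
            "attributes": [
                {"key": key, "value": {"stringValue": str(value)}}
                for key, value in event.get("fields", {}).items()
            ],
        })
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}}
            ]},
            "scopeSpans": [{"scope": {"name": "arena.harness"}, "spans": spans}],
        }]
    }


def build_push_request(payload, endpoint="http://127.0.0.1:4318/v1/traces"):
    # Builds the POST request without sending it.
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen(request, timeout=10)` would send it; a collector listening for OTLP over HTTP on port 4318 is assumed.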
Benchmark Results (measured)
Harness ablation result from benchmark-harness-ablation (3 cases, 2026-05-09):
| Scenario | RAG | Debate | Harness Score |
|----------|-----|--------|---------------|
| review_only (Solo) | – | – | 0.55 |
| review_plus_rag | ✅ | – | 0.911 |
| review_plus_debate | – | ✅ | 0.63 |
| full_harness | ✅ | ✅ | 0.991 |
xychart-beta
title "Harness Ablation — Scenario Comparison"
x-axis ["Solo", "+RAG", "+Debate", "Full"]
y-axis "Score" 0 --> 1
bar [0.55, 0.911, 0.63, 0.991]
RAG is the largest single contributor: adding it alone raises the score from 0.55 to 0.911 (+0.361), while debate alone adds +0.08. The full harness reaches 0.991, a gain of +0.441 over Solo review.
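The gains over the Solo baseline can be recomputed directly from the four reported scores. The snippet is only arithmetic on the table values; the variable names are illustrative.

```python
# Scores copied from the ablation table (3 cases).
scores = {
    "review_only": 0.55,
    "review_plus_rag": 0.911,
    "review_plus_debate": 0.63,
    "full_harness": 0.991,
}

baseline = scores["review_only"]
# Gain of each scenario over the Solo baseline, rounded to three decimals.
deltas = {
    name: round(value - baseline, 3)
    for name, value in scores.items()
    if name != "review_only"
}
print(deltas)
# {'review_plus_rag': 0.361, 'review_plus_debate': 0.08, 'full_harness': 0.441}
```

In this three-case sample the RAG and debate gains happen to add up to the full-harness gain (0.361 + 0.08 = 0.441). With so few cases, that should not be read as a general property.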
Retrieval benchmark (BM25 + symbol/import/changed-file/file-hint scoring):
| Test ID | Category | Recall@k | MRR | Hits |
|---------|----------|----------|-----|------|
| retrieval-extension-01 | architecture | 1.000 | 1.000 | 1/1 |
| retrieval-harness-01 | architecture | 1.000 | 1.000 | 1/1 |
| retrieval-security-01 | security | 1.000 | 0.625 | 2/2 |
| Aggregate | | 1.000 | 0.875 | 4/4 |
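As a reading aid for the Recall@k and MRR columns, here is one plausible way to compute them. The source does not state the benchmark's exact definitions, so the helper names and the choice to average reciprocal ranks over all expected files are assumptions. Under that assumption, a value of 0.625 with two hits is consistent with expected files found at ranks 1 and 4; the file names below are invented.

```python
def recall_at_k(ranked, expected, k):
    """Fraction of expected files that appear in the top-k results."""
    top = ranked[:k]
    hits = sum(1 for path in expected if path in top)
    return hits / len(expected)


def mean_reciprocal_rank(ranked, expected):
    """Average of 1/rank over expected files; files that were not retrieved count as 0."""
    total = 0.0
    for path in expected:
        if path in ranked:
            total += 1.0 / (ranked.index(path) + 1)
    return total / len(expected)


# Invented ranking: the two expected files sit at ranks 1 and 4.
ranked = ["policy.py", "utils.py", "cli.py", "secrets.py", "docs.md"]
expected = ["policy.py", "secrets.py"]

print(recall_at_k(ranked, expected, k=5))      # 1.0
print(mean_reciprocal_rank(ranked, expected))  # 0.625

# The aggregate MRR in the table is the mean of the three per-case values.
print((1.0 + 1.0 + 0.625) / 3)                 # 0.875
```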
Reproduce locally:
python3 scripts/arena-runtime.py retrieval-benchmark --config config/default-config.json --max-cases 10
python3 scripts/arena-runtime.py benchmark-harness-ablation --config config/default-config.json --max-cases 10
Benchmarking strategy
Arena separates deterministic harness validation from live model validation.
- Deterministic tests verify schema handling, RAG attachment, aggregation, reports, retrieval fixtures, OTel export, and MCP boundaries.
- Retrieval benchmarks measure expected-file recall and mean reciprocal rank against dedicated retrieval fixtures.
- Harness ablations compare review behavior with and without RAG, boundary checks, evidence attachment, and aggregation paths.
- Live provider samples are intentionally bounded by timeout, case count, and model list because local CLIs can block on auth, updates, network, or interactive prompts.
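The bounding described in the last bullet can be sketched with the standard library alone. `run_bounded` is a hypothetical helper, not Arena's provider adapter. It shows the general pattern: a hard timeout, no inherited stdin so an interactive prompt cannot block the run, and a status value instead of an exception.

```python
import subprocess
import sys


def run_bounded(argv, timeout_s):
    """Run a CLI command with a hard timeout and no interactive input."""
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # prompts get EOF instead of waiting forever
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": ""}
    except FileNotFoundError:
        return {"status": "missing", "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout}


# Stand-in commands; a real run would invoke a locally installed provider CLI.
print(run_bounded([sys.executable, "-c", "print('ok')"], timeout_s=10)["status"])
print(run_bounded([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)["status"])
```

Returning a status rather than raising lets a benchmark loop record a timed-out or missing provider as a result and move on to the next case.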
Verification
Run targeted runtime checks:
python3 -m compileall arena_runtime scripts/arena-runtime.py
bash tests/unit/test-harness-rag-runtime.sh
bash tests/unit/test-generate-report.sh
python3 scripts/arena-runtime.py retrieval-benchmark --config config/default-config.json --max-cases 3
Run the full suite:
bash tests/run-tests.sh --all
Security boundaries
Arena treats external model output, RAG chunks, and MCP tool inputs as untrusted data. Runtime policy covers:
- prompt-injection detection for retrieved context;
- MCP tool allowlists;
- side-effect approval gates;
- restricted subprocess environments for configured tool calls;
- evidence metadata in findings and reports.
These controls reduce risk but do not replace human approval for destructive or high-impact actions.
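One of the controls listed above, the restricted subprocess environment, can be illustrated as follows. The allowlist contents and the `restricted_env` and `run_tool` names are assumptions for this sketch; the variables Arena actually passes through are not specified here.

```python
import os
import subprocess

# Hypothetical allowlist: only these variables are passed to the child process.
SAFE_ENV_KEYS = ("PATH", "HOME", "LANG")


def restricted_env(source=None, allowed=SAFE_ENV_KEYS):
    """Copy only allowlisted variables, so tokens and keys are not inherited."""
    source = os.environ if source is None else source
    return {key: source[key] for key in allowed if key in source}


def run_tool(argv, timeout_s=30):
    # The child sees the reduced environment, gets no stdin, and is time-limited.
    return subprocess.run(
        argv,
        env=restricted_env(),
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


print(restricted_env({"PATH": "/usr/bin", "API_TOKEN": "secret"}))
# {'PATH': '/usr/bin'}
```

An allowlist is used instead of a blocklist so that a newly introduced secret variable is excluded by default.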




