Claude Code · Community skill
Hub: Eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
What this skill covers
This page keeps a stable Remote OpenClaw URL for the upstream skill while preserving the original source content below. The shell stays consistent; the body can vary as much as the upstream SKILL.md or README does.
Source files and registry paths
- **Source path:** `engineering/agenthub/skills/eval`
- **Entry file:** `engineering/agenthub/skills/eval/SKILL.md`
- **Repository:** `alirezarezvani/claude-skills`
- **Format:** `markdown-skill`
Original source content
# /hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
## Usage
```
/hub:eval                   # Eval latest session using configured criteria
/hub:eval 20260317-143022   # Eval specific session
/hub:eval --judge           # Force LLM judge mode (ignore metric config)
```
## What It Does
### Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```
Output:
```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
```
### LLM Judge Mode (no eval command, or --judge flag)
For each agent:
1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
- **Correctness** — Does it solve the task?
- **Simplicity** — Fewer lines changed is better (when equal correctness)
- **Quality** — Clean execution, good structure, no regressions
Present rankings with justification.
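Steps 1 and 2 are straightforward to script. A minimal sketch of an input-gathering helper, assuming the agent branch names are already known; `collect_judge_inputs` is a hypothetical name, and only the diff command and results path come from the steps above:

```python
import subprocess
from pathlib import Path

def collect_judge_inputs(base_branch: str, agent_branches: list[str]) -> dict:
    """Gather each agent's diff and result post for the LLM judge to compare."""
    inputs = {}
    for i, branch in enumerate(agent_branches, start=1):
        # Step 1: the diff between the base branch and the agent branch.
        diff = subprocess.run(
            ["git", "diff", f"{base_branch}...{branch}"],
            capture_output=True, text=True, check=True,
        ).stdout
        # Step 2: the agent's result post from the message board.
        post = Path(f".agenthub/board/results/agent-{i}-result.md")
        inputs[branch] = {
            "diff": diff,
            "result_post": post.read_text() if post.exists() else "",
        }
    return inputs
```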
Example LLM judge output for a content task:
```
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```
### Hybrid Mode
1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings
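The 10% tie-break in step 2 reduces to one comparison. A hedged sketch, assuming rankings arrive as `(agent, metric)` pairs sorted best-first:

```python
def needs_judge_tiebreak(ranked: list[tuple[str, float]],
                         threshold: float = 0.10) -> bool:
    """True if the runner-up's metric is within 10% of the leader's."""
    if len(ranked) < 2:
        return False
    best, second = ranked[0][1], ranked[1][1]
    return abs(second - best) <= threshold * abs(best)
```

For the metric table above, agent-1's 165ms is about 16% slower than agent-2's 142ms, so the judge would not be invoked there.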
## After Eval
1. Update session state:
```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```
2. Tell the user:
- Ranked results with winner highlighted
- Next step: `/hub:merge` to merge the winner
- Or `/hub:merge {session-id} --agent {winner}` to be explicit

Related Claude Code skills
All from the `alirezarezvani/claude-skills` repository:

- **AgentHub — Multi Agent Collaboration**: Spawn N parallel AI agents that compete on the same task. Each agent works in an isolated git worktree. The coordinator evaluates results and merges the winner.
- **Hub: Board — Message Board**: Interface for the AgentHub message board. Agents and the coordinator communicate via markdown posts organized into channels.
- **Hub: Init — Create New Session**: Initialize an AgentHub collaboration session. Creates the .agenthub/ directory structure, generates a session ID, and configures evaluation criteria.
- **Hub: Merge — Merge Winner**: Merge the best agent's branch into the base branch, archive losing branches via git tags, and clean up worktrees.
- **Hub: Run — One Shot Lifecycle**: Run the full AgentHub lifecycle in one command: initialize, capture baseline, spawn agents, evaluate results, and merge the winner.
- **Hub: Spawn — Launch Parallel Agents**: Spawn N subagents that work on the same task in parallel, each in an isolated git worktree.