Claude Code · Community skill

Hub: Eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

alirezarezvani/claude-skills · Installable skill

What this skill covers

This page keeps a stable Remote OpenClaw URL for the upstream skill while preserving the original source content below. The page shell stays consistent, while the body tracks whatever the upstream SKILL.md or README contains.

Source files and registry paths

Source path: engineering/agenthub/skills/eval

Entry file: engineering/agenthub/skills/eval/SKILL.md

Repository: alirezarezvani/claude-skills

Format: markdown-skill

Original source content

# /hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

## Usage

```
/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)
```

## What It Does

### Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:
```
RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
```
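As a rough illustration of what the metric ranking step could do, here is a minimal Python sketch. It assumes worktrees live under `.agenthub/worktrees/{session-id}/agent-{i}` and that the eval command prints a single numeric value to stdout; the actual `result_ranker.py` may use different paths, parse richer output, and compute deltas against the base branch.

```python
import subprocess
from pathlib import Path


def rank_agents(session_id: str, eval_cmd: str, direction: str = "lower"):
    """Run eval_cmd in each agent worktree and sort agents by the printed metric.

    Assumes worktrees live under .agenthub/worktrees/<session-id>/ and that
    eval_cmd writes one numeric value to stdout (hypothetical convention).
    """
    results = []
    for worktree in sorted(Path(".agenthub/worktrees", session_id).glob("agent-*")):
        out = subprocess.run(
            eval_cmd, shell=True, cwd=worktree,
            capture_output=True, text=True, check=True,
        )
        results.append((worktree.name, float(out.stdout.strip())))

    # Lower-is-better metrics (e.g. latency) sort ascending; higher-is-better descending.
    results.sort(key=lambda r: r[1], reverse=(direction == "higher"))
    return results
```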

### LLM Judge Mode (no eval command, or --judge flag)

For each agent:
1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness** — Does it solve the task?
   - **Simplicity** — Fewer lines changed is better (when correctness is equal)
   - **Quality** — Clean execution, good structure, no regressions

Present rankings with justification.
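
The per-agent inputs for the judge could be gathered roughly as follows. The diff command and the `.agenthub/board/results/` path follow the conventions above; `collect_judge_inputs` and the `agent_branches` mapping are hypothetical names used only for this sketch.

```python
import subprocess
from pathlib import Path


def collect_judge_inputs(base_branch: str, agent_branches: dict[str, str]):
    """Gather the diff and result post for each agent so an LLM judge can
    rank them on correctness, simplicity, and quality.

    agent_branches maps an agent name (e.g. "agent-1") to its branch name.
    """
    inputs = {}
    for agent, branch in agent_branches.items():
        diff = subprocess.run(
            ["git", "diff", f"{base_branch}...{branch}"],
            capture_output=True, text=True, check=True,
        ).stdout
        result_post = Path(f".agenthub/board/results/{agent}-result.md")
        inputs[agent] = {
            "diff": diff,
            "result": result_post.read_text() if result_post.exists() else "",
        }
    return inputs
```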

Example LLM judge output for a content task:
```
RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```

### Hybrid Mode

1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings
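
The 10% tie-break check in step 2 could look like the sketch below, reusing the sorted metric results from the metric ranking step; `needs_judge_tiebreak` is a hypothetical helper, not part of the skill's scripts.

```python
def needs_judge_tiebreak(ranked: list[tuple[str, float]], tolerance: float = 0.10) -> bool:
    """Return True when the top two metric results are within `tolerance`
    (10% by default) of each other, so the LLM judge should break the tie.

    `ranked` is a best-first list of (agent, metric) pairs.
    """
    if len(ranked) < 2:
        return False
    best, runner_up = ranked[0][1], ranked[1][1]
    if best == 0:
        return runner_up == 0
    return abs(runner_up - best) / abs(best) <= tolerance
```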

## After Eval

1. Update session state:
```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```

2. Tell the user:
   - Ranked results with winner highlighted
   - Next step: `/hub:merge` to merge the winner
   - Or `/hub:merge {session-id} --agent {winner}` to be explicit