Skill Evaluator

Audit all skills in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality. Produces a scored report with effort-level recommendations for each skill. Use when onboarding to a new project, reviewing skill quality before shipping, or adding effort fields to an existing skill library.

Discover all skills in the project, score them across 6 criteria, and infer the appropriate `effort` level based on content analysis.

When to Use

  • New project: run once to establish baseline quality
  • Before committing a skill to a team repo
  • After bulk-importing skills from another project
  • When adding `effort` fields for the first time (v2.1.80+)

What Gets Audited

All `SKILL.md` files and flat `.md` files found in:

  • `.claude/skills/**`
  • `~/.claude/skills/**` (if requested)
  • Any path passed as argument: `/eval-skills ./my-skills-dir`

---

Scoring Criteria (14 pts per skill)

| # | Criterion | Max | What is checked |
|---|-----------|-----|-----------------|
| 1 | **name** | 1 | Present, lowercase, hyphens only, matches directory name |
| 2 | **description** | 2 | Present + has "Use when" / "when to" / trigger phrasing |
| 3 | **allowed-tools** | 2 | Present + not overly broad (e.g. no unscoped Bash when the skill only reads) |
| 4 | **effort** | 3 | Present (1 pt) + appropriate for content (2 pts, based on inference) |
| 5 | **content structure** | 4 | Has Purpose/When section (1), has examples/usage (1), has clear workflow (1), no placeholder text (1) |
| 6 | **bonus** | +2 | argument-hint present (1), version/author metadata (1) |

> **Note**: `tags` is NOT an officially supported frontmatter field in Claude Code. It is ignored by the runtime. Do not include it or score it as a quality criterion.
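
For concreteness, here is a hypothetical frontmatter block that would score full marks on criteria 1–4 plus both bonus points. The skill name, description, and tool scoping are illustrative assumptions, not prescriptions:

---
name: dependency-audit
description: Audit project dependencies for outdated or vulnerable packages. Use when preparing a release or after adding new dependencies.
allowed-tools: Read, Grep, Glob, Bash(npm audit:*)
effort: medium
argument-hint: [path to package manifest]
version: 1.0.0
---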

**Thresholds:**

  • ✅ Good: ≥11/14 (≈79% and up)
  • ⚠️ Needs work: 8–10/14 (≈57–71%)
  • ❌ Fix: <8/14 (below ≈57%)

---

Effort Level Inference Engine

For each skill, analyze description + content and classify using these signals:

`low` — Mechanical execution, no design decisions

Signals:

  • Verbs: commit, push, sync, scaffold, generate (template-based), format, rename, bump, wrap, convert
  • No reasoning required: sequential steps, template instantiation, data fetching
  • allowed-tools: Bash only, or Read-only
  • No sub-agents spawned
  • Short workflow (<5 steps)

Examples: `/commit`, `/release-notes`, `/scaffold`, `/sync`, `/format`

`medium` — Analysis with bounded scope, categorization

Signals:

  • Verbs: review, triage, analyze, categorize, suggest, evaluate (single file or bounded scope)
  • Requires pattern recognition but not architectural reasoning
  • allowed-tools: Read + Grep + Bash combination
  • May spawn 1-2 sub-agents but with predefined scope
  • Produces structured output (tables, categorized lists)

Examples: `/code-review` (single PR), `/issue-triage`, `/dependency-audit`, `/test-coverage`

`high` — Design decisions, adversarial reasoning, cross-system analysis

Signals:

  • Verbs: architect, redesign, threat-model, audit (security), orchestrate (multi-agent), score, assess trade-offs
  • Requires reasoning about edge cases, attack vectors, or system-wide implications
  • allowed-tools: broad access (Read + Write + Bash + external tools)
  • Spawns multiple sub-agents or uses parallel execution
  • Produces analysis with explicit uncertainty or trade-off sections
  • Keywords in content: "security", "architecture", "adversarial", "pipeline", "threat", "design decision"

Examples: `/security-audit`, `/architecture-review`, `/cyber-defense`, `/eval-agents`
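
A minimal bash sketch of how these keyword signals could be checked mechanically, strongest signals first. The keyword lists here are abbreviated and purely illustrative:

# Sketch: heuristic effort classification from keyword signals (abbreviated lists)
infer_effort() {
  local file="$1"
  if grep -qiE 'security|adversarial|threat|architect|trade-?off|orchestrat' "$file"; then
    echo "high"
  elif grep -qiE 'review|triage|analyz|categoriz|evaluat|suggest' "$file"; then
    echo "medium"
  else
    echo "low"
  fi
}

infer_effort .claude/skills/my-skill/SKILL.md   # hypothetical path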

Mismatch flag

If a skill already has `effort:` set but the inferred level differs, flag it:

> ⚠️ Effort mismatch: declared `low`, inferred `high` — skill spawns 4 sub-agents and performs security analysis

---

Execution Instructions

Step 1 — Discovery

# Find all SKILL.md files
find .claude/skills -name "SKILL.md" 2>/dev/null

# Find flat skill files
find .claude/skills -maxdepth 1 -name "*.md" ! -name "README*" 2>/dev/null

# If argument provided, use that path instead
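
The two lists can be merged and de-duplicated before auditing; a sketch, assuming bash and paths without newlines:

{ find .claude/skills -name "SKILL.md" 2>/dev/null
  find .claude/skills -maxdepth 1 -name "*.md" ! -name "README*" 2>/dev/null
} | sort -u | while read -r f; do
  echo "auditing: $f"   # parse and score each file (Steps 2–3 below)
done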

Step 2 — Parse each skill

For each skill file found:

  • Read the full file
  • Extract the YAML frontmatter (between the first `---` and the second `---`); see the sketch after this list
  • Parse: name, description, allowed-tools, effort, argument-hint, version
  • Note presence/absence of each field
  • Read the body content for structure analysis
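
A sketch of the extraction with awk, assuming well-formed `---` delimiters and simple single-line `key: value` fields (the file path is hypothetical):

skill_file=".claude/skills/my-skill/SKILL.md"   # hypothetical path

# Print the frontmatter block: lines between the first two --- delimiters
awk '/^---$/{n++; next} n==1{print} n>1{exit}' "$skill_file"

# Pull a single field, e.g. effort (naive: ignores multi-line values)
awk '/^---$/{n++; next} n==1 && /^effort:/{sub(/^effort:[ \t]*/, ""); print; exit}' "$skill_file"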

Step 3 — Score and infer

Apply the scoring criteria above to each skill:

  • Check frontmatter fields (a bash sketch of the name and description checks follows this list)
  • Evaluate description quality (does it answer "when to use"? is it under 1024 chars?)
  • Evaluate allowed-tools scope (is Bash used when only Read would suffice? are tools scoped with wildcards when possible?)
  • Infer effort level from content analysis
  • Compare inferred vs declared effort (if set)
  • Evaluate content structure (scan for "When to Use", "Purpose", "Example", "Workflow" sections)
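
As a sketch of the name and description checks in bash, reusing the field extraction from Step 2 (path and helper are illustrative):

skill_file=".claude/skills/my-skill/SKILL.md"   # hypothetical path
get_field() { awk -v k="$2" '/^---$/{n++; next} n==1 && $0 ~ "^"k":"{sub("^"k":[ \t]*", ""); print; exit}' "$1"; }

# name: 1 pt if lowercase/hyphens only and matching the directory name
name=$(get_field "$skill_file" name)
dir=$(basename "$(dirname "$skill_file")")
name_score=0
[[ "$name" =~ ^[a-z][a-z0-9-]*$ && "$name" == "$dir" ]] && name_score=1

# description: 1 pt for presence, 1 more for trigger phrasing; warn past the 1024-char limit
desc=$(get_field "$skill_file" description)
desc_score=0
[[ -n "$desc" ]] && desc_score=1
grep -qiE 'use when|when to' <<<"$desc" && desc_score=2
[[ ${#desc} -le 1024 ]] || echo "warn: description exceeds 1024 chars"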

Step 4 — Output

Produce a structured report:

# Skills Audit — [project name or path]
Date: [today] | Scanned: N skills

## Summary
| Status | Count |
|--------|-------|
| ✅ Good (≥11/14) | N |
| ⚠️ Needs work (8–10/14) | N |
| ❌ Fix (<8/14) | N |

**Effort coverage**: N/N skills have effort field set

---

## Per-Skill Results

### [skill-name] — [score]/14 [✅/⚠️/❌]

| Criterion | Score | Notes |
|-----------|-------|-------|
| name | ✅ 1/1 | — |
| description | ⚠️ 1/2 | Missing "Use when" phrasing |
| allowed-tools | ✅ 2/2 | Well-scoped |
| effort | ❌ 0/3 | Missing — Recommended: high |
| content structure | ⚠️ 2/4 | No examples section |
| bonus | 0/+2 | No argument-hint or version metadata |

**Effort inference**: `high` — skill performs security analysis with adversarial reasoning
  Signals: "threat", "attack surface", "vulnerability scoring" in content; spawns 4 agents

**Priority fixes** (ordered by impact):
1. Add `effort: high` to frontmatter
2. Add "Use when" to description
3. Add a concrete usage example section

---

After all skills: print a **Fix Summary** — all missing effort fields with recommended values, ready to copy-paste.

---

Fix Summary Format

At the end, print a ready-to-use patch block for all missing/mismatched effort fields:

## Recommended effort fields (copy-paste ready)

skill-name-1: effort: low     # mechanical scaffold
skill-name-2: effort: high    # security analysis, spawns agents
skill-name-3: effort: medium  # code review, bounded scope

And a 1-line count: `N skills need effort field · N mismatches · N missing allowed-tools`
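
Where a skill's frontmatter already has a `name:` line, each recommended field can be applied with a one-liner; a sketch using GNU sed (path and value hypothetical):

# Hypothetical: add the recommended effort field right after the name: line (GNU sed)
sed -i '/^name:/a effort: high' .claude/skills/security-audit/SKILL.md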
