Claude Skill

Audit Agents/Skills/Commands (Advanced Skill)

Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates.

Editor's Note

Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates. Covers purpose, modes, methodology.

Page Outline

PurposeModesMethodologyWorkflowSource VerificationScoring CriteriaDetection PatternsIndustry Context

Source Content

Normalized top-level metadata comes from the directory layer. The body below is the upstream source content for this item.

Audit Agents/Skills/Commands (Advanced Skill)

Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.

Purpose

**Problem**: Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, leading to "agent bugs" as the top challenge (18% of teams).

**Solution**: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).

**Key Features**:

  • Quantitative scoring (32 points for agents/skills, 20 for commands)
  • Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
  • Production readiness grading (A-F scale with 80% threshold)
  • Comparative analysis vs reference templates
  • JSON/Markdown dual output for programmatic integration
  • Fix suggestions for failing criteria

---

Modes

| Mode | Usage | Output | |------|-------|--------| | **Quick Audit** | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) | | **Full Audit** | All 16 criteria per file | Detailed scores + recommendations (10-15 min) | | **Comparative** | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |

**Default**: Full Audit (recommended for first run)

---

Methodology

Why These Criteria?

The 16-criteria framework is derived from:

  • **Claude Code Best Practices** (Ultimate Guide line 4921: Agent Validation Checklist)
  • **Industry Data** (LangChain Agent Report 2026: evaluation gaps)
  • **Production Failures** (Community feedback on hardcoded paths, missing error handling)
  • **Composition Patterns** (Skills should reference other skills, agents should be modular)

Scoring Philosophy

**Weight Rationale**:

  • **Identity (3x)**: If users can't find/invoke the agent, quality is irrelevant (discoverability > quality)
  • **Prompt (2x)**: Determines reliability and accuracy of outputs
  • **Validation (1x)**: Improves robustness but is secondary to core functionality
  • **Design (2x)**: Impacts long-term maintainability and scalability

**Grade Standards**:

  • **A (90-100%)**: Production-ready, minimal risk
  • **B (80-89%)**: Good, meets production threshold
  • **C (70-79%)**: Needs improvement before production
  • **D (60-69%)**: Significant gaps, not production-ready
  • **F (<60%)**: Critical issues, requires major refactoring

**Industry Alignment**: The 80% threshold aligns with software engineering best practices for production deployment (e.g., code coverage >80%, security scan pass rates).

---

Workflow

Phase 1: Discovery

  • **Scan directories**:
   .claude/agents/
   .claude/skills/
   .claude/commands/
   examples/agents/      (if exists)
   examples/skills/      (if exists)
   examples/commands/    (if exists)
  • **Classify files** by type (agent/skill/command)
  • **Load reference templates** (for Comparative mode):
   guide/examples/agents/     (benchmark files)
   guide/examples/skills/     (benchmark files)
   guide/examples/commands/   (benchmark files)

Phase 2: Scoring Engine

Load scoring criteria from `scoring/criteria.yaml`:

agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
        # ... (16 total criteria)

For each file:

  • Parse frontmatter (YAML)
  • Extract content sections
  • Run detection patterns (regex, keyword search)
  • Calculate score: `(points / max_points) × 100`
  • Assign grade (A-F)

Phase 3: Comparative Analysis (Comparative Mode Only)

For each project file:

  • Find closest matching template (by description similarity)
  • Compare scores per criterion
  • Identify gaps: `template_score - project_score`
  • Flag significant gaps (>10 points difference)

**Example**:

Project file: .claude/agents/debugging-specialist.md (Score: 78%, Grade C)
Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)

Gaps:
- Anti-hallucination measures: -2 points (template has, project missing)
- Edge cases documented: -1 point (template has 5 examples, project has 1)
- Integration documented: -1 point (template references 3 skills, project none)

Total gap: 16 points (explains C vs A difference)

Phase 4: Report Generation

**Markdown Report** (`audit-report.md`):

  • Summary table (overall + by type)
  • Individual scores with top issues
  • Detailed breakdown per file (collapsible)
  • Prioritized recommendations

**JSON Output** (`audit-report.json`):

{
  "metadata": {
    "project_path": "/path/to/project",
    "audit_date": "2026-02-07",
    "mode": "full",
    "version": "1.0.0"
  },
  "summary": {
    "overall_score": 82.5,
    "overall_grade": "B",
    "total_files": 15,
    "production_ready_count": 10,
    "production_ready_percentage": 66.7
  },
  "by_type": {
    "agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
    "skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
    "commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
  },
  "files": [
    {
      "path": ".claude/agents/debugging-specialist.md",
      "type": "agent",
      "score": 78.1,
      "grade": "C",
      "points_obtained": 25,
      "points_max": 32,
      "failed_criteria": [
        {
          "id": "A2.4",
          "name": "Anti-hallucination measures",
          "points_lost": 2,
          "recommendation": "Add section on source verification"
        }
      ]
    }
  ],
  "top_issues": [
    {
      "issue": "Missing error handling",
      "affected_files": 8,
      "impact": "Runtime failures unhandled",
      "priority": "high"
    }
  ]
}

Phase 5: Fix Suggestions (Optional)

For each failing criterion, generate **actionable fix**:

### File: .claude/agents/debugging-specialist.md
**Issue**: Missing anti-hallucination measures (2 points lost)

**Fix**:
Add this section after "Methodology":

## Source Verification

- Always cite sources for technical claims
- Use phrases: "According to [documentation]...", "Based on [tool output]..."
- If uncertain, state: "I don't have verified information on..."
- Never invent: statistics, version numbers, API signatures, stack traces

**Detection**: Grep for keywords: "verify", "cite", "source", "evidence"

---

Scoring Criteria

See `scoring/criteria.yaml` for complete definitions. Summary:

Agents (32 points max)

| Category | Weight | Criteria Count | Max Points | |----------|--------|----------------|------------| | Identity | 3x | 4 | 12 | | Prompt Quality | 2x | 4 | 8 | | Validation | 1x | 4 | 4 | | Design | 2x | 4 | 8 |

**Key Criteria**:

  • Clear name (3 pts): Not generic like "agent1"
  • Description with triggers (3 pts): Contains "when"/"use"
  • Role defined (2 pts): "You are..." statement
  • 3+ examples (1 pt): Usage scenarios documented
  • Single responsibility (2 pts): Focused, not "general purpose"

Skills (32 points max)

| Category | Weight | Criteria Count | Max Points | |----------|--------|----------------|------------| | Structure | 3x | 4 | 12 | | Content | 2x | 4 | 8 | | Technical | 1x | 4 | 4 | | Design | 2x | 4 | 8 |

**Key Criteria**:

  • Valid SKILL.md (3 pts): Proper naming
  • Name valid (3 pts): Lowercase, 1-64 chars, no spaces
  • Methodology described (2 pts): Workflow section exists
  • No hardcoded paths (1 pt): No `/Users/`, `/home/`
  • Clear triggers (2 pts): "When to use" section

Commands (20 points max)

| Category | Weight | Criteria Count | Max Points | |----------|--------|----------------|------------| | Structure | 3x | 4 | 12 | | Quality | 2x | 4 | 8 |

**Key Criteria**:

  • Valid frontmatter (3 pts): name + description
  • Argument hint (3 pts): If uses `$ARGUMENTS`
  • Step-by-step workflow (3 pts): Numbered sections
  • Error handling (2 pts): Mentions failure modes

---

Detection Patterns

Frontmatter Parsing

import yaml
import re

def parse_frontmatter(content):
    match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
    if match:
        return yaml.safe_load(match.group(1))
    return None

Keyword Detection

def has_keywords(text, keywords):
    text_lower = text.lower()
    return any(kw in text_lower for kw in keywords)

# Example
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])

Overlap Detection (Duplication Check)

def jaccard_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0

# Flag if similarity > 0.5 (50% keyword overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
    issues.append("High overlap with another file")

Token Counting (Approximate)

def estimate_tokens(text):
    # Rough estimate: 1 token ≈ 0.75 words
    word_count = len(text.split())
    return int(word_count * 1.3)

# Check budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
    issues.append("File too large (>5K tokens)")

---

Industry Context

**Source**: LangChain Agent Report 2026 (public report, page 14-22)

**Key Findings**:

  • **29.5%** of organizations deploy agents without systematic evaluation
  • **18%** cite "agent bugs" as their primary challenge
  • **Only 12%** use automated quality checks (88% manual or none)
  • **43%** report difficulty maintaining agent quality over time
  • **Top issues**: Hallucinations (31%), poor error handling (28%), unclear triggers (22%)

**Implications**:

  • **Automation gap**: Most teams rely on manual checklists (error-prone at scale)
  • **Quality debt**: Agents deployed without validation accumulate technical debt
  • **Maintenance burden**: 43% struggle with quality over time (no tracking system)

**This skill addresses**:

  • Automation: Replaces manual checklists with quantitative scoring
  • Tracking: JSON output enables trend analysis over time
  • Standards: 80% threshold provides clear production gate

---

Output Examples

Quick Audit (Top-5 Criteria)

# Quick Audit: Agents/Skills/Commands

**Files**: 15 (5 agents, 8 skills, 2 commands)
**Critical Issues**: 3 files fail top-5 criteria

## Top-5 Criteria (Pass/Fail)

| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
|------|------------|--------------|----------------|--------------------|----------|
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |

## Action Required

1. **Add error handling**: 5 files
2. **Remove hardcoded paths**: 3 files
3. **Add usage examples**: 4 files

Full Audit

See Phase 4: Report Generation above for full structure.

Comparative (Full + Benchmarks)

# Comparative Audit

## Project vs Templates

| File | Project Score | Template Score | Gap | Top Missing |
|------|---------------|----------------|-----|-------------|
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |

## Recommendations

Focus on these gaps to reach template quality:
1. **Anti-hallucination measures** (8 files): Add source verification sections
2. **Edge case documentation** (5 files): Add failure scenario examples
3. **Integration documentation** (4 files): List compatible agents/skills

---

Usage

Basic (Full Audit)

# In Claude Code
Use skill: audit-agents-skills

# Specify path
Use skill: audit-agents-skills for ~/projects/my-app

With Options

# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick

# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative

# Generate fixes
Use skill: audit-agents-skills with fixes=true

# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json

JSON Output Only

# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json

---

Integration with CI/CD

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

# Run quick audit on changed agent/skill/command files
changed_files=$(git diff --cached --name-only | grep -E "^\.claude/(agents|skills|commands)/")

if [ -n "$changed_files" ]; then
    echo "Running quick audit on changed files..."
    # Run audit (requires Claude Code CLI wrapper)
    # Exit with 1 if any file scores <80%
fi

GitHub Actions

name: Audit Agents/Skills
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run quality audit
        run: |
          # Run audit skill
          # Parse JSON output
          # Fail if overall_score < 80

---

Comparison: Command vs Skill

| Aspect | Command (`/audit-agents-skills`) | Skill (this file) | |--------|----------------------------------|-------------------| | **Scope** | Current project only | Multi-project, comparative | | **Output** | Markdown report | Markdown + JSON | | **Speed** | Fast (5-10 min) | Slower (10-20 min with comparative) | | **Depth** | Standard 16 criteria | Same + benchmark analysis | | **Fix suggestions** | Via `--fix` flag | Built-in with recommendations | | **Programmatic** | Terminal output | JSON for CI/CD integration | | **Best for** | Quick checks, dev workflow | Deep audits, quality tracking |

**Recommendation**: Use command for daily checks, skill for release gates and quality tracking.

---

Maintenance

Updating Criteria

Edit `scoring/criteria.yaml`:

agents:
  categories:
    identity:
      criteria:
        - id: A1.5  # New criterion
          name: "API versioning specified"
          points: 3
          detection: "mentions API version or compatibility"

Version bump: Increment `version` in frontmatter when criteria change.

Adding File Types

To support new file types (e.g., "workflows"):

  • Add to `scoring/criteria.yaml`:
   workflows:
     max_points: 24
     categories: [...]
  • Update detection logic (file path patterns)
  • Update report templates

---

Related

  • **Command version**: `.claude/commands/audit-agents-skills.md`
  • **Agent Validation Checklist**: guide line 4921 (manual 16 criteria)
  • **Skill Validation**: guide line 5491 (spec documentation)
  • **Reference templates**: `examples/agents/`, `examples/skills/`, `examples/commands/`

---

Changelog

**v1.0.0** (2026-02-07):

  • Initial release
  • 16-criteria framework (agents/skills/commands)
  • 3 audit modes (quick/full/comparative)
  • JSON + Markdown output
  • Fix suggestions
  • Industry context (LangChain 2026 report)

---

**Skill ready for use**: `audit-agents-skills`

Related Items

Deploy agents, MCP servers, and backends fast logo

Railway - Deploy agents and MCP servers fast

Try Railway