Audit Agents/Skills/Commands (Advanced Skill)
Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.
Purpose
**Problem**: Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, leading to "agent bugs" as the top challenge (18% of teams).
**Solution**: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).
**Key Features**:
- Quantitative scoring (32 points for agents/skills, 20 for commands)
- Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
- Production readiness grading (A-F scale with 80% threshold)
- Comparative analysis vs reference templates
- JSON/Markdown dual output for programmatic integration
- Fix suggestions for failing criteria
---
Modes
| Mode | Usage | Output |
|------|-------|--------|
| **Quick Audit** | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| **Full Audit** | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| **Comparative** | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |
**Default**: Full Audit (recommended for first run)
---
Methodology
Why These Criteria?
The 16-criteria framework is derived from:
- **Claude Code Best Practices** (Ultimate Guide line 4921: Agent Validation Checklist)
- **Industry Data** (LangChain Agent Report 2026: evaluation gaps)
- **Production Failures** (Community feedback on hardcoded paths, missing error handling)
- **Composition Patterns** (Skills should reference other skills, agents should be modular)
Scoring Philosophy
**Weight Rationale**:
- **Identity (3x)**: If users can't find/invoke the agent, quality is irrelevant (discoverability > quality)
- **Prompt (2x)**: Determines reliability and accuracy of outputs
- **Validation (1x)**: Improves robustness but is secondary to core functionality
- **Design (2x)**: Impacts long-term maintainability and scalability
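The weighting can be illustrated with a short sketch. It assumes each criterion is pass/fail and worth its category weight in points, and that the 16 criteria split evenly into four per category; that split is an inference from the totals (4×3 + 4×2 + 4×1 + 4×2 = 32), not something stated outright. `CriterionResult`, `max_points`, and `score` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Category weights as listed in this document.
WEIGHTS = {"identity": 3, "prompt": 2, "validation": 1, "design": 2}

# Assumption: four criteria per category, the even split that yields
# 16 criteria and 32 points in total.
CRITERIA_PER_CATEGORY = 4


@dataclass
class CriterionResult:
    criterion_id: str  # e.g. "A1.1"; other ids used here are made up
    category: str      # one of the WEIGHTS keys
    passed: bool


def max_points() -> int:
    """Maximum achievable points under the assumed even split."""
    return sum(weight * CRITERIA_PER_CATEGORY for weight in WEIGHTS.values())


def score(results: list[CriterionResult]) -> tuple[int, float]:
    """Return (points earned, percentage) for a set of criterion results."""
    earned = sum(WEIGHTS[r.category] for r in results if r.passed)
    total = sum(WEIGHTS[r.category] for r in results)
    pct = 100.0 * earned / total if total else 0.0
    return earned, pct
```

Under these assumptions, failing a single Identity criterion costs 3 of 32 points (29 points, about 90.6%), while failing a single Validation criterion costs only 1 (31 points, about 96.9%).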
**Grade Standards**:
- **A (90-100%)**: Production-ready, minimal risk
- **B (80-89%)**: Good, meets production threshold
- **C (70-79%)**: Needs improvement before production
- **D (60-69%)**: Significant gaps, not production-ready
- **F (<60%)**: Critical issues, requires major refactoring
**Industry Alignment**: The 80% threshold mirrors release gates commonly used in software engineering (e.g., a minimum of 80% code coverage or a required pass rate on security scans).
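The grade bands and the production threshold can be expressed as a small lookup. This is a minimal sketch: the function names are hypothetical, and it assumes band boundaries are inclusive at the lower end, so a fractional score such as 89.5% falls in B.

```python
def grade(pct: float) -> str:
    """Map a percentage score to the A-F scale described above."""
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if pct >= floor:
            return letter
    return "F"


def is_production_ready(pct: float, threshold: float = 80.0) -> bool:
    """A file meets the production bar at Grade B or better."""
    return pct >= threshold


def min_points_for_production(max_points: int, threshold_pct: int = 80) -> int:
    """Smallest whole-point score that reaches the threshold.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    return (threshold_pct * max_points + 99) // 100
```

On the 32-point scale for agents and skills, 26 points (81.25%) is the lowest score that reaches Grade B, while 25 points (about 78.1%) lands in C. On the 20-point scale for commands, the bar is 16 points.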
---
Workflow
Phase 1: Discovery
- **Scan directories**:
  - `.claude/agents/`
  - `.claude/skills/`
  - `.claude/commands/`
  - `examples/agents/` (if exists)
  - `examples/skills/` (if exists)
  - `examples/commands/` (if exists)
- **Classify files** by type (agent/skill/command)
- **Load reference templates** (for Comparative mode):
  - `guide/examples/agents/` (benchmark files)
  - `guide/examples/skills/` (benchmark files)
  - `guide/examples/commands/` (benchmark files)
Phase 2: Scoring Engine
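The scoring engine consumes the file inventory produced in Phase 1. A minimal sketch of that input follows, assuming files are classified by the directory they live in and that every definition is a Markdown file (`*.md`); both are assumptions, and `SCAN_DIRS`, `MAX_POINTS`, and `discover` are hypothetical names.

```python
from pathlib import Path

# Directories from Phase 1, mapped to the file type they imply.
# Assumption: type is inferred from location, not from file contents.
SCAN_DIRS = {
    ".claude/agents": "agent",
    ".claude/skills": "skill",
    ".claude/commands": "command",
    "examples/agents": "agent",
    "examples/skills": "skill",
    "examples/commands": "command",
}

# Point scales stated in this document.
MAX_POINTS = {"agent": 32, "skill": 32, "command": 20}


def discover(root: Path) -> list[tuple[Path, str]]:
    """Return (path, kind) pairs for every Markdown file under SCAN_DIRS.

    Missing directories are skipped, which covers the "(if exists)" cases.
    """
    found = []
    for rel_dir, kind in SCAN_DIRS.items():
        directory = root / rel_dir
        if not directory.is_dir():
            continue
        for path in sorted(directory.rglob("*.md")):
            found.append((path, kind))
    return found
```

Each discovered pair tells the engine which point scale applies, for example `MAX_POINTS[kind]` gives 20 for a command and 32 for an agent or skill.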
Load scoring criteria from `scoring/criteria.yaml`:
agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
# ... (16
<!-- truncated -->
