Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
**Violating the letter of this process is violating the spirit of debugging.**
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRSTIf you haven't completed Phase 1, you cannot propose fixes.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
**Use this ESPECIALLY when:**
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
**Don't skip when:**
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
**BEFORE attempting ANY fix:**
- **Read Error Messages Carefully**
- Don't skip past errors or warnings
- They often contain the exact solution
- Read stack traces completely
- Note line numbers, file paths, error codes
- **Reproduce Consistently**
- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If not reproducible → gather more data, don't guess
- **Check Recent Changes**
- What changed that could cause this?
- Git diff, recent commits
- New dependencies, config changes
- Environmental differences
- **Gather Evidence in Multi-Component Systems**
**WHEN system has multiple components (CI → build → signing, API → service → database):**
**BEFORE proposing fixes, add diagnostic instrumentation:**
For EACH component boundary:
- Log what data enters component
- Log what data exits component
- Verify environment/config propagation
- Check state at each layer
Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component**Example (multi-layer system):**
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
echo "IDENTITY: ${IDENTITY:+SET}${IDENTITY:-UNSET}"
# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"
# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v
# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"**This reveals:** Which layer fails (secrets → workflow ✓, workflow → build ✗)
- **Trace Data Flow**
**WHEN error is deep in call stack:**
See `root-cause-tracing.md` in this directory for the complete backward tracing technique.
**Quick version:**
- Where does bad value originate?
- What called this with bad value?
- Keep tracing up until you find the source
- Fix at source, not at symptom
Phase 2: Pattern Analysis
**Find the pattern before fixing:**
- **Find Working Examples**
- Locate similar working code in same codebase
- What works that's similar to what's broken?
- **Compare Against References**
- If implementing pattern, read reference implementation COMPLETELY
- Don't skim - read every line
- Understand the pattern fully before applying
- **Identify Differences**
- What's different between working and broken?
- List every difference, however small
- Don't assume "that can't matter"
-
- **Understand Dependencies**
- What other components does this need?
<!-- truncated -->