CodeGraph Quality Audit
Measures how much CodeGraph helps an agent versus plain grep/read, for a chosen codegraph version on a chosen real-world repo. Drives the harness in scripts/agent-eval/.
Prerequisites
tmux3+, a logged-inclaudeCLI,node,git(macOS/Linux).- Run from the codegraph repo root.
Workflow
Copy this checklist:
- [ ] 1. Pick version (local or npm)
- [ ] 2. Pick language
- [ ] 3. Pick repo by size
- [ ] 4. Pick harness (headless / tmux / both)
- [ ] 5. Run audit.sh in the background
- [ ] 6. Report results
Step 1 — version. Ask with AskUserQuestion: which codegraph version to test. Offer "Local dev build" and "Latest published"; the free-text "Other" lets the user type a specific version (e.g. 0.7.10). Map the answer to a VERSION token:
- "Local dev build" →
local - "Latest published" →
latest - a typed version → that string (e.g.
0.7.10)
Step 2 — language. Read .claude/skills/agent-eval/corpus.json. Ask with AskUserQuestion which language to test, listing the languages that have entries.
Step 3 — repo. From the chosen language's entries, ask which repo. Label each option with its size and file count, e.g. excalidraw — Medium (~600 files). Each entry carries the repo URL and a representative question.
Step 4 — harness. Ask with AskUserQuestion which harness to run, and map the answer to a MODE token:
- "Headless" →
headless—claude -pwith stream-json: exact tokens/cost and a
clean tool sequence (2 runs, fast, no TTY).
- "Interactive (tmux)" →
tmux— drives the real Claude TUI in tmux: faithful
Explore-subagent behavior, metrics from session logs (2 runs, slower).
- "Both" →
all— headless + interactive (4 runs).
Step 5 — run. Launch in the background (sets the version, clones if missing, wipes + re-indexes, runs the chosen arms — several minutes):
scripts/agent-eval/audit.sh <VERSION> <repo-name> <repo-url> "<question>" <MODE>
Step 6 — report. When the job finishes, read the log and report per arm:
- Headless (
parse-run.mjs): total tool calls, fileReads, Grep/Bash,
codegraph-tool calls, duration, total cost.
- Interactive (
parse-session.mjs): the `VERDICT: codegraph_explore used Nx |
Read N | Grep/Bash N and TOKENS:` lines.
Lead with cost + tool/Read counts — they are the reliable signals; raw token in/out are confounded by subagent delegation and prompt caching. State whether codegraph reduced effort and whether both arms reached a correct answer.
Notes
- The index is rebuilt every run (
audit.shwipes.codegraph) — different
versions extract differently, so an index must be served by the same binary that built it.
audit.shtemporarily mutates the globalcodegraphinstall for the test,
then restores your dev link via local-install.sh.
- Corpus repos are cloned to
/tmp/codegraph-corpus(reused if already present). - Add or edit repos in
corpus.json(fields:name,repo,size,files,
question).

