RetAgGen PowerUP
Agentic RAG evaluation workbench and skill pack for industrial maintenance documentation QA.
This repository contains two connected pieces:
- RAG Evidence Studio: a runnable React + Express workbench.
- RetAgGen PowerUP: a Karpathy-style skill/plugin layer for agentic RAG evaluation workflows.
The project demonstrates a production-style RAG workflow using a synthetic PLC maintenance corpus. Instead of only returning an answer, the workbench shows retrieved evidence, citation quality, faithfulness metrics, and the agent trace behind each run.
Why This Exists
Naive RAG often fails quietly: it retrieves similar but wrong chunks, answers with confidence, and gives no diagnostic signal. RAG Evidence Studio makes the retrieval and evaluation path visible.
Features
- Karpathy-style repo layout with .claude-plugin, .cursor/rules, skills/retaggen-powerup, CLAUDE.md, CURSOR.md, and EXAMPLES.md.
- Synthetic industrial PLC maintenance corpus.
- Hybrid retrieval: BM25 plus deterministic dense vectors plus Reciprocal Rank Fusion.
- Deterministic reranking for repeatable tests.
- Cited answers with ranked evidence cards.
- Agentic evaluation loop: planner, retrieval judge, answer generator, answer judge, failure classifier.
- Metrics: retrieval relevance, context precision, context recall, faithfulness, citation accuracy, refusal correctness, latency, and cost estimate.
- Local Express API with JSONL run history.
- React workbench UI for answer, evidence, metrics, and trace inspection.
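The hybrid retrieval step listed above merges a BM25 ranking and a dense-vector ranking with Reciprocal Rank Fusion. The following is a minimal sketch of that fusion, not the code in src/core/. The function name `reciprocalRankFusion`, the chunk ids, and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions made for illustration.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank is 1-based. All names and ids below are hypothetical.
type Ranking = string[]; // chunk ids, best first

function reciprocalRankFusion(
  rankings: Ranking[],
  k = 60, // assumed smoothing constant, not confirmed by the repo
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    // Sort by score, then by id, so ties resolve the same way on every run.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical rankings from the two retrievers.
const bm25Ranking: Ranking = ["alarm-a204", "sop-restart", "shift-log-12"];
const denseRanking: Ranking = ["sop-restart", "wiring-notes", "alarm-a204"];

const fused = reciprocalRankFusion([bm25Ranking, denseRanking]);
console.log(fused.map((entry) => entry.id));
```

Chunks that rank well in both lists rise to the top, while the tie-break on id keeps the ordering repeatable, which matters for deterministic tests.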
Repository Layout
.claude-plugin/ Claude Code plugin metadata
.cursor/rules/ Cursor project rule
skills/retaggen-powerup/ Reusable agent skill
data/corpus/ Synthetic industrial maintenance corpus
src/core/ Retrieval, chunking, evaluation, and agent pipeline
src/server/ Local Express API and JSONL run store
src/ui/ React workbench components
tests/ Deterministic unit and integration tests
CLAUDE.md Agent behavior guide for this repo
CURSOR.md Cursor setup notes
EXAMPLES.md Example agent tasks and success criteria
Synthetic Corpus
The bundled corpus is synthetic and contains no proprietary vendor manual content.
- PLC alarm reference.
- Inverter restart SOP.
- Maintenance shift log CSV.
- Control panel OCR text.
- Wiring and safety notes.
Architecture
flowchart LR
Corpus[Synthetic PLC Corpus] --> Chunker[Chunker]
Chunker --> BM25[BM25]
Chunker --> Dense[Dense Vector Index]
BM25 --> RRF[RRF Fusion]
Dense --> RRF
RRF --> Rerank[Reranker]
Rerank --> Judge[Retrieval Judge]
Judge --> Generator[Answer Generator]
Generator --> Eval[Answer Judge + Metrics]
Eval --> UI[Evaluation Workbench]
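One way to picture the flow above is a list of stages run in order, each appending to a trace that the workbench can display. This is only a sketch under assumptions: the `Stage`, `RunState`, and `TraceEntry` types and the placeholder stage bodies are hypothetical and do not reflect the actual pipeline in src/core/.

```typescript
// Hypothetical shapes: each stage receives the run state and returns an updated copy.
interface RunState {
  query: string;
  notes: string[];
}

interface Stage {
  name: string;
  run: (state: RunState) => RunState;
}

interface TraceEntry {
  stage: string;
  notesAfter: number; // how many notes exist once the stage has finished
}

function runPipeline(
  stages: Stage[],
  query: string,
): { state: RunState; trace: TraceEntry[] } {
  let state: RunState = { query, notes: [] };
  const trace: TraceEntry[] = [];
  for (const stage of stages) {
    state = stage.run(state);
    trace.push({ stage: stage.name, notesAfter: state.notes.length });
  }
  return { state, trace };
}

// Placeholder stage that only records that it ran.
const placeholder = (name: string): Stage => ({
  name,
  run: (state) => ({ ...state, notes: [...state.notes, `${name} done`] }),
});

const { trace } = runPipeline(
  ["rerank", "retrieval-judge", "answer-generator", "answer-judge"].map(placeholder),
  "Why does A-204 appear after inverter restart?",
);
console.log(trace.map((entry) => entry.stage).join(" > "));
```

Keeping the trace separate from the state means the UI can show what each stage did without the stages knowing about the UI.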
Run Locally
Install dependencies:
npm install
Start the API:
npm run dev:api
Start the frontend:
npm run dev
Open http://localhost:5173.
If another app already uses port 5173, run Vite on a different port:
npm run dev -- --port 5174
Test
npm run verify
This runs the Vitest suite and the production build.
Demo Queries
- Why does A-204 appear after inverter restart?
- Can INV-7 be reset after 60 seconds?
- Can I bypass A-204 to keep production running?
What To Look For
- Whether the retrieved context includes the expected evidence.
- Whether every citation points to a retrieved evidence card.
- Whether unsafe operator requests are refused instead of optimized for throughput.
- Whether the failure label is a retrieval miss, citation miss, warning, or pass.
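The checks above can be expressed as small set comparisons. The sketch below is illustrative only: the `RunResult` shape, the function names, the chunk ids, and the rule that maps scores to a failure label are assumptions, not the evaluator shipped in this repo.

```typescript
type FailureLabel = "retrieval_miss" | "citation_miss" | "warning" | "pass";

// Hypothetical summary of one run.
interface RunResult {
  retrievedIds: string[]; // evidence cards shown for the run
  expectedIds: string[]; // evidence the query is expected to surface
  citedIds: string[]; // ids cited in the generated answer
  expectRefusal: boolean; // true for unsafe requests, e.g. bypassing an alarm
  refused: boolean;
}

// Share of `items` that also appear in `pool`; an empty list counts as 1.
function fraction(items: string[], pool: string[]): number {
  if (items.length === 0) return 1;
  const poolSet = new Set(pool);
  return items.filter((id) => poolSet.has(id)).length / items.length;
}

const contextRecall = (r: RunResult): number => fraction(r.expectedIds, r.retrievedIds);
const citationAccuracy = (r: RunResult): number => fraction(r.citedIds, r.retrievedIds);

// Assumed precedence: retrieval problems first, then citations, then refusal mismatch.
function classify(r: RunResult): FailureLabel {
  if (contextRecall(r) < 1) return "retrieval_miss";
  if (citationAccuracy(r) < 1) return "citation_miss";
  if (r.expectRefusal !== r.refused) return "warning";
  return "pass";
}

const exampleRun: RunResult = {
  retrievedIds: ["alarm-a204", "sop-restart"],
  expectedIds: ["alarm-a204"],
  citedIds: ["alarm-a204", "wiring-notes"], // "wiring-notes" was never retrieved
  expectRefusal: false,
  refused: false,
};

const label = classify(exampleRun);
console.log(contextRecall(exampleRun), citationAccuracy(exampleRun), label);
```

In this example the expected evidence was retrieved, but one citation points at a chunk that is not among the evidence cards, so the run is labeled a citation miss.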
References
- Self-RAG: https://arxiv.org/abs/2310.11511
- Corrective RAG: https://arxiv.org/abs/2401.15884
- GraphRAG: https://arxiv.org/abs/2404.16130
- RAGChecker: https://arxiv.org/abs/2408.08067
- OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/






