RetAgGen PowerUP

Agentic RAG evaluation workbench and skill pack for industrial maintenance documentation QA.

This repository contains two connected pieces:

RAG Evidence Studio: a runnable React + Express workbench.
RetAgGen PowerUP: a Karpathy-style skill/plugin layer for agentic RAG evaluation workflows.

The project demonstrates a production-style RAG workflow using a synthetic PLC maintenance corpus. Instead of only returning an answer, the workbench shows retrieved evidence, citation quality, faithfulness metrics, and the agent trace behind each run.

Why This Exists

Naive RAG often fails quietly: it retrieves similar but wrong chunks, answers with confidence, and gives no diagnostic signal. RAG Evidence Studio makes the retrieval and evaluation path visible.

Features

Karpathy-style repo layout with .claude-plugin, .cursor/rules, skills/retaggen-powerup, CLAUDE.md, CURSOR.md, and EXAMPLES.md.
Synthetic industrial PLC maintenance corpus.
Hybrid retrieval: BM25 and deterministic dense vectors, combined with Reciprocal Rank Fusion (RRF).
Deterministic reranking for repeatable tests.
Cited answers with ranked evidence cards.
Agentic evaluation loop: planner, retrieval judge, answer generator, answer judge, failure classifier.
Metrics: retrieval relevance, context precision, context recall, faithfulness, citation accuracy, refusal correctness, latency, and cost estimate.
Local Express API with JSONL run history.
React workbench UI for answer, evidence, metrics, and trace inspection.
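
As a rough illustration of the hybrid retrieval feature listed above, the sketch below fuses two ranked lists with Reciprocal Rank Fusion. It is a minimal example, not the code in src/core/: the function name, the chunk ids, and the constant k = 60 (a commonly used default) are all assumptions.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank),
// where rank is 1-based. k = 60 is an assumed default, not a value
// confirmed by this repository.
type Ranking = string[]; // chunk ids, best first

function reciprocalRankFusion(
  rankings: Ranking[],
  k = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score; break ties by id so the output stays deterministic.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id)
  );
}

// Hypothetical chunk ids, for illustration only.
const bm25Ranking: Ranking = ["alarm-ref#3", "sop#1", "shift-log#7"];
const denseRanking: Ranking = ["sop#1", "wiring#2", "alarm-ref#3"];
const fused = reciprocalRankFusion([bm25Ranking, denseRanking]);
console.log(fused.map((entry) => entry.id));
```

Because RRF only uses rank positions, the BM25 and dense scores never need to be put on a common scale, which is one reason it suits repeatable tests.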

Repository Layout

.claude-plugin/              Claude Code plugin metadata
.cursor/rules/               Cursor project rule
skills/retaggen-powerup/     Reusable agent skill
data/corpus/                 Synthetic industrial maintenance corpus
src/core/                    Retrieval, chunking, evaluation, and agent pipeline
src/server/                  Local Express API and JSONL run store
src/ui/                      React workbench components
tests/                       Deterministic unit and integration tests
CLAUDE.md                    Agent behavior guide for this repo
CURSOR.md                    Cursor setup notes
EXAMPLES.md                  Example agent tasks and success criteria
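
The JSONL run store under src/server/ presumably keeps one JSON object per line. The sketch below shows that encoding with in-memory strings only; the RunRecord fields and both function names are hypothetical, and the real store may differ.

```typescript
// Hypothetical record shape; the repository's actual fields are not documented here.
interface RunRecord {
  id: string;
  query: string;
  failureLabel: string;
  latencyMs: number;
}

// Append one record as a single line. JSON.stringify escapes newlines inside
// string values, so each record always occupies exactly one line.
function appendRun(log: string, record: RunRecord): string {
  return log + JSON.stringify(record) + "\n";
}

// Parse a JSONL string back into records, skipping blank lines.
function readRuns(log: string): RunRecord[] {
  return log
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RunRecord);
}

let runLog = "";
runLog = appendRun(runLog, {
  id: "run-1",
  query: "Why does A-204 appear after inverter restart?",
  failureLabel: "pass",
  latencyMs: 120,
});
runLog = appendRun(runLog, {
  id: "run-2",
  query: "Line one\nline two",
  failureLabel: "warning",
  latencyMs: 95,
});
const storedRuns = readRuns(runLog);
```

An append-only line format keeps history writes cheap and lets a partially written last line be discarded without corrupting earlier runs.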

Synthetic Corpus

The bundled corpus is synthetic and contains no proprietary vendor manual content. It includes:

PLC alarm reference.
Inverter restart SOP.
Maintenance shift log CSV.
Control panel OCR text.
Wiring and safety notes.

Architecture

flowchart LR
  Corpus[Synthetic PLC Corpus] --> Chunker[Chunker]
  Chunker --> BM25[BM25]
  Chunker --> Dense[Dense Vector Index]
  BM25 --> RRF[RRF Fusion]
  Dense --> RRF
  RRF --> Rerank[Reranker]
  Rerank --> Judge[Retrieval Judge]
  Judge --> Generator[Answer Generator]
  Generator --> Eval[Answer Judge + Metrics]
  Eval --> UI[Evaluation Workbench]
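
One way the Dense Vector Index stage could stay deterministic without calling an embedding model is to hash tokens into a fixed number of buckets. The sketch below assumes that approach; the repository's real vectorizer is not described here, and the dimension count, tokenizer, and function names are hypothetical.

```typescript
// Assumption: a model-free "dense" vector built by feature hashing.
const DIMENSIONS = 64; // hypothetical vector size

// 32-bit FNV-1a style hash over UTF-16 code units: the same string always
// maps to the same unsigned integer.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function embed(text: string, dimensions = DIMENSIONS): number[] {
  const vector = new Array<number>(dimensions).fill(0);
  const tokens = text.toLowerCase().match(/[a-z0-9-]+/g) ?? [];
  for (const token of tokens) {
    vector[fnv1a(token) % dimensions] += 1;
  }
  // L2-normalize so that a plain dot product equals cosine similarity.
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vector : vector.map((v) => v / norm);
}

// Dot product of two equal-length, already normalized vectors.
function cosine(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

const queryVector = embed("Why does A-204 appear after inverter restart?");
const chunkVector = embed("A-204 is raised when the inverter restart is attempted too early.");
console.log(cosine(queryVector, chunkVector));
```

Hashed vectors capture token overlap rather than meaning, so they are a stand-in for test repeatability, not a substitute for a learned embedding.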

Run Locally

Install dependencies:

npm install

Start the API:

npm run dev:api

Start the frontend:

npm run dev

Open http://localhost:5173.

If another app already uses port 5173, run Vite on a different port:

npm run dev -- --port 5174

Test

npm run verify

This runs the Vitest suite and the production build.

Demo Queries

Why does A-204 appear after inverter restart?
Can INV-7 be reset after 60 seconds?
Can I bypass A-204 to keep production running?

What To Look For

Whether the retrieved context includes the expected evidence.
Whether every citation points to a retrieved evidence card.
Whether unsafe operator requests are refused instead of optimized for throughput.
Whether the failure label is a retrieval miss, citation miss, warning, or pass.
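
The checks above can be sketched as small pure functions. This is an illustrative sketch only: the input shape, the label spellings, and the decision order in classify are assumptions, not the rules implemented in src/core/.

```typescript
// Hypothetical label spellings for the four outcomes named above.
type FailureLabel = "retrieval_miss" | "citation_miss" | "warning" | "pass";

interface RunCheckInput {
  retrievedIds: string[]; // ids of the evidence cards shown for the run
  expectedIds: string[]; // ids a gold label says should have been retrieved
  citedIds: string[]; // ids the answer cites
}

// Fraction of expected evidence that was actually retrieved (1 if none expected).
function contextRecall({ retrievedIds, expectedIds }: RunCheckInput): number {
  if (expectedIds.length === 0) return 1;
  const retrieved = new Set(retrievedIds);
  return expectedIds.filter((id) => retrieved.has(id)).length / expectedIds.length;
}

// Fraction of citations that point to a retrieved evidence card (1 if no citations).
function citationAccuracy({ retrievedIds, citedIds }: RunCheckInput): number {
  if (citedIds.length === 0) return 1;
  const retrieved = new Set(retrievedIds);
  return citedIds.filter((id) => retrieved.has(id)).length / citedIds.length;
}

// Assumed decision order: missing evidence first, then bad citations,
// then partial recall as a warning.
function classify(input: RunCheckInput): FailureLabel {
  const recall = contextRecall(input);
  if (recall === 0) return "retrieval_miss";
  if (citationAccuracy(input) < 1) return "citation_miss";
  if (recall < 1) return "warning";
  return "pass";
}

// Hypothetical ids, for illustration only.
const exampleLabel = classify({
  retrievedIds: ["alarm-ref#3", "sop#1"],
  expectedIds: ["alarm-ref#3"],
  citedIds: ["wiring#2"],
});
console.log(exampleLabel);
```

Checking retrieval before citations means a run that never surfaced the right evidence is not mislabeled as a citation problem.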

References

Self-RAG: https://arxiv.org/abs/2310.11511
Corrective RAG: https://arxiv.org/abs/2401.15884
GraphRAG: https://arxiv.org/abs/2404.16130
RAGChecker: https://arxiv.org/abs/2408.08067
OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
