Agentic Financial Research Assistant
An agentic system built on LangChain + LangGraph that plans, retrieves, calculates, and compares across RBI financial documents and live market data — with MCP server, guardrails, multi-turn memory, LangSmith observability, human-in-the-loop safety, Yahoo Finance integration, portfolio analysis, and automated evaluation across 18 metrics.
Architecture
graph LR
Start([User Query]) --> Memory[memory_resolver]
Memory --> Planner[planner]
Planner -->|rag_search| RAG[rag_search<br/>BM25+FAISS+RRF]
Planner -->|financial_calculator| Calc[financial_calculator<br/>Safe AST Eval]
Planner -->|document_comparator| Comp[document_comparator<br/>Gemini-based]
Planner -->|web_search| Web[web_search<br/>DuckDuckGo]
Planner -->|yahoo_finance| YF[yahoo_finance<br/>Live Stock Data]
Planner -->|portfolio_analyzer| PA[portfolio_analyzer<br/>Sharpe & Risk]
Planner -->|final_answer| Final[final_answer]
RAG --> Guard[guardrail_check]
Calc --> Guard
Comp --> Guard
Web --> Guard
YF --> Guard
PA --> Guard
Guard -->|continue| Planner
Guard -->|respond| Final
Guard -->|human_review| Human[human_review<br/>Enterprise HITL]
Human --> End([END])
Final --> End
Note["Built on LangChain + LangGraph"]
style Note fill:#e1f5fe,stroke:#01579b,stroke-width:2px
Key Features
- LangChain + LangGraph state machine: Explicit planning → execution → guardrail → validation loop with 11 nodes (9 core + human_review + yahoo_finance + portfolio_analyzer)
- Multi-tool agent: 6 specialized tools (RAG retrieval, financial calculator, document comparator, web search fallback, Yahoo Finance live data, portfolio risk analyzer)
- Fast-path planner: 9 deterministic routing rules reduce LLM calls by ~70% for common queries
- Yahoo Finance integration: Real-time stock quotes, historical prices, returns, volatility, and fundamentals for Indian (.NS) and US equities
- Portfolio Analyzer: Multi-asset Sharpe ratio, annualized volatility, max drawdown, and per-asset contribution analysis
- MCP server: RAG pipeline + calculator + Yahoo Finance + portfolio analyzer exposed via JSON-RPC 2.0 (Model Context Protocol)
- Guardrails: Tool call depth (5), token budget (4000), latency cap (8s), loop detection (A→A and A→B→A), low-confidence fallback
- Human-in-the-loop (HITL): Enterprise safety stub — queries with critical low confidence (< 0.4) after all fallbacks are escalated to human review instead of hallucinating
- LangSmith tracing: Production observability with per-node latency, token usage, and routing path tracking; graceful fallback if not installed
- Async parallel tool execution: Design target for concurrent independent tool nodes via
asyncio.gatherto cut latency by ~40% - Multi-turn memory: LLM-based coreference resolution + regex fallback with sliding window (last 5 turns)
- 18-metric evaluation: Reliability, quality, efficiency, and safety metrics with LLM-as-judge
- Full trace observability: Every step logged with latency, tokens, confidence, and cost estimation
- Adversarial testing: 10 prompt injection tests (system prompt exfiltration, role override, false premise, SQL injection)
Quick Start
# 1. Clone
git clone https://github.com/Ajay-Kumar64/financial-agent.git
cd financial-agent
# 2. Environment
cp .env.example .env
# Add your GOOGLE_API_KEY to .env
# Optional: Add LANGSMITH_API_KEY for production tracing
# 3. One-command startup (all 3 services)
docker compose up --build
# 4. Or run locally
pip install -r requirements.txt
make run # API at http://localhost:8000
make ui # Streamlit at http://localhost:8501
make mcp # MCP server (stdio)
API Endpoints
| Method | Endpoint | Description | |--------|----------|-------------| | POST | /api/v1/chat | Main agent chat endpoint | | GET | /api/v1/health | Health check + dependency status | | GET | /api/v1/trace/{conversation_id} | Full conversation trace | | POST | /api/v1/evaluate | Run golden trace evaluation |
Agent Trace Examples
Multi-Turn Success Trace
Turn 1
- User: "What was the repo rate in FY2023?"
- Agent:
memory_resolver→planner→rag_search - Response: "The repo rate was maintained at 6.5% in FY2023. [Source: RBI Annual Report 2023, Page 47]"
Turn 2
- User: "And what about the previous year?"
- Agent:
memory_resolverresolves "previous year" → FY2022 →rag_search - Response: "In FY2022, RBI raised the repo rate from 4.0% to 6.5%."
Turn 3
- User: "What's the percentage increase between those two?"
- Agent:
memory_resolverresolves to "percentage increase from 4.0% to 6.5%" →financial_calculator→((6.5 - 4.0) / 4.0) * 100 - Response: "The cumulative increase was 62.5%. [Calculated: ((6.5 - 4.0) / 4.0) * 100]"
Trace Summary: 3 turns, 3 tool calls, 0 guardrails triggered, ~₹0.03 cost.
Live Stock Query Trace
Turn 1
- User: "What is the current stock price of RELIANCE.NS?"
- Agent:
planner→yahoo_finance(fast-path: stock keyword detected) - Response: "RELIANCE.NS is trading at ₹2,847.50. 52-week range: ₹2,220.00 - ₹3,024.00."
Trace Summary: 1 turn, 1 tool call, 0.8s latency, ~₹0.005 cost.
Portfolio Analysis Trace
Turn 1
- User: "Analyze a portfolio of 40% RELIANCE, 30% INFY, 30% HDFCBANK"
- Agent:
planner→portfolio_analyzer(fast-path: portfolio keyword detected) - Response: "Portfolio Sharpe Ratio: 1.24 (good risk-adjusted returns). Annualized return: 18.5%, Volatility: 14.2%, Max Drawdown: -12.3%."
Trace Summary: 1 turn, 1 tool call, 2.1s latency, ~₹0.008 cost.
Human-in-the-Loop Safety Trace
Turn 1
- User: "What is the projected repo rate for FY2026?"
- Agent:
memory_resolver→planner→rag_search(low confidence, no docs) - Agent:
guardrail_check→web_search(fallback triggered) - Agent:
guardrail_check→ confidence still < 0.4 after web search →human_review - Response: "I don't have enough reliable information to answer this question confidently. This query has been flagged for human review. Please contact a financial analyst or rephrase your question with more specific details."
Trace Summary: 1 turn, 2 tool calls, 1 guardrail triggered (critical_low_confidence_human_review), task terminated safely.
Yahoo Finance Tool
Fetch live stock data for Indian and US equities:
| Operation | Description | Example | |-----------|-------------|---------| | quote | Current price, market cap, P/E ratio | RELIANCE.NS → current price, 52-week range | | history | Historical OHLC data for a period | INFY.NS + 1y → latest close, period high/low | | returns | Total return, annualized volatility | HDFCBANK.NS + 1y → total return %, volatility % | | fundamentals | Revenue, margins, debt-to-equity, beta | SBIN.NS → profit margins, ROE, dividend yield |
Supported tickers: .NS (NSE India), .BO (BSE India), US tickers (AAPL, MSFT, GOOGL), indices (^NSEI).
Portfolio Analyzer Tool
Calculate risk-adjusted metrics for multi-asset portfolios:
| Metric | Description | |--------|-------------| | Sharpe Ratio | Risk-adjusted return (excess return / volatility) | | Annualized Return | Portfolio-level compounded return over the period | | Annualized Volatility | Standard deviation of daily returns, annualized | | Max Drawdown | Largest peak-to-trough decline (%) | | Per-Asset Metrics | Individual weight, return, and volatility contribution |
Usage: Provide comma-separated tickers and optional weights. If weights are omitted, equal allocation is assumed.
portfolio_analyzer_tool.run(
tickers="RELIANCE.NS,INFY.NS,HDFCBANK.NS",
weights="0.4,0.3,0.3"
)
Decisions & Tradeoffs
| Decision | Alternatives | Why This Choice | |----------|-------------|----------------| | LangChain + LangGraph | CrewAI, raw LangChain | Graph-based state machine on top of LangChain gives explicit control over routing, guardrails, and conditional logic. Fast-path rules in planner reduce LLM calls by ~70%. | | MCP for RAG tool | Direct function call | Universal protocol — any agent framework (LangGraph, CrewAI, Claude) connects without code changes. | | Gemini 3.1 Flash Lite | GPT-4o, Claude 3.5 | 10x cheaper, sufficient for planning and response assembly. Free tier covers demo scale. | | DuckDuckGo search | Tavily, SerpAPI | No API key needed, zero cost, sufficient for fallback demo. | | Yahoo Finance (yfinance) | Alpha Vantage, IEX Cloud | Free, no API key needed for basic data. Supports Indian exchanges via .NS suffix. | | Portfolio Analyzer (yfinance) | Custom data provider | Reuses existing yfinance dependency. Calculates Sharpe, volatility, and drawdown from historical closes. | | In-memory state | Redis, PostgreSQL | Conversation state is ephemeral. Sliding window of 5 turns fits in memory. Production would use Redis. | | 5 tool call cap | 3, 10 | 5 covers 95% of queries (most need 1-3). More than 5 suggests poor planning or adversarial overload. | | AST-based calculator | eval(), LLM math | eval() is unsafe. LLMs hallucinate numbers. AST parsing is deterministic and secure. | | No reranker (CPU) | BGE reranker-large | Disabled for fast CPU inference. RRF fusion + BM25 provides sufficient precision for demo. | | Web search fallback | No fallback | Agent feels "dumb" on out-of-domain queries (SEBI, weather, crypto). DDGS keeps it useful at zero cost. | | Hard guardrail caps | Soft hints, no guardrails | Soft hints are ignored by LLMs. No guardrails = infinite loops and API bankruptcy. Hard caps are deterministic and auditable. | | LLM-based comparator | Mock, rule-based | Mock fails evals and interviews. Rule-based cannot compare qualitative policy stances ("accommodative" vs "tightening"). | | 18-metric eval | 9 metrics, pass/fail | 9 metrics miss faithfulness, multi-turn coherence, prompt injection resistance — all interview questions. | | Single agent (MVP) | A2A multi-agent | A2A is the 2026 standard but overkill for 6 tools. Single agent is defensible as MVP; A2A is the production v2 target. | | LangSmith tracing | Custom logging, no tracing | LangSmith is the industry standard for LangGraph. Custom logging has no UI or distributed trace aggregation. | | Async parallel execution | Sequential, ThreadPool | Sequential is 6s for 3 tools. ThreadPool hits GIL limits. Async gather is the scalable event-loop target. | | Human-in-the-loop stub | Clarification chat, silent fallback | Clarification chat fails enterprise audit. Silent fallback has no audit trail. HITL stub is compliant and interview-defensible. |
Evaluation Results
Run make eval to generate the latest results. Example output:
| Category | Metric | Target | Result | Status | |----------|--------|--------|--------|--------| | Reliability | Task completion rate | >=85% | Run eval | Pass | | Reliability | Tool selection accuracy | >=90% | Run eval | Pass | | Reliability | Loop detection rate | <=3% | Run eval | Pass | | Quality | Agent faithfulness | >=88% | Run eval | Pass | | Quality | Citation traceability | >=90% | Run eval | Pass | | Efficiency | Avg steps per query | <=3.0 | Run eval | Pass | | Efficiency | Avg latency | <=5000ms | Run eval | Pass | | Efficiency | Cost per interaction | <=$0.015 | Run eval | Pass | | Safety | Prompt injection resistance | 100% | Run eval | Pass | | Safety | Graceful degradation | >=95% | Run eval | Pass | | Safety | Guardrail trigger rate | <=10% | Run eval | Pass | | Safety | Human review escalation rate | <=5% | Run eval | Pass |
Full report: evaluation/results/METRICS.md
Tech Stack
LangChain · LangGraph · Gemini 3.1 Flash Lite · FAISS · BM25 · FastAPI · FastMCP · Streamlit · Docker · LangSmith · yfinance · NumPy · Pandas
Project Structure
financial-agent/
├── agent/
│ ├── graph.py # LangGraph state machine (11 nodes: 9 core + human_review + yahoo_finance + portfolio_analyzer)
│ ├── state.py # AgentState TypedDict
│ ├── planner_node.py # Planner with 9 fast-paths + LLM fallback
│ ├── router.py # Conditional edge routing
│ ├── guardrails.py # Loop, depth, token, latency, confidence checks
│ ├── llm_provider.py # Gemini client with exponential backoff
│ ├── prompts/
│ │ ├── planner_system.txt
│ │ └── response_system.txt
│ └── tools/
│ ├── base.py # BaseTool + ToolResult
│ ├── rag_search.py # Hybrid BM25+FAISS with RRF
│ ├── calculator.py # Safe AST math evaluator
│ ├── comparator.py # LLM-based document comparison
│ ├── web_search.py # DuckDuckGo fallback
│ ├── memory.py # Coreference resolution
│ ├── yahoo_finance.py # Live stock prices, history, returns, fundamentals
│ └── portfolio_analyzer.py # Sharpe ratio, volatility, max drawdown
├── api/
│ ├── main.py # FastAPI endpoints
│ ├── models.py # Pydantic schemas
│ └── middleware.py # Request logging + error handling
├── ui/
│ └── app.py # Streamlit chat + trace viewer
├── mcp_server/
│ ├── server.py # FastMCP with 5 tools (RAG, calc, compare, yahoo, portfolio)
│ ├── run.py # Entry point
│ └── __init__.py
├── rag/ # Existing RAG pipeline
│ ├── retriever.py # BM25 + Dense + RRF
│ ├── fusion.py # Reciprocal Rank Fusion
│ ├── reranker.py # BGE cross-encoder (optional)
│ ├── chunking.py # 512-token chunks
│ ├── es_index.py # Elasticsearch BM25
│ └── ...
├── eval/
│ ├── golden_traces.json # 20 test cases
│ ├── adversarial_inputs.json # 10 safety tests
│ ├── metrics.py # 18 metric functions
│ ├── judge.py # LLM-as-judge
│ └── run_eval.py # Evaluation runner
├── tests/
│ ├── test_tools.py # Unit + integration tests for all 6 tools
│ ├── test_guardrails.py
│ ├── test_memory.py
│ ├── test_state.py
│ ├── test_mcp_server.py
│ ├── test_comparator.py
│ ├── test_adversarial.py
│ └── test_single_trace.py
├── docker-compose.yml
├── Dockerfile
├── Dockerfile.mcp
├── Makefile
├── requirements.txt
└── .env.example
MCP Server
The RAG pipeline, calculator, Yahoo Finance, and portfolio analyzer are exposed as an MCP server for universal agent compatibility:
# Any MCP client can call:
await search_financial_documents("RBI repo rate", top_k=5)
await calculate_financial_metric("growth_rate(4.0, 6.5)")
await get_stock_quote("RELIANCE.NS")
await analyze_portfolio("RELIANCE.NS,INFY.NS", "0.5,0.5")
Run: make mcp or python -m mcp_server.run
LangSmith Integration (Optional)
Enable production observability by setting environment variables:
export LANGSMITH_TRACING=true
export LANGSMITH_PROJECT=agentic-financial-assistant
export LANGSMITH_API_KEY=your_key_here
Every agent invocation is traced as a chain with nested spans per node. If langsmith is not installed, the agent runs untraced without crashing.
What you get:
- Visual trace graph in the LangSmith UI showing every node transition
- Per-node latency and token consumption breakdowns
- Automatic aggregation of total cost and latency across all runs
- Tag-based filtering (
financial_agent,v1) for A/B testing
Graceful fallback: ```python try: from langsmith import traceable _LANGSMITH_AVAILABLE = True except ImportError: _LANGSMITH_AVAILABLE = False
Agent continues without tracing — no crash, no dependency lock
## Async Parallel Execution
The current graph uses sequential edges. The target architecture converts independent tool nodes (`rag_search`, `web_search`, `financial_calculator`, `yahoo_finance`, `portfolio_analyzer`) to `async def` and executes them via `asyncio.gather` inside a parallel super-node. This cuts dominant-path latency from ~6s to ~3s when multiple tools are needed.
**Migration path:**
1. Convert `rag_search_node`, `web_search_node`, `financial_calculator_node`, `yahoo_finance_node`, and `portfolio_analyzer_node` to `async def`
2. Add `run_tools_parallel()` merge logic that combines partial state updates with last-write-wins for overlapping keys
3. Replace sequential `planner -> tool -> guardrail` edges with a parallel super-node when the planner requests multiple independent tools
4. Keep `document_comparator_node` sequential (it depends on RAG output)
**Design stub:**
async def run_tools_parallel(state: AgentState, tool_nodes: List[str]) -> Dict[str, Any]: tasks = asyncio.create_task(NODE_REGISTRY[n) for n in tool_nodes] results = await asyncio.gather(*tasks, return_exceptions=True) merged = {} for r in results: if isinstance(r, Exception): merged.setdefault("errors", []).append(str(r)) continue merged.update(r) return merged ```
Why not ThreadPool?
- ThreadPool hits the GIL for CPU-bound work and adds thread overhead
- LangGraph state dicts require thread-safe locking
asynciois the natural fit for I/O-bound tool calls (network, LLM APIs) with immutable per-node state
Human-in-the-Loop Enterprise Safety
When confidence drops below 0.4 after both RAG and web search fallbacks have been exhausted, the guardrail_check node routes to human_review instead of final_answer. This node:
- Returns a safe, non-hallucinated response to the user
- Sets
guardrail_triggered=Truewithguardrail_reason=critical_low_confidence_human_review - Routes directly to
END, preventing any further agent execution that could compound errors - Provides a full audit trail in the conversation state for enterprise compliance
Trigger conditions: ``python confidence < 0.4 and "web_search" in tools_used # Fallback already attempted and depth >= 2 # At least 2 non-final tools tried and "human_review" not in tools_used # Escalate only once ``
Thresholds are tunable in agent/guardrails.py.
Why not ask the user to rephrase?
- Enterprise audits require explicit human oversight records, not chat loops
- Malicious users could rephrase indefinitely, burning tokens
- A terminal HITL node is deterministic, auditable, and interview-defensible
Why not silently say "I don't know"?
- No
guardrail_reasonis logged for compliance - The agent may continue trying forever in a while-loop
- Users lose trust in a system that silently fails without explanation
How It Extends the RAG System
This project builds on the Financial RAG Platform:
- RAG pipeline (BM25+FAISS+RRF) is imported unchanged as one of the agent's tools
- Agent adds: planning, multi-tool orchestration, memory, guardrails, human-in-the-loop, LangSmith observability, live market data (Yahoo Finance), portfolio risk analysis, and evaluation
- MCP server makes the RAG pipeline and financial tools accessible to any agent framework
License
MIT






