Agentic RAG

Overview

Agentic RAG elevates standard RAG by introducing autonomous LLM agents into the retrieval pipeline. Instead of a fixed “retrieve → augment → generate” sequence, an agentic system can reason about what to retrieve, decide when to retrieve, evaluate retrieval quality, and iteratively refine its approach until it has sufficient context to answer.

The core shift is from passive retrieval (blindly fetch top-k documents) to active retrieval (strategically gather information based on emerging needs). This transforms RAG from a static pipeline into a dynamic, goal-oriented process.

Core Concept

Why Standard RAG Can Fail on Complex Tasks

Naive RAG makes several assumptions that may break down in practice:

Assumption	Reality
Top-k semantic similarity = relevance	Semantic similarity $\neq$ ground truth relevance
Single retrieval is enough	Complex questions require Multi-hop Reasoning
Retrieved docs are useful	Some retrievals add noise, not signal
Query as-is is optimal	Original query often needs transformation

Agentic RAG addresses these failures by giving the LLM agency over the retrieval process itself.

The Agent Loop

The fundamental pattern in Agentic RAG is an iterative agent loop:

While (not confident_enough or has_info_gaps):
    1. REASON: What do I know? What's missing?
    2. PLAN: What tool/query should I use next?
    3. ACT: Execute retrieval, web search, API call, etc.
    4. OBSERVE: Evaluate the results
    5. REFLECT: Is this relevant? Do I need more?

Final: SYNTHESIZE and generate response

This loop enables the system to:

Skip retrieval when LLM parametric knowledge suffices
Perform multiple retrievals when one is not enough
Discard irrelevant results and try alternative queries
Route queries to different knowledge sources based on intent
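The loop above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: `run_agent_loop`, `AgentState`, and the three callables (`retrieve`, `is_sufficient`, `next_query`) are hypothetical names standing in for whatever retriever and LLM calls a real system would use, and `max_iterations` is an assumed safety cap.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    question: str
    context: List[str] = field(default_factory=list)
    steps: int = 0


def run_agent_loop(
    question: str,
    retrieve: Callable[[str], List[str]],            # hypothetical retriever
    is_sufficient: Callable[[str, List[str]], bool],  # hypothetical LLM check
    next_query: Callable[[str, List[str]], str],      # hypothetical LLM planner
    max_iterations: int = 5,                          # assumed safety cap
) -> AgentState:
    """Iterate reason -> plan -> act -> observe -> reflect until done."""
    state = AgentState(question)
    while state.steps < max_iterations:
        # REASON / REFLECT: stop as soon as the gathered context is enough.
        if is_sufficient(question, state.context):
            break
        # PLAN: choose the next query based on what is still missing.
        query = next_query(question, state.context)
        # ACT + OBSERVE: run retrieval and keep only passages not seen yet.
        for passage in retrieve(query):
            if passage not in state.context:
                state.context.append(passage)
        state.steps += 1
    # SYNTHESIZE would consume state.context to generate the final answer.
    return state
```

If `is_sufficient` returns true on the first check, the loop performs no retrieval at all, which corresponds to the "skip retrieval" case.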

Key Agentic Capabilities

1. Adaptive Retrieval (When to Retrieve)

Unlike standard RAG, which always retrieves, agentic systems decide dynamically:

Skip retrieval: For factual questions within LLM training data
Retrieve once: For straightforward knowledge-base queries
Retrieve multiple times: For multi-hop questions
Trigger web search: When internal knowledge is stale or missing

Query: "What is 2 + 2?"
Agent: [No retrieval needed - parametric knowledge sufficient]

Query: "What are our company's Q4 revenue targets?"
Agent: [Retrieve from internal docs - proprietary info]

Query: "Who founded the company that acquired Twitter?"
Agent: [Multi-hop retrieval needed]
    Hop 1: Retrieve "Twitter acquisition"
    Hop 2: Retrieve "[acquirer] founders"

2. Query Planning and Transformation

Agents can decompose complex queries or rewrite them for better retrieval:

Query Decomposition: Break “Compare revenue growth of Apple and Microsoft” into parallel sub-queries
Query Expansion: Add synonyms or related terms to improve recall
Query Refinement: After initial retrieval, generate more specific follow-up queries
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer and embed it as the query
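As an illustration of query decomposition, the sketch below runs one retrieval per sub-query in parallel. It assumes the sub-queries have already been produced (in practice by an LLM); `retrieve_for_subqueries` and the `retrieve` callable are hypothetical names, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def retrieve_for_subqueries(
    sub_queries: List[str],
    retrieve: Callable[[str], List[str]],  # hypothetical retriever
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Run one retrieval per sub-query in parallel, keyed by sub-query."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs results correctly.
        results = list(pool.map(retrieve, sub_queries))
    return dict(zip(sub_queries, results))
```

For the example above, the sub-queries might be "Apple revenue growth" and "Microsoft revenue growth", with the comparison left to the generation step.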

3. Self-Reflection and Critique

The agent evaluates its own process and outputs:

Relevance Check: “Is this retrieved document actually relevant?”
Sufficiency Check: “Do I have enough information to answer?”
Consistency Check: “Do retrieved documents contradict each other?”
Faithfulness Check: “Is my generated answer grounded in the context?”
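One way to act on these four checks is to map each failed check to a corrective action. The sketch below is an assumption about how such a policy could look: the `Reflection` fields would be filled in by LLM judgments, and the action names and their priority order are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    relevant: bool    # Relevance Check
    sufficient: bool  # Sufficiency Check
    consistent: bool  # Consistency Check
    faithful: bool    # Faithfulness Check


def next_action(r: Reflection) -> str:
    """Map the first failed check to a corrective action (illustrative policy)."""
    if not r.relevant:
        return "rewrite_query"
    if not r.sufficient:
        return "retrieve_more"
    if not r.consistent:
        return "resolve_conflict"
    if not r.faithful:
        return "regenerate"
    return "answer"
```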

4. Tool Use

Agents orchestrate multiple tools beyond vector retrieval:

Tool Type	Examples
Vector Search	Query embeddings against Vector Databases
Keyword Search	BM25, Elasticsearch for exact matches
Web Search	Google for real-time information
SQL/API	Database queries, REST API calls
Code Execution	Run calculations, data transformations
Knowledge Graph	GraphRAG traversal for entity relationships
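A common way to expose such tools to an agent is a name-to-callable registry. The following is a minimal sketch under that assumption; `ToolRegistry` is a hypothetical class, and real tools would wrap a vector store, search API, or database client rather than plain functions.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]


class ToolRegistry:
    """Hypothetical registry mapping tool names to callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        self._tools[name] = tool

    def call(self, name: str, tool_input: str) -> str:
        if name not in self._tools:
            # Return the error as an observation so the agent can re-plan.
            return f"error: unknown tool '{name}'"
        return self._tools[name](tool_input)
```

Returning an error string instead of raising lets the agent treat a bad tool choice as one more observation in its loop.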

5. Memory

Agents maintain state across interactions:

Short-term: Conversation history, retrieved contexts in current session
Long-term: Learned preferences, frequently accessed documents, user profile
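The two memory tiers can be sketched as a bounded window for the session plus counters that outlive it. This is an in-memory illustration only; `AgentMemory` and its method names are hypothetical, and a real long-term store would persist to a database.

```python
from collections import Counter, deque
from typing import Deque, List, Tuple


class AgentMemory:
    """Hypothetical two-tier memory: session window plus access counts."""

    def __init__(self, short_term_turns: int = 10) -> None:
        # Short-term: bounded window over the current conversation.
        self.history: Deque[Tuple[str, str]] = deque(maxlen=short_term_turns)
        # Long-term: document access counts (kept in memory in this sketch).
        self.doc_hits: Counter = Counter()

    def record_turn(self, role: str, text: str) -> None:
        self.history.append((role, text))

    def record_access(self, doc_id: str) -> None:
        self.doc_hits[doc_id] += 1

    def frequent_docs(self, n: int = 3) -> List[str]:
        return [doc_id for doc_id, _ in self.doc_hits.most_common(n)]
```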

Architecture Patterns

Agentic Router

A central LLM agent routes queries to appropriate tools or knowledge bases based on intent classification.

When to Use:

Multiple knowledge sources with distinct domains
Clear routing signals in query (e.g., “search the web for…” vs “check our docs for…”)
Cost optimization (avoid unnecessary retrievals)

Example Routing Logic:

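def route_query(query: str) -> str:
    # classify_intent and the four handlers below are assumed to be defined elsewhere.
    intent = classify_intent(query)  # LLM classifies intent

    if intent == "factual_lookup":
        return vector_rag(query)
    elif intent == "real_time_info":
        return web_search(query)
    elif intent == "structured_data":
        return sql_query(query)
    elif intent == "no_retrieval_needed":
        return llm_direct(query)
    # Unknown intent: fail loudly instead of silently returning None.
    raise ValueError(f"Unrecognized intent: {intent!r}")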

Self-RAG (Self-Reflective RAG)

Must read: https://selfrag.github.io/

The model learns to retrieve, generate, and critique through special reflection tokens. This approach requires fine-tuning the LLM itself.

Token	Purpose
`[Retrieve]` / `[No Retrieve]`	Should I fetch external knowledge?
`[ISREL]`	Is this passage relevant to the query?
`[ISSUP]`	Is my generation supported by the evidence?
`[ISUSE]`	Is this output useful to the user?

Self-RAG Flow:

Query → LLM decides: [Retrieve] or [No Retrieve]?

If [Retrieve]:
  → Fetch passages
  → For each passage: LLM generates [ISREL] score
  → Filter irrelevant passages
  → Generate response with remaining context
  → LLM generates [ISSUP] score (faithfulness)
  → LLM generates [ISUSE] score (utility)
  → Select best response or iterate

Self-RAG internalizes the retrieval decision and quality evaluation into the model itself, rather than relying on external heuristics.
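The filter-then-select steps of the flow can be illustrated with plain scores. This is a simplified sketch, not the Self-RAG decoding procedure: it assumes each candidate response already carries numeric `isrel`, `issup`, and `isuse` scores, and the threshold and weights are illustrative values, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    isrel: float  # relevance of the passage behind this response
    issup: float  # how well the response is supported by the passage
    isuse: float  # usefulness of the response to the user


def select_response(
    candidates: List[Candidate],
    rel_threshold: float = 0.5,  # assumed cutoff
    w_sup: float = 1.0,          # assumed weight
    w_use: float = 0.5,          # assumed weight
) -> Optional[Candidate]:
    """Drop candidates built on irrelevant passages, then pick the best score."""
    kept = [c for c in candidates if c.isrel >= rel_threshold]
    if not kept:
        return None  # caller may iterate or answer without retrieval
    return max(kept, key=lambda c: w_sup * c.issup + w_use * c.isuse)
```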

Trade-offs:

Requires fine-tuning (not just prompting)
More computationally expensive at training time
Offers finer-grained control at inference time
Reported by the Self-RAG authors to outperform standard RAG on factuality benchmarks

For an implementation walkthrough, see https://www.blog.langchain.com/agentic-rag-with-langgraph/

Corrective RAG (CRAG)

Uses a lightweight retrieval evaluator to assess document quality and trigger corrective actions.

Confidence Classifications:

Correct: High relevance → Extract and refine key knowledge strips
Ambiguous: Uncertain → Use both retrieved docs and web search
Incorrect: Low relevance → Discard and fall back to web search

Knowledge Refinement (for “Correct” docs):

Decompose document into knowledge strips (sentence-level)
Score relevance of each strip
Filter irrelevant strips
Recompose into focused internal knowledge
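The classification and refinement steps can be sketched as two small functions. The thresholds, the sentence splitting on periods, and the `score_strip` callable are all assumptions made for illustration; CRAG itself uses a trained evaluator to produce the scores.

```python
from typing import Callable


def classify_confidence(score: float, upper: float = 0.7, lower: float = 0.3) -> str:
    """Map an evaluator score to a CRAG action class (thresholds are assumed)."""
    if score >= upper:
        return "correct"
    if score <= lower:
        return "incorrect"
    return "ambiguous"


def refine_knowledge(
    document: str,
    score_strip: Callable[[str], float],  # hypothetical strip-level scorer
    min_score: float = 0.5,               # assumed cutoff
) -> str:
    """Decompose into sentence-level strips, filter, and recompose."""
    # Naive split on periods; a real system would use a sentence tokenizer.
    strips = [s.strip() for s in document.split(".") if s.strip()]
    kept = [s for s in strips if score_strip(s) >= min_score]
    return ". ".join(kept) + ("." if kept else "")
```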

Multi-Agent RAG

Distributes the RAG workflow across specialized agents that collaborate on complex tasks.

                    ┌─────────────────┐
                    │   User Query    │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Planner Agent  │
                    │(decompose task) │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
   ┌───────────┐       ┌───────────┐       ┌───────────┐
   │ Retriever │       │ Retriever │       │ Retriever │
   │  Agent 1  │       │  Agent 2  │       │  Agent 3  │
   └─────┬─────┘       └─────┬─────┘       └─────┬─────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  Critic Agent   │
                    │ (evaluate/rank) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Writer Agent   │
                    │  (synthesize)   │
                    └─────────────────┘

Agent	Responsibility
Planner	Decompose query into sub-tasks, assign to retrievers
Retriever(s)	Execute searches against different sources/queries
Extractor	Parse and summarize retrieved documents (not shown in the diagram above)
Critic	Evaluate quality, identify gaps, request re-retrieval
Writer	Synthesize final response from verified information

When to Use:

Complex research tasks requiring multiple sources
High-stakes applications requiring quality gates
Workflows benefiting from specialization
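The planner, retriever, critic, and writer roles can be wired together as plain callables. This is a sequential sketch under simplifying assumptions: `run_pipeline` is a hypothetical function, sub-tasks are assigned to retrievers round-robin, and the critic only filters findings rather than requesting re-retrieval.

```python
from typing import Callable, List


def run_pipeline(
    query: str,
    plan: Callable[[str], List[str]],              # Planner agent
    retrievers: List[Callable[[str], List[str]]],  # Retriever agents
    critic: Callable[[str, List[str]], List[str]], # Critic agent
    writer: Callable[[str, List[str]], str],       # Writer agent
) -> str:
    """Plan -> retrieve per sub-task -> critique -> write."""
    sub_tasks = plan(query)
    findings: List[str] = []
    for i, task in enumerate(sub_tasks):
        # Round-robin assignment of sub-tasks to retrievers (an assumption).
        retriever = retrievers[i % len(retrievers)]
        findings.extend(retriever(task))
    verified = critic(query, findings)
    return writer(query, verified)
```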

Comparison of Patterns

Pattern	Complexity	Latency	Cost	Best For
Router	Low	Low	Low	Multi-source routing
Self-RAG	High (training)	Medium	Medium	Factuality-critical apps
CRAG	Medium	Medium	Medium	Quality-gated retrieval
Multi-Agent	High	High	High	Complex research tasks

Production Considerations

Latency and Cost

Agentic RAG introduces additional latency and cost compared to Naive RAG:

Component	Naive RAG	Agentic RAG
Retrieval Calls	1	1-5+ (depends on complexity)
LLM Calls	1	2-10+ (reasoning steps)
Typical Latency	200-400ms	500ms-3s+
Cost per Query	Baseline	2-10x baseline

Mitigation Strategies:

Set maximum iteration limits
Use cheaper models for routing/evaluation
Cache intermediate results
Parallelize independent retrievals
Implement early stopping when confidence is high
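Caching is the simplest of these mitigations to illustrate. The sketch below wraps a stand-in retrieval function with `functools.lru_cache`; `_expensive_retrieve`, `cached_retrieve`, and the `BACKEND_CALLS` counter are hypothetical names used only to show that repeated queries skip the backend.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Counter used only to show how often the backend is actually hit.
BACKEND_CALLS: Dict[str, int] = {"count": 0}


def _expensive_retrieve(query: str) -> Tuple[str, ...]:
    """Stand-in for a real vector-store or web-search call."""
    BACKEND_CALLS["count"] += 1
    return (f"passage for {query}",)


@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> Tuple[str, ...]:
    # Tuples are returned so cached results cannot be mutated by callers.
    return _expensive_retrieve(query)
```

An exact-match cache like this only helps when the agent issues identical query strings, so normalizing queries before lookup may raise the hit rate.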

Use Agentic RAG when:

Questions frequently require multi-step reasoning
Users expect conversational, exploratory interactions
Multiple heterogeneous knowledge sources exist
Retrieval quality varies and requires evaluation
Domain requires high factuality with citations

Don’t use Agentic RAG when:

Queries are simple, single-hop factual lookups
Latency SLAs are strict (<500ms)
Cost is a primary constraint
Your retrieval system is already high quality
You don’t have the infrastructure for agentic orchestration

Common Pitfalls

1. Infinite Loops: The agent keeps retrieving without converging on an answer.

Fix: Set max_iterations, implement confidence thresholds

2. Over-Retrieval: The agent retrieves for simple questions that don’t need it.

Fix: Train/prompt for “no retrieval” decisions

Observability

For production agentic RAG, track:

Iterations per query: Distribution of hop counts
Tool usage patterns: Which tools are used most?
Latency breakdown: Time per reasoning step, per retrieval
Retrieval relevance per hop: Does quality degrade over iterations?
Agent decisions: Log [Retrieve]/[No Retrieve] decisions
Fallback rate: How often does the agent loop stop early (for example, by hitting the iteration limit) and fall back to a default response?
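A per-query trace object is one way to collect several of these metrics. The sketch below is an assumption about what such a record could contain; `QueryTrace`, `summarize`, and the field names are hypothetical, and a production system would more likely emit these through a tracing or metrics library.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class QueryTrace:
    """Hypothetical per-query record of agent behaviour."""
    iterations: int = 0
    tool_calls: Dict[str, int] = field(default_factory=dict)
    step_latencies_ms: List[float] = field(default_factory=list)
    hit_iteration_limit: bool = False

    def record_step(self, tool: str, latency_ms: float) -> None:
        self.iterations += 1
        self.tool_calls[tool] = self.tool_calls.get(tool, 0) + 1
        self.step_latencies_ms.append(latency_ms)


def summarize(traces: List[QueryTrace]) -> Dict[str, float]:
    """Aggregate traces into mean hop count and iteration-limit rate."""
    return {
        "mean_iterations": mean(t.iterations for t in traces),
        "limit_hit_rate": sum(t.hit_iteration_limit for t in traces) / len(traces),
    }
```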

Implementation Frameworks

Framework	Agentic RAG Support
LangChain	Agents, tools, ReAct implementation
LlamaIndex	QueryEngine agents, SubQuestionQueryEngine
CrewAI	Multi-agent orchestration
AutoGen	Conversational multi-agent patterns
Semantic Kernel	Planners and plugins for agentic flows


Aayush's ML & AI Notes
