Overview

Agentic RAG elevates standard RAG by introducing autonomous LLM agents into the retrieval pipeline. Instead of a fixed “retrieve → augment → generate” sequence, an agentic system can reason about what to retrieve, decide when to retrieve, evaluate retrieval quality, and iteratively refine its approach until it has sufficient context to answer.

The core shift is from passive retrieval (blindly fetch top-k documents) to active retrieval (strategically gather information based on emerging needs). This transforms RAG from a static pipeline into a dynamic, goal-oriented process.

Core Concept

Why Standard RAG Can Fail for Complex Tasks

Naive RAG makes several assumptions that may break down in practice:

| Assumption | Reality |
| --- | --- |
| Top-k semantic similarity = relevance | Semantic similarity ≠ ground-truth relevance |
| Single retrieval is enough | Complex questions require multi-hop reasoning |
| Retrieved docs are useful | Some retrievals add noise, not signal |
| Query as-is is optimal | The original query often needs transformation |

Agentic RAG addresses these failures by giving the LLM agency over the retrieval process itself.

The Agent Loop

The fundamental pattern in Agentic RAG is an iterative agent loop:

While (not confident_enough or has_info_gaps):
    1. REASON: What do I know? What's missing?
    2. PLAN: What tool/query should I use next?
    3. ACT: Execute retrieval, web search, API call, etc.
    4. OBSERVE: Evaluate the results
    5. REFLECT: Is this relevant? Do I need more?

Final: SYNTHESIZE and generate response

This loop enables the system to:

  • Skip retrieval when LLM parametric knowledge suffices
  • Perform multiple retrievals when one is not enough
  • Discard irrelevant results and try alternative queries
  • Route queries to different knowledge sources based on intent
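The loop above can be sketched as a small driver function. `plan_step`, `act_step`, and `is_sufficient` are hypothetical stand-ins for the model and tool calls; this is a minimal sketch under those assumptions, not a production implementation:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    context: list = field(default_factory=list)  # evidence gathered so far
    iterations: int = 0

def agent_loop(question, plan_step, act_step, is_sufficient, max_iterations=5):
    """Generic reason-plan-act-observe-reflect loop.

    plan_step(state)     -> next action (e.g. a query or tool-call spec)
    act_step(action)     -> observation (retrieved docs, API result, ...)
    is_sufficient(state) -> True once enough context has been gathered
    """
    state = AgentState(question=question)
    while state.iterations < max_iterations and not is_sufficient(state):
        action = plan_step(state)           # REASON + PLAN
        observation = act_step(action)      # ACT
        state.context.append(observation)   # OBSERVE (REFLECT lives in is_sufficient)
        state.iterations += 1
    return state  # the caller synthesizes the final answer from state.context
```

The `max_iterations` cap is what prevents the loop from running forever when the sufficiency check never fires.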

Key Agentic Capabilities

1. Adaptive Retrieval (When to Retrieve)

Unlike standard RAG which always retrieves, agentic systems decide dynamically:

  • Skip retrieval: For factual questions within LLM training data
  • Retrieve once: For straightforward knowledge-base queries
  • Retrieve multiple times: For multi-hop questions
  • Trigger web search: When internal knowledge is stale or missing

Query: "What is 2 + 2?"
Agent: [No retrieval needed - parametric knowledge sufficient]

Query: "What are our company's Q4 revenue targets?"
Agent: [Retrieve from internal docs - proprietary info]

Query: "Who founded the company that acquired Twitter?"
Agent: [Multi-hop retrieval needed]
    Hop 1: Retrieve "Twitter acquisition"
    Hop 2: Retrieve "[acquirer] founders"
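A multi-hop chain like the Twitter example can be wired up as below. `retrieve` and `next_query` are hypothetical stand-ins; in practice an LLM inspects the evidence gathered so far and either proposes the follow-up query or signals that the chain is complete:

```python
def multi_hop(question, retrieve, next_query, max_hops=3):
    """Chain retrievals: each hop's result seeds the next query.

    retrieve(query)               -> list of passages
    next_query(question, evidence) -> follow-up query, or None when done
    """
    query, evidence = question, []
    for _ in range(max_hops):
        evidence.extend(retrieve(query))
        query = next_query(question, evidence)
        if query is None:  # the chain has resolved (e.g. acquirer found)
            break
    return evidence
```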

2. Query Planning and Transformation

Agents can decompose complex queries or rewrite them for better retrieval:

  • Query Decomposition: Break “Compare revenue growth of Apple and Microsoft” into parallel sub-queries
  • Query Expansion: Add synonyms or related terms to improve recall
  • Query Refinement: After initial retrieval, generate more specific follow-up queries
  • HyDE: Generate hypothetical answer to embed as query
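As a concrete sketch of decomposition, a comparative question can be split into one sub-query per entity. In a real system an LLM proposes the split; the explicit `metric` and `entities` arguments here just keep the example self-contained:

```python
def decompose_comparison(metric, entities):
    """Split a comparative question into one sub-query per entity.

    "Compare revenue growth of Apple and Microsoft" becomes
    ["revenue growth of Apple", "revenue growth of Microsoft"].
    """
    return [f"{metric} of {entity}" for entity in entities]
```

Each sub-query is then retrieved independently (often in parallel) and the results are merged before generation.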

3. Self-Reflection and Critique

The agent evaluates its own process and outputs:

  • Relevance Check: “Is this retrieved document actually relevant?”
  • Sufficiency Check: “Do I have enough information to answer?”
  • Consistency Check: “Do retrieved documents contradict each other?”
  • Faithfulness Check: “Is my generated answer grounded in the context?”
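The relevance check can be approximated with a prompted grader even without a fine-tuned model. `call_llm` here is an assumed text-in/text-out model call, not a specific API:

```python
RELEVANCE_PROMPT = """You are grading retrieval quality.
Question: {question}
Passage: {passage}
Answer strictly 'yes' or 'no': is the passage relevant to the question?"""

def grade_relevance(question, passage, call_llm):
    """Binary relevance check; call_llm is any text-in/text-out model call."""
    verdict = call_llm(RELEVANCE_PROMPT.format(question=question, passage=passage))
    return verdict.strip().lower().startswith("yes")

def filter_relevant(question, passages, call_llm):
    """Keep only the passages the grader judges relevant."""
    return [p for p in passages if grade_relevance(question, p, call_llm)]
```

The sufficiency, consistency, and faithfulness checks follow the same pattern with different prompts.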

4. Tool Use

Agents orchestrate multiple tools beyond vector retrieval:

| Tool Type | Examples |
| --- | --- |
| Vector Search | Query embeddings against vector databases |
| Keyword Search | BM25, Elasticsearch for exact matches |
| Web Search | Google for real-time information |
| SQL/API | Database queries, REST API calls |
| Code Execution | Run calculations, data transformations |
| Knowledge Graph | GraphRAG traversal for entity relationships |

5. Memory

Agents maintain state across interactions:

  • Short-term: Conversation history, retrieved contexts in current session
  • Long-term: Learned preferences, frequently accessed documents, user profile

Architecture Patterns

Agentic Router

A central LLM agent routes queries to appropriate tools or knowledge bases based on intent classification.

When to Use:

  • Multiple knowledge sources with distinct domains
  • Clear routing signals in query (e.g., “search the web for…” vs “check our docs for…”)
  • Cost optimization (avoid unnecessary retrievals)

Example Routing Logic:

def route_query(query: str) -> str:
    # An LLM classifies the query's intent; classify_intent and the
    # downstream handlers are assumed to be defined elsewhere.
    intent = classify_intent(query)

    if intent == "factual_lookup":
        return vector_rag(query)
    elif intent == "real_time_info":
        return web_search(query)
    elif intent == "structured_data":
        return sql_query(query)
    elif intent == "no_retrieval_needed":
        return llm_direct(query)
    else:
        # Unrecognized intent: fall back to vector retrieval
        # rather than silently returning None.
        return vector_rag(query)

Self-RAG (Self-Reflective RAG)

Must read: https://selfrag.github.io/

The model learns to retrieve, generate, and critique through special reflection tokens. This approach requires fine-tuning the LLM itself.

| Token | Purpose |
| --- | --- |
| [Retrieve] / [No Retrieve] | Should I fetch external knowledge? |
| [ISREL] | Is this passage relevant to the query? |
| [ISSUP] | Is my generation supported by the evidence? |
| [ISUSE] | Is this output useful to the user? |

Self-RAG Flow:

Query → LLM decides: [Retrieve] or [No Retrieve]?

If [Retrieve]:
  → Fetch passages
  → For each passage: LLM generates [ISREL] score
  → Filter irrelevant passages
  → Generate response with remaining context
  → LLM generates [ISSUP] score (faithfulness)
  → LLM generates [ISUSE] score (utility)
  → Select best response or iterate

Self-RAG internalizes the retrieval decision and quality evaluation into the model itself, rather than relying on external heuristics.

Trade-offs:

  • Requires fine-tuning (not just prompting)
  • More computationally expensive at training time
  • Offers finer-grained control at inference time
  • Outperforms standard RAG on factuality benchmarks

See https://www.blog.langchain.com/agentic-rag-with-langgraph/ for an implementation walkthrough.

Corrective RAG (CRAG)

Uses a lightweight retrieval evaluator to assess document quality and trigger corrective actions.

Confidence Classifications:

  • Correct: High relevance → Extract and refine key knowledge strips
  • Ambiguous: Uncertain → Use both retrieved docs and web search
  • Incorrect: Low relevance → Discard and fall back to web search

Knowledge Refinement (for “Correct” docs):

  1. Decompose document into knowledge strips (sentence-level)
  2. Score relevance of each strip
  3. Filter irrelevant strips
  4. Recompose into focused internal knowledge
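The confidence-to-action mapping might look like the sketch below. The thresholds are illustrative, not the values from the CRAG paper, and `score_fn` stands in for the lightweight retrieval evaluator:

```python
def corrective_action(docs, score_fn, upper=0.7, lower=0.3):
    """Map evaluator confidence to CRAG's three corrective actions.

    score_fn(doc) -> relevance score in [0, 1]. The best-scoring
    document decides the action for the whole retrieval batch.
    """
    scores = [score_fn(doc) for doc in docs]
    best = max(scores, default=0.0)  # no docs at all counts as a miss
    if best >= upper:
        return "correct"    # refine retrieved knowledge strips
    if best <= lower:
        return "incorrect"  # discard and fall back to web search
    return "ambiguous"      # combine retrieval with web search
```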

Multi-Agent RAG

Distributes the RAG workflow across specialized agents that collaborate on complex tasks.

                    ┌─────────────────┐
                    │   User Query    │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Planner Agent  │
                    │(decompose task) │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
   ┌───────────┐       ┌───────────┐       ┌───────────┐
   │ Retriever │       │ Retriever │       │ Retriever │
   │  Agent 1  │       │  Agent 2  │       │  Agent 3  │
   └─────┬─────┘       └─────┬─────┘       └─────┬─────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  Critic Agent   │
                    │ (evaluate/rank) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Writer Agent   │
                    │  (synthesize)   │
                    └─────────────────┘

| Agent | Responsibility |
| --- | --- |
| Planner | Decompose query into sub-tasks, assign to retrievers |
| Retriever(s) | Execute searches against different sources/queries |
| Extractor | Parse and summarize retrieved documents |
| Critic | Evaluate quality, identify gaps, request re-retrieval |
| Writer | Synthesize final response from verified information |

When to Use:

  • Complex research tasks requiring multiple sources
  • High-stakes applications requiring quality gates
  • Workflows benefiting from specialization
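The pipeline in the diagram reduces to a short orchestration function once each agent is modeled as a plain callable. All roles here are hypothetical stand-ins, and the Extractor is folded into the retrievers for brevity:

```python
def multi_agent_rag(query, plan, retrievers, critic, writer):
    """Planner -> Retrievers -> Critic -> Writer, each a plain callable.

    plan(query)              -> list of sub-tasks (one per retriever, for simplicity)
    critic(query, evidence)  -> the subset of evidence that passes review
    writer(query, approved)  -> final synthesized answer
    """
    subtasks = plan(query)
    evidence = [retrieve(task) for task, retrieve in zip(subtasks, retrievers)]
    approved = critic(query, evidence)
    return writer(query, approved)
```

A production system would add the Critic's "request re-retrieval" feedback edge, which turns this straight pipeline back into a loop.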

Comparison of Patterns

| Pattern | Complexity | Latency | Cost | Best For |
| --- | --- | --- | --- | --- |
| Router | Low | Low | Low | Multi-source routing |
| Self-RAG | High (training) | Medium | Medium | Factuality-critical apps |
| CRAG | Medium | Medium | Medium | Quality-gated retrieval |
| Multi-Agent | High | High | High | Complex research tasks |

Production Considerations

Latency and Cost

Agentic RAG introduces additional latency and cost compared to Naive RAG:

| Component | Naive RAG | Agentic RAG |
| --- | --- | --- |
| Retrieval Calls | 1 | 1-5+ (depends on complexity) |
| LLM Calls | 1 | 2-10+ (reasoning steps) |
| Typical Latency | 200-400ms | 500ms-3s+ |
| Cost per Query | Baseline | 2-10x baseline |

Mitigation Strategies:

  • Set maximum iteration limits
  • Use cheaper models for routing/evaluation
  • Cache intermediate results
  • Parallelize independent retrievals
  • Implement early stopping when confidence is high
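Two of these mitigations, caching and parallelization, can be sketched with the standard library; `retrieve` is a hypothetical backend call:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

def make_cached_retriever(retrieve):
    """In-process cache for repeated identical queries across iterations.

    A multi-instance deployment would typically use a shared external
    cache instead of lru_cache.
    """
    @lru_cache(maxsize=4096)
    def cached(query):
        return tuple(retrieve(query))  # tuple keeps the cached value immutable
    return cached

def parallel_retrieve(retriever, queries, max_workers=8):
    """Run independent retrievals concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(retriever, queries))
```

Note that `lru_cache` does not deduplicate concurrent in-flight calls for the same key; a production cache would add request coalescing.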

Use Agentic RAG when:

  • Questions frequently require multi-step reasoning
  • Users expect conversational, exploratory interactions
  • Multiple heterogeneous knowledge sources exist
  • Retrieval quality varies and requires evaluation
  • Domain requires high factuality with citations

Don’t use Agentic RAG when:

  • Queries are simple, single-hop factual lookups
  • Latency SLAs are strict (<500ms)
  • Cost is a primary constraint
  • Your retrieval system is already high quality
  • You don’t have the infrastructure for agentic orchestration

Common Pitfalls

1. Infinite Loops: The agent keeps retrieving without converging on an answer.

  • Fix: Set max_iterations and implement confidence thresholds

2. Over-Retrieval: The agent retrieves for simple questions that don’t need it.

  • Fix: Train/prompt for “no retrieval” decisions

Observability

For production agentic RAG, track:

  • Iterations per query: Distribution of hop counts
  • Tool usage patterns: Which tools are used most?
  • Latency breakdown: Time per reasoning step, per retrieval
  • Retrieval relevance per hop: Does quality degrade over iterations?
  • Agent decisions: Log [Retrieve]/[No Retrieve] decisions
  • Fallback rate: How often does the agent loop terminate early?
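A minimal structured-logging helper covers most of the metrics above. The field names are illustrative, and `logger` is any callable that accepts a dict (e.g. a thin wrapper around your logging stack):

```python
import time

def log_agent_step(logger, query_id, step, tool, started, payload=None):
    """Emit one structured record per reasoning step.

    Aggregating these records yields iterations per query, tool usage
    patterns, and a latency breakdown per step.
    """
    record = {
        "query_id": query_id,
        "step": step,
        "tool": tool,
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
    }
    record.update(payload or {})  # e.g. {"decision": "[Retrieve]"}
    logger(record)
```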

Implementation Frameworks

| Framework | Agentic RAG Support |
| --- | --- |
| LangChain | Agents, tools, ReAct implementation |
| LlamaIndex | QueryEngine agents, SubQuestionQueryEngine |
| CrewAI | Multi-agent orchestration |
| AutoGen | Conversational multi-agent patterns |
| Semantic Kernel | Planners and plugins for agentic flows |
