Overview

Naive RAG represents the standard, baseline architecture for Retrieval-Augmented Generation. It follows a strictly linear process:

  1. Indexing documents into vector embeddings
  2. Retrieving the top-k most similar chunks based on a user query
  3. Feeding them directly into the LLM context window

It is “Naive” because it assumes that semantic similarity equals ground truth relevance, which is not always true. It forms the foundation upon which all advanced RAG techniques (Hybrid Search, Re-ranking, Agentic RAG) are built.

The approach makes several simplifying assumptions that break down in real-world scenarios:

  • Assumption 1: The top-k semantically similar chunks contain the answer. Reality: Semantic similarity ≠ relevance to the question.
  • Assumption 2: Chunks are self-contained. Reality: Information often spans multiple chunks or requires context from surrounding text.
  • Assumption 3: A single retrieval pass is sufficient. Reality: Complex questions require multi-hop reasoning across multiple retrievals.

Linear Pipeline

graph TD
    subgraph Indexing Phase
    Docs[Raw Documents] --> Loader[Document Loader]
    Loader --> Splitter[Text Splitter]
    Splitter --> Chunks[Text Chunks]
    Chunks --> Embed[Embedding Model]
    Embed --> Vectors[Vector Embeddings]
    Vectors --> DB[(Vector Store)]
    end

    subgraph Retrieval Phase
    Query[User Query] --> EmbedQuery[Embedding Model]
    EmbedQuery --> QueryVec[Query Vector]
    QueryVec --> Search[Similarity Search]
    DB --> Search
    Search -->|Top-K Chunks| Context[Retrieved Context]
    end

    subgraph Generation Phase
    Context --> Prompt[Prompt Template]
    Query --> Prompt
    Prompt --> LLM[LLM]
    LLM --> Answer[Generated Answer]
    end

The entire pipeline is deterministic except for the final LLM generation. Given the same query and the same indexed documents, you will always retrieve the same chunks.
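This determinism makes everything before generation easy to unit-test. A minimal end-to-end sketch in Python, with a toy hash-based embedding standing in for a real embedding model (all function names here are illustrative, not any specific library's API):

```python
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy deterministic embedding: hash character trigrams into buckets.
    # Stands in for a real embedding model purely for illustration.
    vec = [0.0] * dim
    for i in range(max(len(text) - 2, 1)):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by dot product with the query vector; keep the top k.
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: -sum(a * b for a, b in zip(qv, embed(c))))
    return ranked[:k]

chunks = [
    "RAG retrieves relevant chunks before generation",
    "LLMs generate fluent but sometimes ungrounded text",
    "Vector stores index embeddings for similarity search",
]
# Same query + same index => same retrieved chunks, every time.
first = retrieve("how does RAG retrieval work?", chunks)
second = retrieve("how does RAG retrieval work?", chunks)
assert first == second
```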

Pipeline Steps

Step 1: Document Loading & Parsing

Ingest raw data into a processable format.

  • Convert various file formats into plain text
  • Extract metadata (filename, page numbers, timestamps, authors)
  • Handle special content: tables, images (via OCR or multimodal models), code blocks

See Document Parsing for techniques like LlamaParse and Docling.
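A minimal sketch of this step, assuming plain-text files only (real loaders also handle PDFs, HTML, tables, and OCR for images; the record layout here is an illustrative assumption):

```python
from pathlib import Path
from datetime import datetime, timezone

def load_document(path: str) -> dict:
    """Read a plain-text file into a {text, metadata} record for the pipeline."""
    p = Path(path)
    return {
        "text": p.read_text(encoding="utf-8"),
        "metadata": {
            "filename": p.name,
            # Last-modified timestamp, normalized to UTC for later filtering.
            "modified": datetime.fromtimestamp(
                p.stat().st_mtime, tz=timezone.utc
            ).isoformat(),
        },
    }
```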

Step 2: Chunking (Text Splitting)

Breaking documents into smaller pieces. Poor chunking is one of the most common causes of RAG failure.

Why chunk?

  1. Context window limits: LLMs have finite input sizes
  2. Retrieval precision: Smaller, focused chunks match queries more precisely than whole documents
  3. Cost: More tokens = more money (retrieving entire documents is wasteful)

The Overlap Problem: Chunks should overlap to avoid losing context at boundaries. Typical overlap: 10-20% of chunk size.

Chunk 1: [----content A----][overlap]
Chunk 2:                    [overlap][----content B----]

Without overlap, a question about content spanning the boundary would fail to retrieve either chunk with high confidence.
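The sliding-window idea can be sketched as a character-level splitter (token-level splitters work the same way; the 40-character overlap here is 20% of the chunk size):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunking with overlap.
    step = chunk_size - overlap, so each chunk repeats the tail of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step) if text[i:i + chunk_size]]

doc = "".join(chr(65 + i % 26) for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=40)
# The last 40 characters of chunk 1 equal the first 40 of chunk 2.
assert chunks[0][-40:] == chunks[1][:40]
```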

See Chunking Strategies for detailed comparisons.

Step 3: Embedding

We convert text chunks into vector embeddings. See Embeddings for details.

Step 4: Vector Storage & Indexing

Embeddings are stored in a Vector Database optimized for similarity search at scale.

Challenge: Given a query vector q, find the k most similar vectors among potentially millions of stored vectors. This is done using Approximate Nearest Neighbor (ANN) algorithms.

See Vector Databases for details.

Metadata Storage: Store metadata alongside vectors for filtering: This enables queries like: “Find similar chunks from documents published after 2023”.
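A sketch of metadata-aware search over a toy in-memory store (the record layout and predicate interface are illustrative assumptions, not any specific database's API):

```python
# Toy records: each embedding is stored alongside filterable metadata.
records = [
    {"vec": [1.0, 0.0], "text": "2022 annual report", "meta": {"year": 2022}},
    {"vec": [0.9, 0.1], "text": "2024 annual report", "meta": {"year": 2024}},
]

def filtered_search(query_vec, records, k, predicate):
    # 1) Pre-filter on metadata, 2) rank the survivors by dot-product similarity.
    candidates = [r for r in records if predicate(r["meta"])]
    candidates.sort(key=lambda r: -sum(a * b for a, b in zip(query_vec, r["vec"])))
    return candidates[:k]

# "Find similar chunks from documents published after 2023":
hits = filtered_search([1.0, 0.0], records, k=1, predicate=lambda m: m["year"] > 2023)
# Only the post-2023 document is eligible, even though the 2022 one scores higher.
```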

Step 5: Retrieval

When a user submits a query, we find the most relevant chunks.

  1. Embed the query: q = Embed(query)
  2. Search the index: Find the top-k vectors closest to q
  3. Return chunks: Fetch the original text associated with each vector

| k value | Pros | Cons |
| --- | --- | --- |
| Small | Precise, less noise, cheaper | May miss relevant information |
| Large | Higher recall, more context | More noise, Lost in the Middle effect, expensive |

Pre-filter by metadata before the similarity search, so that the k nearest vectors returned are guaranteed to satisfy the metadata constraints.
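Putting the retrieval step together, a brute-force sketch with cosine similarity (production vector stores replace this linear scan with an ANN index; the data is illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product of the vectors over the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], index: list[dict], k: int = 2) -> list[str]:
    # Exact nearest-neighbour search by scoring every stored vector.
    ranked = sorted(index, key=lambda item: -cosine(query_vec, item["vec"]))
    return [item["text"] for item in ranked[:k]]

index = [
    {"vec": [1.0, 0.0, 0.0], "text": "chunk about cats"},
    {"vec": [0.0, 1.0, 0.0], "text": "chunk about dogs"},
    {"vec": [0.7, 0.7, 0.0], "text": "chunk about pets"},
]
results = top_k([1.0, 0.1, 0.0], index, k=2)
```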

Step 6: Generation

The retrieved chunks are assembled into a prompt and sent to the LLM.

A sample RAG Prompt Template:

You are a helpful assistant. Answer the user's question based ONLY on
the following context. If the answer cannot be found in the context,
say "I don't have enough information to answer this question."

Context:
{retrieved_chunks}

Question: {user_query}

Answer:
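Filling the template is a simple string operation. A sketch that numbers the chunks so a later citation step can refer to them (the helper name is illustrative):

```python
RAG_TEMPLATE = """You are a helpful assistant. Answer the user's question based ONLY on
the following context. If the answer cannot be found in the context,
say "I don't have enough information to answer this question."

Context:
{retrieved_chunks}

Question: {user_query}

Answer:"""

def build_prompt(chunks: list[str], query: str) -> str:
    # Number each chunk ([1], [2], ...) so answers can cite their sources.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return RAG_TEMPLATE.format(retrieved_chunks=context, user_query=query)

prompt = build_prompt(
    ["Paris is the capital of France.", "France is in Europe."],
    "What is the capital of France?",
)
```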

Prompt Considerations:

  • Grounding instruction - “Answer ONLY based on the context” reduces hallucination
  • Uncertainty handling - Telling the model what to do when context is insufficient
  • Request citations - “Cite the source document for each claim” improves verifiability

Context Window Management: If retrieved chunks exceed the context window:

  1. Summarize: Compress chunks before insertion
  2. Rerank and select: Use a Re-ranking model to keep only the most relevant
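The rerank-and-select option can be sketched as a greedy budget fit, approximating tokens as whitespace-separated words (a real system would use the model's tokenizer and a proper re-ranker's scores):

```python
def fit_to_budget(scored_chunks: list[tuple[float, str]], max_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks that fit within a rough token budget."""
    kept, used = [], 0
    # Visit chunks from most to least relevant.
    for score, text in sorted(scored_chunks, key=lambda c: -c[0]):
        cost = len(text.split())  # crude token estimate: word count
        if used + cost <= max_tokens:
            kept.append(text)
            used += cost
    return kept

scored = [
    (0.9, "highly relevant chunk"),
    (0.4, "weak chunk"),
    (0.8, "also relevant chunk"),
]
selected = fit_to_budget(scored, max_tokens=6)
```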

Practical Application

Naive RAG is sufficient for:

  • Prototyping: Quick POC to validate feasibility
  • Simple Q&A: Single-hop questions with answers contained in one chunk
  • Well-structured data: Clean documents with clear topic boundaries
  • Limited budget

When NOT to Use Naive RAG

  • Multi-hop reasoning: Questions requiring synthesis across multiple documents (use Multi-hop Reasoning)
  • High-precision domains: Legal, medical, financial where errors are costly (add Re-ranking)
  • Keyword-sensitive queries: Specific IDs, codes, names (add Hybrid Search with BM25)
  • Large-scale production: When retrieval quality directly impacts business metrics

Latency & Cost

Latency: Query embedding and vector search are generally fast; LLM generation dominates end-to-end latency in these pipelines.

Cost Drivers:

  1. Embedding API calls: Per-token pricing
  2. Vector DB hosting: Storage + queries
  3. LLM tokens: Input (retrieved chunks) + output (generated answer)

Potential Optimizations:

  • Cache frequently-asked query embeddings
  • Compress/summarize chunks to reduce LLM input tokens
  • Batch embedding calls during indexing
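The first optimization can be sketched as an in-memory cache keyed by the query string (the stand-in embedding function is illustrative; a real system would call the embedding API on a cache miss):

```python
import hashlib

_cache: dict[str, list[float]] = {}

def embed_uncached(text: str) -> list[float]:
    # Stand-in for a per-token-priced embedding API call.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def embed_cached(query: str) -> list[float]:
    # Repeated queries hit the cache and skip the paid API call entirely.
    if query not in _cache:
        _cache[query] = embed_uncached(query)
    return _cache[query]

v1 = embed_cached("capital of France")
v2 = embed_cached("capital of France")  # cache hit: same object, no API call
```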

Comparisons

Why “Naive” Falls Short

| Problem | Description | Advanced Solution |
| --- | --- | --- |
| Low Precision | Retrieved chunks might be semantically similar but not actually relevant to the question | Re-ranking with cross-encoders |
| Low Recall | Answer spans multiple chunks or uses different terminology | Hybrid Search, Query Transformations |
| Lost in the Middle | LLMs tend to ignore information in the middle of long contexts | Reorder chunks (important first/last), reduce k |
| Ambiguity | “Apple” the fruit vs. the company; “Python” the language vs. the snake | Metadata filtering, Contextual Retrieval |
| Multi-hop Failure | Questions like “Who founded the company that acquired X?” require chained lookups | Multi-hop Reasoning, Agentic RAG |
| Stale Data | Vector DB contains outdated information | Index refresh pipelines, timestamp filtering |

Resources

Tools & Libraries


Back to: 01 - RAG Index