Overview
Naive RAG represents the standard, baseline architecture for Retrieval-Augmented Generation. It follows a strictly linear process:
- Indexing documents into vector embeddings
- Retrieving the top-k most similar chunks based on a user query
- Feeding them directly into the LLM context window
It is “Naive” because it assumes that semantic similarity equals ground truth relevance, which is not always true. It forms the foundation upon which all advanced RAG techniques (Hybrid Search, Re-ranking, Agentic RAG) are built.
The approach makes several simplifying assumptions that break down in real-world scenarios:
- Assumption 1: The top-k semantically similar chunks contain the answer. Reality: Semantic similarity ≠ relevance to the question.
- Assumption 2: Chunks are self-contained. Reality: Information often spans multiple chunks or requires context from surrounding text.
- Assumption 3: A single retrieval pass is sufficient. Reality: Complex questions require multi-hop reasoning across multiple retrievals.
Linear Pipeline
```mermaid
graph TD
    subgraph Indexing Phase
        Docs[Raw Documents] --> Loader[Document Loader]
        Loader --> Splitter[Text Splitter]
        Splitter --> Chunks[Text Chunks]
        Chunks --> Embed[Embedding Model]
        Embed --> Vectors[Vector Embeddings]
        Vectors --> DB[(Vector Store)]
    end
    subgraph Retrieval Phase
        Query[User Query] --> EmbedQuery[Embedding Model]
        EmbedQuery --> QueryVec[Query Vector]
        QueryVec --> Search[Similarity Search]
        DB --> Search
        Search -->|Top-K Chunks| Context[Retrieved Context]
    end
    subgraph Generation Phase
        Context --> Prompt[Prompt Template]
        Query --> Prompt
        Prompt --> LLM[LLM]
        LLM --> Answer[Generated Answer]
    end
```
The entire pipeline is deterministic except for the final LLM generation. Given the same query and the same indexed documents, you will always retrieve the same chunks.
Pipeline Steps
Step 1: Document Loading & Parsing
Ingest raw data into a processable format.
- Convert various file formats into plain text
- Extract metadata (filename, page numbers, timestamps, authors)
- Handle special content: tables, images (via OCR or multimodal models), code blocks
TODO: See Document Parsing for techniques like LlamaParse and Docling.
Step 2: Chunking (Text Splitting)
Breaking documents into smaller pieces. Poor chunking is one of the most common causes of RAG failure.
Why chunk?
- Context window limits: LLMs have finite input sizes
- Retrieval precision: Smaller, focused chunks match a query more precisely than whole documents
- Cost: More tokens = more money (retrieving entire documents is wasteful)
The Overlap Problem: Chunks should overlap to avoid losing context at boundaries. Typical overlap: 10-20% of chunk size.
Chunk 1: [----content A----][overlap]
Chunk 2: [overlap][----content B----]
Without overlap, a question about content spanning the boundary would fail to retrieve either chunk with high confidence.
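The overlap scheme above can be sketched with a simple character-based splitter (a real pipeline would usually split on tokens or sentences; `chunk_text` and the sizes here are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks whose boundaries overlap by `overlap`
    characters (here 40/200 = 20% of the chunk size)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk reached the end of the text
        start += chunk_size - overlap  # next chunk starts `overlap` chars early
    return chunks
```

Because each step advances by `chunk_size - overlap`, the tail of every chunk is repeated at the head of the next one, so content near a boundary appears intact in at least one chunk.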
See Chunking Strategies for detailed comparisons.
Step 3: Embedding
We convert text chunks into vector embeddings. See Embeddings for details.
Step 4: Vector Storage & Indexing
Embeddings are stored in a Vector Database optimized for similarity search at scale.
Challenge: Given a query vector q, find the k most similar vectors among potentially millions of stored vectors. This is done using Approximate Nearest Neighbor (ANN) algorithms.
See Vector Databases for details.
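For intuition, this is the computation the index performs, shown here as an exact brute-force scan over a toy in-memory store. ANN indexes (HNSW, IVF, etc.) approximate this ranking to avoid scoring every vector; the function names are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    # Exact search: score every stored vector against the query, keep the k
    # best. ANN algorithms approximate this ranking in sub-linear time.
    ranked = sorted(store, key=lambda cid: cosine(query_vec, store[cid]), reverse=True)
    return ranked[:k]
```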
Metadata Storage: Store metadata alongside vectors for filtering. This enables queries like: “Find similar chunks from documents published after 2023”.
Step 5: Retrieval
When a user submits a query, we find the most relevant chunks.
- Embed the query: Convert the query text into a vector using the same embedding model used for indexing
- Search the index: Find the top-k vectors closest to the query vector
- Return chunks: Fetch the original text associated with each vector
| k value | Pros | Cons |
|---|---|---|
| Small | Precise, less noise, cheaper | May miss relevant information |
| Large | Higher recall, more context | More noise, Lost in the Middle effect, expensive |
Pre-filter by metadata before the similarity search, so that all k returned vectors are guaranteed to satisfy the metadata constraint.
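A minimal sketch of pre-filtering, assuming each record carries a hypothetical `year` metadata field and a stored vector; filtering happens before similarity scoring, so every result satisfies the constraint:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def filtered_search(query_vec, records, k, min_year):
    # 1) Pre-filter: drop records that fail the metadata predicate.
    candidates = [r for r in records if r["year"] > min_year]
    # 2) Similarity search runs only over the surviving candidates,
    #    so all k results are guaranteed to be metadata-relevant.
    candidates.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in candidates[:k]]
```

Filtering after the search instead (post-filtering) can return fewer than k usable results, because the k nearest vectors may all fail the predicate.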
Step 6: Generation
The retrieved chunks are assembled into a prompt and sent to the LLM.
A sample RAG Prompt Template:
You are a helpful assistant. Answer the user's question based ONLY on
the following context. If the answer cannot be found in the context,
say "I don't have enough information to answer this question."
Context:
{retrieved_chunks}
Question: {user_query}
Answer:
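Turning the template into an actual prompt is plain string substitution; numbering each chunk (an optional convention, not part of the template above) makes a “cite your sources” instruction easier for the model to follow:

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the user's question based ONLY on
the following context. If the answer cannot be found in the context,
say "I don't have enough information to answer this question."

Context:
{retrieved_chunks}

Question: {user_query}

Answer:"""

def build_prompt(chunks: list[str], query: str) -> str:
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return PROMPT_TEMPLATE.format(retrieved_chunks=context, user_query=query)
```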
Prompt Considerations:
- Grounding instruction - “Answer ONLY based on the context” reduces hallucination
- Uncertainty handling - Telling the model what to do when the context is insufficient
- Request citations - “Cite the source document for each claim” improves verifiability
Context Window Management: If retrieved chunks exceed the context window:
- Summarize: Compress chunks before insertion
- Rerank and select: Use a Re-ranking model to keep only the most relevant
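A minimal sketch of the rerank-and-select option, assuming the chunks arrive already sorted best-first by a re-ranker, and using a crude chars/4 estimate in place of a real tokenizer:

```python
def fit_to_budget(ranked_chunks: list[str], max_tokens: int = 1000) -> list[str]:
    # Keep the most relevant chunks (assumed sorted best-first) until the
    # budget is spent. len(text) // 4 is a rough chars-per-token heuristic,
    # not a real tokenizer.
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk) // 4
        if used + cost > max_tokens:
            break  # adding this chunk would overflow the context budget
        kept.append(chunk)
        used += cost
    return kept
```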
Practical Application
Naive RAG is sufficient for:
- Prototyping: Quick POC to validate feasibility
- Simple Q&A: Single-hop questions with answers contained in one chunk
- Well-structured data: Clean documents with clear topic boundaries
- Limited budget: No extra cost from re-ranking models or multiple retrieval passes
When NOT to Use Naive RAG
- Multi-hop reasoning: Questions requiring synthesis across multiple documents (use Multi-hop Reasoning)
- High-precision domains: Legal, medical, financial where errors are costly (add Re-ranking)
- Keyword-sensitive queries: Specific IDs, codes, names (add Hybrid Search with BM25)
- Large-scale production: When retrieval quality directly impacts business metrics
Latency & Cost
Latency: Query embedding and vector search are fast; LLM generation dominates end-to-end latency in these pipelines.
Cost Drivers:
- Embedding API calls: Per-token pricing
- Vector DB hosting: Storage + queries
- LLM tokens: Input (retrieved chunks) + output (generated answer)
Potential Optimizations:
- Cache frequently-asked query embeddings
- Compress/summarize chunks to reduce LLM input tokens
- Batch embedding calls during indexing
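The first optimization can be as simple as memoizing the query-embedding call. Here `embed_text` is a hypothetical stand-in for a real per-token-priced embedding API; the counter only exists to show that repeated queries hit the cache:

```python
import functools

api_calls = {"count": 0}

def embed_text(text: str) -> tuple[float, ...]:
    # Hypothetical stand-in for a paid embedding API call.
    api_calls["count"] += 1
    return tuple(float(ord(c)) for c in text)

@functools.lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple[float, ...]:
    # Repeated identical queries are served from the cache
    # instead of paying for another API call.
    return embed_text(query)
```

In production the cache would typically live in a shared store (e.g. Redis) rather than in-process memory, but the principle is the same.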
Comparisons
Why “Naive” Falls Short
| Problem | Description | Advanced Solution |
|---|---|---|
| Low Precision | Retrieved chunks might be semantically similar but not actually relevant to the question | Re-ranking with cross-encoders |
| Low Recall | Answer spans multiple chunks or uses different terminology | Hybrid Search, Query Transformations |
| Lost in the Middle | LLMs tend to ignore information in the middle of long contexts | Reorder chunks (important first/last), reduce k |
| Ambiguity | ”Apple” the fruit vs. the company; “Python” the language vs. the snake | Metadata filtering, Contextual Retrieval |
| Multi-hop Failure | Questions like “Who founded the company that acquired X?” require chained lookups | Multi-hop Reasoning, Agentic RAG |
| Stale Data | Vector DB contains outdated information | Index refresh pipelines, timestamp filtering |
Resources
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) - Original RAG paper
- Lost in the Middle: How Language Models Use Long Contexts
- LangChain: Retrieval
- OpenAI Cookbook: Question Answering using Embeddings
- LangChain RAG From Scratch
Tools & Libraries
- LangChain - Orchestration framework
- LlamaIndex - Data framework for LLM applications
- FAISS - Efficient similarity search
Back to: 01 - RAG Index